ROCm

Developer(s)	AMD
Initial release	November 14, 2016

Stable release	6.1.2 / June 4, 2024^[1]

Repository	Meta-repository github.com/ROCm/ROCm
Written in	C, C++, Python, Fortran, Julia
Middleware	HIP
Engine	AMDgpu kernel driver, HIPCC, an LLVM-based compiler
Operating system	Linux, Windows^[2]
Platform	Supported GPUs
Predecessor	Close to metal, Stream, HSA
Size	<2 GiB
Type	GPGPU libraries and APIs
License	MIT License
Website	www.amd.com/en/products/software/rocm.html

ROCm^[3] is an Advanced Micro Devices (AMD) software stack for graphics processing unit (GPU) programming. ROCm spans several domains: general-purpose computing on graphics processing units (GPGPU), high-performance computing (HPC), and heterogeneous computing. It offers several programming models: HIP (GPU-kernel-based programming), OpenMP/Message Passing Interface (MPI) (directive-based programming), and OpenCL.

ROCm is free, libre and open-source software (except the GPU firmware blobs^[4]), and it is distributed under various licenses. ROCm initially stood for Radeon Open Compute platform; however, because Open Compute is a registered trademark, ROCm is no longer an acronym and is simply the name of AMD's open-source stack for GPU compute.

Background

The first GPGPU software stack from ATI/AMD was Close to Metal, which became Stream.

ROCm was launched around 2016^[5] with the Boltzmann Initiative.^[6] The ROCm stack builds upon previous AMD GPU stacks; some tools trace back to GPUOpen and others to the Heterogeneous System Architecture (HSA).

Heterogeneous System Architecture Intermediate Language

HSAIL^[7] was aimed at producing a middle-level, hardware-agnostic intermediate representation that could be JIT-compiled to the eventual hardware (GPU, FPGA, etc.) using the appropriate finalizer. This approach was dropped for ROCm: it now builds only GPU code, using LLVM and its upstreamed AMDGPU backend,^[8] although there is still research on such enhanced modularity with LLVM MLIR.^[9]

Programming abilities

ROCm as a stack ranges from the kernel driver to the end-user applications. AMD has introductory videos about AMD GCN hardware,^[10] and ROCm programming^[11] via its learning portal.^[12]

One of the best technical introductions to the stack and to ROCm/HIP programming is, to date, still to be found on Reddit.^[13]

Hardware support

ROCm is primarily targeted at discrete professional GPUs,^[14] but unofficial support includes the Vega family and RDNA 2 consumer GPUs.

Accelerated processing units (APUs) are "enabled" but not officially supported; getting ROCm to work on them is involved.^[15]

Professional-grade GPUs

AMD Instinct accelerators are the first-class ROCm citizens, alongside the prosumer Radeon Pro GPU series: they mostly see full support.

The only consumer-grade GPU that has relatively equal support is, as of January 2022, the Radeon VII (GCN 5 - Vega).

Consumer-grade GPUs

Name of GPU series	Southern Islands	Sea Islands	Volcanic Islands	Arctic Islands/Polaris	Vega	Navi 1X	Navi 2X
Released	Jan 2012	Sep 2013	Jun 2015	Jun 2016	Jun 2017	Jul 2019	Nov 2020
Marketing Name	Radeon HD 7000	Radeon Rx 200	Radeon Rx 300	Radeon RX 400/500	Radeon RX Vega/Radeon VII(7 nm)	Radeon RX 5000	Radeon RX 6000
AMD support
Instruction set	GCN instruction set					RDNA instruction set
Microarchitecture	GCN 1st gen	GCN 2nd gen	GCN 3rd gen	GCN 4th gen	GCN 5th gen	RDNA	RDNA 2
Type	Unified shader model
ROCm^[16]				^[17]		^[18]
OpenCL	1.2 (on Linux: 1.1 (no Image support) with Mesa 3D)	2.0 (Adrenalin driver on Win7+) (on Linux: 1.1 (no Image support) with Mesa 3D, 2.0 with AMD drivers or AMD ROCm)				2.0	2.1^[19]
Vulkan	1.0 (Win 7+ or Mesa 17+)	1.2 (Adrenalin 20.1, Linux Mesa 3D 20.0)
Shader model	5.1	5.1 6.3			6.4		6.5
OpenGL	4.6 (on Linux: 4.6 (Mesa 3D 20.0))
Direct3D	11 (11_1) 12 (11_1)	11 (12_0) 12 (12_0)			11 (12_1) 12 (12_1)		11 (12_1) 12 (12_2)
`/drm/amdgpu`^[a]	Experimental^[20]

^ DRM (Direct Rendering Manager) is a component of the Linux kernel.

Software ecosystem

Learning resources

AMD ROCm product manager Terry Deem gave a tour of the stack.^[21]

Third-party integration

The main consumers of the stack are machine learning and high-performance computing/GPGPU applications.

Machine learning

Various deep learning frameworks have a ROCm backend:^[22]

PyTorch
TensorFlow
ONNX
MXNet
CuPy^[23]
MIOpen
Caffe
IREE (which uses LLVM Multi-Level Intermediate Representation (MLIR))
llama.cpp
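
Whether a locally installed framework was built against ROCm can usually be checked from the framework itself. The following is a minimal sketch for PyTorch only. It assumes PyTorch may or may not be installed, and it relies on `torch.version.hip`, which holds a version string on ROCm builds and is `None` otherwise. The helper name `describe_torch_backend` is hypothetical.

```python
def describe_torch_backend() -> str:
    """Report which GPU backend the local PyTorch build targets, if any."""
    try:
        import torch  # optional dependency; its absence is handled below
    except ImportError:
        return "PyTorch is not installed"

    # ROCm builds of PyTorch expose a HIP version string here;
    # on other builds the attribute is None.
    hip_version = getattr(torch.version, "hip", None)
    if hip_version:
        return f"PyTorch built for ROCm/HIP {hip_version}"

    # CUDA builds expose a CUDA version string instead.
    cuda_version = getattr(torch.version, "cuda", None)
    if cuda_version:
        return f"PyTorch built for CUDA {cuda_version}"

    return "PyTorch built without GPU support"


if __name__ == "__main__":
    print(describe_torch_backend())
```

On ROCm builds, PyTorch reuses its `torch.cuda` API for AMD GPUs, so code written against that API typically runs without source changes.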

Supercomputing

ROCm is gaining significant traction in the TOP500.^[24] ROCm is used with the exascale supercomputers El Capitan^[25]^[26] and Frontier.

Some related software can be found on AMD Infinity Hub.

Other acceleration & graphics interoperation

As of version 3.0, Blender can use HIP compute kernels for its Cycles renderer.^[27]

Other Languages

Julia

Julia has the AMDGPU.jl package,^[28] which integrates with LLVM and selected components of the ROCm stack. Instead of compiling code through HIP, AMDGPU.jl uses Julia's compiler to generate LLVM IR directly, which is later consumed by LLVM to generate native device code. AMDGPU.jl uses ROCr's HSA implementation to upload native code onto the device and execute it, similar to how HIP loads its own generated device code.

AMDGPU.jl also supports integration with ROCm's rocBLAS (for BLAS), rocRAND (for random number generation), and rocFFT (for FFTs). Future integration with rocALUTION, rocSOLVER, MIOpen, and certain other ROCm libraries is planned.

Software distribution

Official

Installation instructions are provided for Linux and Windows in the official AMD ROCm documentation. ROCm software is currently spread across several public GitHub repositories. Within the main public meta-repository, there is an XML manifest for each official release: using git-repo, a version control tool built on top of Git, is the recommended way to synchronize with the stack locally.^[29]

AMD has started distributing containerized applications for ROCm, notably scientific research applications gathered under AMD Infinity Hub.^[30]

AMD itself distributes packages tailored to various Linux distributions.

Further information: AMD Radeon Software

Third-party

There is a growing third-party ecosystem packaging ROCm.

Several Linux distributions officially package ROCm natively, at varying stages of completeness: Arch Linux,^[31] Gentoo,^[32] Debian, Fedora,^[33] GNU Guix, and NixOS.

Spack packages are also available.^[34]

Components

There is one kernel-space component, ROCk; the rest of the stack, roughly a hundred components, is made of user-space modules.

The unofficial typographic policy is to use uppercase ROC followed by lowercase letters for low-level libraries (e.g. ROCt), and the reverse for user-facing libraries (e.g. rocBLAS).^[35]
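
The casing convention can be expressed as a small heuristic. The sketch below is illustrative only: the function name `classify_rocm_component` is hypothetical, and because the policy is unofficial, real component names may not follow it.

```python
def classify_rocm_component(name: str) -> str:
    """Guess a component's layer from the unofficial casing convention."""
    if name == "ROCm":
        return "stack name"      # the stack itself, not a component
    if len(name) < 4:
        return "unknown"

    prefix, rest = name[:3], name[3:]
    if prefix == "ROC" and rest.islower():
        return "low-level"       # e.g. ROCt, ROCr, ROCk
    if prefix == "roc" and rest.isupper():
        return "user-facing"     # e.g. rocBLAS, rocFFT, rocSOLVER
    return "unknown"             # e.g. MIOpen, which uses neither prefix


if __name__ == "__main__":
    for component in ("ROCt", "rocBLAS", "MIOpen"):
        print(component, "->", classify_rocm_component(component))
```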

AMD is actively developing with the LLVM community, but upstreaming is not instantaneous and, as of January 2022, is still lagging.^[36] AMD still officially packages various LLVM forks^[37]^[38]^[9] for parts that are not yet upstreamed: compiler optimizations destined to remain proprietary, debug support, OpenMP offloading, etc.

Low-level

ROCk – Kernel driver

Main article: AMDgpu (Linux kernel module)

ROCm – Device libraries

Support libraries implemented as LLVM bitcode. These provide various utilities and functions for math operations, atomics, queries for launch parameters, on-device kernel launch, etc.

ROCt – Thunk

The thunk is the thin user-space layer through which the rest of the stack calls into the kernel driver; it is responsible for the calls and queuing that go into the stack.

ROCr – Runtime

The ROC runtime is a set of APIs/libraries that allows host applications to launch compute kernels. It is AMD's implementation of the HSA runtime API.^[39] It is different from the ROC Common Language Runtime.

ROCm – CompilerSupport

The ROCm code object manager is in charge of interacting with the LLVM intermediate representation.

Mid-level

ROCclr Common Language Runtime

The common language runtime is an indirection layer adapting calls to ROCr on Linux and PAL on Windows. It used to be able to route between different compilers, such as the HSAIL compiler. It is now being absorbed by the upper indirection layers (HIP and OpenCL).

OpenCL

Further information: OpenCL

ROCm ships its installable client driver (ICD) loader and an OpenCL^[40] implementation bundled together. As of January 2022, ROCm 4.5.2 ships OpenCL 2.2 and lags behind the competition.^[41]

HIP – Heterogeneous Interface for Portability

The AMD implementation for its GPUs is called HIPAMD. There is also a CPU implementation mostly for demonstration purposes.

HIPCC

HIP builds a `HIPCC` compiler that either wraps Clang and compiles with LLVM's open AMDGPU backend, or redirects to the NVIDIA compiler.^[42]
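
The wrapper's choice between the two toolchains can be sketched as a simple lookup. This is a simplified model, not HIPCC's actual logic: it assumes the target is selected through the `HIP_PLATFORM` environment variable with the values `amd` and `nvidia`, and the function name `select_backend_compiler` and the bare compiler names it returns are illustrative.

```python
import os
from typing import Optional

# Assumed mapping from HIP platform to the underlying compiler driver.
_COMPILERS = {
    "amd": "clang++",   # Clang front end with the LLVM AMDGPU backend
    "nvidia": "nvcc",   # Nvidia's CUDA compiler
}


def select_backend_compiler(platform: Optional[str] = None) -> str:
    """Return the compiler a HIPCC-like wrapper would delegate to."""
    if platform is None:
        # Fall back to the environment, defaulting to AMD hardware.
        platform = os.environ.get("HIP_PLATFORM", "amd")
    try:
        return _COMPILERS[platform]
    except KeyError:
        raise ValueError(f"unsupported HIP platform: {platform!r}") from None


if __name__ == "__main__":
    print(select_backend_compiler("amd"))
    print(select_backend_compiler("nvidia"))
```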

HIPIFY

HIPIFY is a source-to-source compiling tool. It translates CUDA to HIP and the reverse, using either a Clang-based tool or a sed-like Perl script.
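
The sed-like approach amounts to textual renaming of CUDA identifiers. The sketch below illustrates the idea in Python with a handful of well-known CUDA runtime names and their HIP counterparts. It is not the HIPIFY script itself; the table is deliberately tiny and the function name `hipify_text` is hypothetical.

```python
import re

# A few CUDA runtime names and their HIP equivalents (far from complete).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match whole identifiers only; longer names are listed first so that
# cudaMemcpyHostToDevice is preferred over its prefix cudaMemcpy.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def hipify_text(source: str) -> str:
    """Rename known CUDA identifiers in `source` to their HIP names."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], source)


if __name__ == "__main__":
    cuda_source = (
        "#include <cuda_runtime.h>\n"
        "cudaMalloc(&p, n);\n"
        "cudaMemcpy(p, h, n, cudaMemcpyHostToDevice);\n"
        "cudaFree(p);"
    )
    print(hipify_text(cuda_source))
```

A purely textual rewrite like this cannot understand context, such as macros or identifiers that merely look like CUDA names, which is one reason a Clang-based translator exists alongside the script.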

GPUFORT

Like HIPIFY, GPUFORT is a source-to-source translation tool: it allows users to migrate from CUDA Fortran to HIP Fortran. Even more than HIPIFY, it remains a research project.^[43]

High-level

ROCm high-level libraries are usually consumed directly by application software, such as machine learning frameworks. Most of the following libraries are in the General Matrix Multiply (GEMM) category, a workload at which GPU architectures excel.
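
For reference, GEMM denotes the operation C ← αAB + βC on matrices A, B and C with scalars α and β. The following pure-Python sketch shows only what is computed; it says nothing about how the ROCm libraries implement it, and the function name `gemm` and its argument order are chosen here for illustration.

```python
from typing import List

Matrix = List[List[float]]


def gemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha * (a @ b) + beta * c for row-major nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")

    # Every output element is an independent dot product, which is why
    # the operation maps well onto thousands of parallel GPU threads.
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(gemm(2.0, a, b, 3.0, c))  # [[41.0, 47.0], [89.0, 103.0]]
```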

The majority of these user-facing libraries come in two forms: hip for the indirection layer that can route to Nvidia hardware, and roc for the AMD implementation.^[44]
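
The hip/roc split can be pictured as a thin dispatch layer in front of vendor-specific implementations. The sketch below models that idea only. All names (`hip_axpy`, `roc_axpy`, `cu_axpy`) are hypothetical stand-ins, the real libraries are C/C++ APIs, and both "backends" here simply compute alpha * x + y on Python lists.

```python
from typing import Callable, Dict, List

Vector = List[float]


def roc_axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Stand-in for the AMD ("roc") implementation."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def cu_axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Stand-in for the Nvidia implementation reached through the hip layer."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


_BACKENDS: Dict[str, Callable[[float, Vector, Vector], Vector]] = {
    "amd": roc_axpy,
    "nvidia": cu_axpy,
}


def hip_axpy(alpha: float, x: Vector, y: Vector, platform: str = "amd") -> Vector:
    """Portable entry point: forward the call to the vendor backend."""
    try:
        backend = _BACKENDS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform!r}") from None
    return backend(alpha, x, y)


if __name__ == "__main__":
    print(hip_axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
```

Application code written against the portable entry point does not change when the platform does, which is the purpose of the indirection layer.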

rocBLAS / hipBLAS

rocBLAS and hipBLAS are central among the high-level libraries: rocBLAS is the AMD implementation of the Basic Linear Algebra Subprograms (BLAS). It uses the Tensile library internally.

rocSOLVER / hipSOLVER

This pair of libraries constitutes the LAPACK implementation for ROCm and is strongly coupled to rocBLAS.

Utilities

ROCm developer tools: Debug, tracer, profiler, System Management Interface, Validation suite, Cluster management.
GPUOpen tools: GPU analyzer, memory visualizer...
External tools: radeontop (TUI overview)

Comparison with competitors

ROCm competes with other GPU computing stacks: Nvidia CUDA and Intel oneAPI.

Nvidia CUDA

Main article: CUDA

Nvidia's CUDA is closed-source, whereas AMD ROCm is open source. There is open-source software built on top of the closed-source CUDA, for instance RAPIDS.

CUDA is able to run on consumer GPUs, whereas ROCm support is mostly offered for professional hardware such as AMD Instinct and AMD Radeon Pro.

Nvidia vendors the Clang frontend and its Parallel Thread Execution (PTX) LLVM GPU backend as the Nvidia CUDA Compiler (NVCC).

Intel OneAPI

Main article: OneAPI (compute acceleration)

Like ROCm, oneAPI is open source, and all the corresponding libraries are published on its GitHub page.

Unified Acceleration Foundation (UXL)

Unified Acceleration Foundation (UXL) is a new technology consortium working on the continuation of the oneAPI initiative, with the goal of creating a new open-standard accelerator software ecosystem, along with related open standards and specification projects, through Working Groups and Special Interest Groups (SIGs). The aim is to compete with Nvidia's CUDA. The main companies behind it are Intel, Google, ARM, Qualcomm, Samsung, Imagination, and VMware.^[45]
