AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA) proposed by Intel in July 2013, and first implemented in the 2016 Intel Xeon Phi x200 (Knights Landing),^{[1]} and then later in a number of AMD and other Intel CPUs (see list below). AVX-512 consists of multiple extensions that may be implemented independently.^{[2]} This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F (AVX-512 Foundation) is required by all AVX-512 implementations.
Besides widening most 256-bit instructions, the extensions introduce various new operations, such as new data conversions, scatter operations, and permutations.^{[2]} The number of AVX registers is increased from 16 to 32, and eight new "mask registers" are added, which allow for variable selection and blending of the results of instructions. In CPUs with the vector length (VL) extension—included in most AVX-512-capable processors (see § CPUs with AVX-512)—these instructions may also be used on the 128-bit and 256-bit vector sizes. AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first-generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.^{[1]}
The AVX-512 instruction set consists of several separate sets, each having its own unique CPUID feature bit; however, they are typically grouped by the processor generation that implements them.
The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. This led Intel to define a new prefix called EVEX.
Compared to VEX, EVEX adds the following benefits:^{[6]}
The extended registers, SIMD width bit, and opmask registers of AVX-512 are mandatory and all require support from the OS.
The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. However, the AVX-512VL extension allows the use of AVX-512 instructions on the 128/256-bit registers XMM/YMM, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix which allow access to new features such as the opmask and additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, making the distinction between VEX- and EVEX-encoded versions of an instruction ambiguous in the source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension (byte & word support).^{[6]}
Name | Extension sets | Registers | Types
Legacy SSE | SSE–SSE4.2 | xmm0–xmm15 | single floats. From SSE2: bytes, words, doublewords, quadwords and double floats.
AVX-128 (VEX) | AVX, AVX2 | xmm0–xmm15 | bytes, words, doublewords, quadwords, single floats and double floats.
AVX-256 (VEX) | AVX, AVX2 | ymm0–ymm15 | single floats and double floats. From AVX2: bytes, words, doublewords, quadwords.
AVX-128 (EVEX) | AVX-512VL | xmm0–xmm31 (k0–k7) | doublewords, quadwords, single floats and double floats. With AVX-512BW: bytes and words. With AVX-512FP16: half floats.
AVX-256 (EVEX) | AVX-512VL | ymm0–ymm31 (k0–k7) | doublewords, quadwords, single floats and double floats. With AVX-512BW: bytes and words. With AVX-512FP16: half floats.
AVX-512 (EVEX) | AVX-512F | zmm0–zmm31 (k0–k7) | doublewords, quadwords, single floats and double floats. With AVX-512BW: bytes and words. With AVX-512FP16: half floats.
Bits 511–256 | Bits 255–128 | Bits 127–0
ZMM0 | YMM0 | XMM0
ZMM1 | YMM1 | XMM1
ZMM2 | YMM2 | XMM2
ZMM3 | YMM3 | XMM3
ZMM4 | YMM4 | XMM4
ZMM5 | YMM5 | XMM5
ZMM6 | YMM6 | XMM6
ZMM7 | YMM7 | XMM7
ZMM8 | YMM8 | XMM8
ZMM9 | YMM9 | XMM9
ZMM10 | YMM10 | XMM10
ZMM11 | YMM11 | XMM11
ZMM12 | YMM12 | XMM12
ZMM13 | YMM13 | XMM13
ZMM14 | YMM14 | XMM14
ZMM15 | YMM15 | XMM15
ZMM16 | YMM16 | XMM16
ZMM17 | YMM17 | XMM17
ZMM18 | YMM18 | XMM18
ZMM19 | YMM19 | XMM19
ZMM20 | YMM20 | XMM20
ZMM21 | YMM21 | XMM21
ZMM22 | YMM22 | XMM22
ZMM23 | YMM23 | XMM23
ZMM24 | YMM24 | XMM24
ZMM25 | YMM25 | XMM25
ZMM26 | YMM26 | XMM26
ZMM27 | YMM27 | XMM27
ZMM28 | YMM28 | XMM28
ZMM29 | YMM29 | XMM29
ZMM30 | YMM30 | XMM30
ZMM31 | YMM31 | XMM31
The width of the SIMD register file is increased from 256 bits to 512 bits, and it is expanded from 16 to a total of 32 registers, ZMM0–ZMM31. These registers can be addressed as 256-bit YMM registers from the AVX extensions and as 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31 when using the EVEX-encoded form.
Most AVX-512 instructions may specify one of eight opmask registers (k0–k7). For instructions which use a mask register as an opmask, register k0 is special: it is a hardcoded constant used to indicate unmasked operations. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, k0 is a functioning, valid register. In most instructions, the opmask is used to control which values are written to the destination. A flag controls the opmask behavior, which can either be "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to the blend instructions.
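The merge and zero behaviors can be sketched in plain Python (a semantic model only, not real intrinsics; `apply_opmask` is a hypothetical helper operating on a small list of lanes):

```python
# Sketch (not real intrinsics): model AVX-512 merge- and zero-masking.
# Each lane of `result` is written only if its opmask bit is set;
# unselected lanes are either zeroed ({z}) or keep the destination value.

def apply_opmask(result, dest, mask, zeroing):
    """Return the lanes actually written, per AVX-512 masking rules."""
    out = []
    for i, r in enumerate(result):
        if (mask >> i) & 1:      # lane selected by the opmask
            out.append(r)
        elif zeroing:            # zero-masking: unselected lanes become 0
            out.append(0)
        else:                    # merge-masking: unselected lanes keep dest
            out.append(dest[i])
    return out

result = [10, 20, 30, 40]
dest = [1, 2, 3, 4]
mask = 0b0101                    # select lanes 0 and 2

assert apply_opmask(result, dest, mask, zeroing=True) == [10, 0, 30, 0]
assert apply_opmask(result, dest, mask, zeroing=False) == [10, 2, 30, 4]
```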
The opmask registers are normally 16 bits wide, but can be up to 64 bits wide with the AVX-512BW extension.^{[6]} How many of the bits are actually used, though, depends on the vector type of the instructions masked. For 32-bit single floats or doublewords, 16 bits are used to mask the 16 elements in a 512-bit register. For double floats and quadwords, at most 8 mask bits are used.
The opmask register is the reason why several bitwise instructions which naturally have no element width gained element-width variants in AVX-512. For instance, bitwise AND, OR and 128-bit shuffle now exist in both doubleword and quadword variants, with the only difference being in the final masking.
The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX-encoded. The initial opmask instructions are all 16-bit (word) versions. With AVX-512DQ, 8-bit (byte) versions were added to better match the needs of masking eight 64-bit values, and with AVX-512BW, 32-bit (doubleword) and 64-bit (quadword) versions were added so they can mask up to sixty-four 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.
Instruction | Extension set | Description
KAND | F | Bitwise logical AND masks
KANDN | F | Bitwise logical AND NOT masks
KMOV | F | Move from and to mask registers or general-purpose registers
KUNPCK | F | Unpack for mask registers
KNOT | F | NOT mask register
KOR | F | Bitwise logical OR masks
KORTEST | F | OR masks and set flags
KSHIFTL | F | Shift left mask registers
KSHIFTR | F | Shift right mask registers
KXNOR | F | Bitwise logical XNOR masks
KXOR | F | Bitwise logical XOR masks
KADD | BW/DQ | Add two masks
KTEST | BW/DQ | Bitwise comparison and set flags
Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, and old instructions that have been replaced with new AVX-512 versions. The new or heavily reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW, since those extensions merely add new versions of these instructions instead of new instructions.
There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare-into-mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.
Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.
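The blend-by-mask semantics can be sketched as follows (a hypothetical `blendm` helper modeling VPBLENDMD-style behavior on a 4-lane vector, not a real intrinsic):

```python
# Sketch of VPBLENDMD-style semantics: each opmask bit selects
# between the corresponding lanes of the two source vectors.

def blendm(src1, src2, mask):
    return [src2[i] if (mask >> i) & 1 else src1[i]
            for i in range(len(src1))]

assert blendm([1, 2, 3, 4], [9, 8, 7, 6], 0b1010) == [1, 8, 3, 6]
```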
Instruction | Extension set | Description
VBLENDMPD | F | Blend float64 vectors using opmask control
VBLENDMPS | F | Blend float32 vectors using opmask control
VPBLENDMD | F | Blend int32 vectors using opmask control
VPBLENDMQ | F | Blend int64 vectors using opmask control
VPBLENDMB | BW | Blend byte integer vectors using opmask control
VPBLENDMW | BW | Blend word integer vectors using opmask control
AVX-512F has four new compare instructions. Like their XOP counterparts, they use the immediate field to select between eight different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for these instructions, one to write to and one to declare regular masking.^{[6]}
Immediate | Comparison | Description
0 | EQ | Equal
1 | LT | Less than
2 | LE | Less than or equal
3 | FALSE | Set to zero
4 | NEQ | Not equal
5 | NLT | Greater than or equal
6 | NLE | Greater than
7 | TRUE | Set to one
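The immediate encodings above can be modeled in Python (a sketch, not real intrinsics; `vpcmp` is a hypothetical helper that packs per-lane comparison results into a mask integer):

```python
# Sketch of AVX-512 compare-into-mask (VPCMPD-style): the 3-bit
# immediate selects a predicate; per-lane results become mask bits.

PREDICATES = {
    0: lambda a, b: a == b,   # EQ
    1: lambda a, b: a < b,    # LT
    2: lambda a, b: a <= b,   # LE
    3: lambda a, b: False,    # FALSE
    4: lambda a, b: a != b,   # NEQ
    5: lambda a, b: a >= b,   # NLT
    6: lambda a, b: a > b,    # NLE
    7: lambda a, b: True,     # TRUE
}

def vpcmp(src1, src2, imm):
    mask = 0
    for i, (a, b) in enumerate(zip(src1, src2)):
        if PREDICATES[imm & 7](a, b):
            mask |= 1 << i
    return mask

a = [1, 5, 3, 7]
b = [1, 2, 3, 4]
assert vpcmp(a, b, 0) == 0b0101   # EQ: lanes 0 and 2
assert vpcmp(a, b, 6) == 0b1010   # NLE (greater than): lanes 1 and 3
```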
Instruction | Extension set | Description
VPCMPD, VPCMPUD | F | Compare signed/unsigned doublewords into mask
VPCMPQ, VPCMPUQ | F | Compare signed/unsigned quadwords into mask
VPCMPB, VPCMPUB | BW | Compare signed/unsigned bytes into mask
VPCMPW, VPCMPUW | BW | Compare signed/unsigned words into mask
The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND, and then set the destination opmask based on whether the result values are zero or nonzero. Note that, like the comparison instructions, these take two opmask registers, one as the destination and one as a regular opmask.
Instruction | Extension set | Description
VPTESTMD, VPTESTMQ | F | Logical AND and set mask for 32- or 64-bit integers.
VPTESTNMD, VPTESTNMQ | F | Logical NAND and set mask for 32- or 64-bit integers.
VPTESTMB, VPTESTMW | BW | Logical AND and set mask for 8- or 16-bit integers.
VPTESTNMB, VPTESTNMW | BW | Logical NAND and set mask for 8- or 16-bit integers.
The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress saves only the values marked in the mask, storing them compacted by skipping and not reserving space for unmarked values. Expand operates in the opposite way, loading as many values as indicated in the mask and then spreading them to the selected positions.
Instruction | Description
VCOMPRESSPD, VCOMPRESSPS | Store sparse packed double/single-precision floating-point values into dense memory
VPCOMPRESSD, VPCOMPRESSQ | Store sparse packed doubleword/quadword integer values into dense memory/register
VEXPANDPD, VEXPANDPS | Load sparse packed double/single-precision floating-point values from dense memory
VPEXPANDD, VPEXPANDQ | Load sparse packed doubleword/quadword integer values from dense memory/register
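The compress/expand semantics can be sketched in Python (a semantic model on small lane lists, not real intrinsics; zeroing of the unfilled tail lanes is assumed here):

```python
# Sketch of AVX-512 compress/expand (APL-style) on a list of lanes
# with an integer opmask.

def compress(src, mask):
    """Pack the mask-selected lanes contiguously at the front."""
    kept = [v for i, v in enumerate(src) if (mask >> i) & 1]
    return kept + [0] * (len(src) - len(kept))   # zero the tail (assumed)

def expand(src, mask):
    """Spread consecutive source lanes into the mask-selected slots."""
    out, it = [], iter(src)
    for i in range(len(src)):
        out.append(next(it) if (mask >> i) & 1 else 0)
    return out

assert compress([10, 20, 30, 40], 0b1010) == [20, 40, 0, 0]
assert expand([20, 40, 0, 0], 0b1010) == [0, 20, 0, 40]
```

Note that expand is the inverse of compress for the lanes selected by the same mask.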
A new set of permute instructions has been added for full two-input permutations. They all take three arguments, two source registers and one index; the result is output by either overwriting the first source register or the index register. AVX-512BW extends the instructions to also include 16-bit (word) versions, and the AVX-512_VBMI extension defines the byte versions of the instructions.
Instruction | Extension set | Description
VPERMB | VBMI | Permute packed byte elements.
VPERMW | BW | Permute packed word elements.
VPERMT2B | VBMI | Full byte permute overwriting first source.
VPERMT2W | BW | Full word permute overwriting first source.
VPERMI2PD, VPERMI2PS | F | Full single/double floating-point permute overwriting the index.
VPERMI2D, VPERMI2Q | F | Full doubleword/quadword permute overwriting the index.
VPERMI2B | VBMI | Full byte permute overwriting the index.
VPERMI2W | BW | Full word permute overwriting the index.
VPERMT2PS, VPERMT2PD | F | Full single/double floating-point permute overwriting first source.
VPERMT2D, VPERMT2Q | F | Full doubleword/quadword permute overwriting first source.
VSHUFF32x4, VSHUFF64x2, VSHUFI32x4, VSHUFI64x2 | F | Shuffle four packed 128-bit lines.
VPMULTISHIFTQB | VBMI | Select packed unaligned bytes from quadword sources.
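The two-source permute can be sketched in Python (a semantic model assuming 4-lane vectors for brevity; real ZMM vectors have 16 doubleword lanes, and `vpermt2` is a hypothetical helper, not an intrinsic):

```python
# Sketch of VPERMT2D-style two-source permutation: each index selects
# from the concatenation of the two sources, so the extra index bit
# effectively picks which source table is used.

def vpermt2(src1, idx, src2):
    n = len(src1)
    table = src1 + src2                  # 2n-entry lookup table
    return [table[i % (2 * n)] for i in idx]

a = [10, 11, 12, 13]
b = [20, 21, 22, 23]
# indices 0..3 pick from a, indices 4..7 pick from b
assert vpermt2(a, [0, 5, 3, 6], b) == [10, 21, 13, 22]
```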
Two new instructions can logically implement all possible bitwise operations between three inputs. They take three registers as input and an 8-bit immediate field. Each bit in the output is generated using a lookup of the three corresponding bits in the inputs to select one of the 8 positions in the 8-bit immediate. Since only 8 combinations are possible using three bits, this allows all possible 3-input bitwise operations to be performed.^{[6]} These are the only bitwise vector instructions in AVX-512F; EVEX versions of the two-source SSE and AVX bitwise vector instructions AND, ANDN, OR and XOR were added in AVX-512DQ.
The only difference between the doubleword and quadword versions is the application of the opmask.
Instruction | Description
VPTERNLOGD, VPTERNLOGQ | Bitwise ternary logic
A0 | A1 | A2 | Double AND (0x80) | Double OR (0xFE) | Bitwise blend (0xCA)
0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1 | 1
0 | 1 | 0 | 0 | 1 | 0
0 | 1 | 1 | 0 | 1 | 1
1 | 0 | 0 | 0 | 1 | 0
1 | 0 | 1 | 0 | 1 | 0
1 | 1 | 0 | 0 | 1 | 1
1 | 1 | 1 | 1 | 1 | 1
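The lookup can be modeled directly in Python (a sketch, not an intrinsic), which also verifies the three example immediates above — the immediate is simply the truth table of the desired 3-input boolean function:

```python
# Sketch of VPTERNLOG: each output bit indexes the 8-bit immediate
# with the three corresponding input bits.

def vpternlog(a, b, c, imm8, width=8):
    out = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((imm8 >> idx) & 1) << i
    return out

a, b, c = 0b1100, 0b1010, 0b1111
assert vpternlog(a, b, c, 0x80) == a & b & c       # three-input AND
assert vpternlog(a, b, c, 0xFE) == a | b | c       # three-input OR
# 0xCA: bitwise blend -- bits of b where a is 1, bits of c where a is 0
assert vpternlog(a, b, c, 0xCA) == (a & b) | (~a & c & 0xF)
```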
A number of conversion or move instructions were added; these complete the set of conversion instructions available since SSE2.
Instruction | Extension set | Description
… | F | Down convert quadword or doubleword to doubleword, word or byte; unsaturated, saturated or saturated unsigned. The reverse of the sign/zero extend instructions from SSE4.1.
VPMOVWB, VPMOVSWB, VPMOVUSWB | BW | Down convert word to byte; unsaturated, saturated or saturated unsigned.
VCVTPS2UDQ, VCVTPD2UDQ, VCVTTPS2UDQ, VCVTTPD2UDQ | F | Convert with or without truncation, packed single- or double-precision floating point to packed unsigned doubleword integers.
VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI | F | Convert with or without truncation, scalar single- or double-precision floating point to unsigned doubleword integer.
VCVTPS2QQ, VCVTPD2QQ, VCVTPS2UQQ, VCVTPD2UQQ, VCVTTPS2QQ, VCVTTPD2QQ, VCVTTPS2UQQ, VCVTTPD2UQQ | DQ | Convert with or without truncation, packed single- or double-precision floating point to packed signed or unsigned quadword integers.
VCVTUDQ2PS, VCVTUDQ2PD | F | Convert packed unsigned doubleword integers to packed single- or double-precision floating point.
VCVTUSI2PS, VCVTUSI2PD | F | Convert scalar unsigned doubleword integers to single- or double-precision floating point.
VCVTUSI2SD, VCVTUSI2SS | F | Convert scalar unsigned integers to single- or double-precision floating point.
VCVTUQQ2PS, VCVTUQQ2PD | DQ | Convert packed unsigned quadword integers to packed single- or double-precision floating point.
VCVTQQ2PD, VCVTQQ2PS | F | Convert packed quadword integers to packed single- or double-precision floating point.
Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.
Instruction | Description
VGETEXPPD, VGETEXPPS | Convert exponents of packed fp values into fp values
VGETEXPSD, VGETEXPSS | Convert exponent of scalar fp value into fp value
VGETMANTPD, VGETMANTPS | Extract vector of normalized mantissas from float32/float64 vector
VGETMANTSD, VGETMANTSS | Extract float32/float64 of normalized mantissa from float32/float64 scalar
VFIXUPIMMPD, VFIXUPIMMPS | Fix up special packed float32/float64 values
VFIXUPIMMSD, VFIXUPIMMSS | Fix up special scalar float32/float64 value
This is the second set of new floating-point methods, which includes new scaling and approximate calculation of the reciprocal, and of the reciprocal of the square root. The approximate reciprocal instructions guarantee a relative error of at most 2^{−14}.^{[6]}
Instruction | Description
VRCP14PD, VRCP14PS | Compute approximate reciprocals of packed float32/float64 values
VRCP14SD, VRCP14SS | Compute approximate reciprocal of scalar float32/float64 value
VRNDSCALEPS, VRNDSCALEPD | Round packed float32/float64 values to include a given number of fraction bits
VRNDSCALESS, VRNDSCALESD | Round scalar float32/float64 value to include a given number of fraction bits
VRSQRT14PD, VRSQRT14PS | Compute approximate reciprocals of square roots of packed float32/float64 values
VRSQRT14SD, VRSQRT14SS | Compute approximate reciprocal of square root of scalar float32/float64 value
VSCALEFPS, VSCALEFPD | Scale packed float32/float64 values with float32/float64 values
VSCALEFSS, VSCALEFSD | Scale scalar float32/float64 value with float32/float64 value
Instruction | Extension set | Description
VBROADCASTSS, VBROADCASTSD | F, VL | Broadcast single/double floating-point value
VPBROADCASTB, VPBROADCASTW, VPBROADCASTD, VPBROADCASTQ | F, VL, DQ, BW | Broadcast a byte/word/doubleword/quadword integer value
VBROADCASTI32X2, VBROADCASTI64X2, VBROADCASTI32X4, VBROADCASTI32X8, VBROADCASTI64X4 | F, VL, DQ, BW | Broadcast two or four doubleword/quadword integer values
Instruction | Extension set | Description
VALIGND, VALIGNQ | F, VL | Align doubleword or quadword vectors
VDBPSADBW | BW | Double block packed sum of absolute differences (SAD) on unsigned bytes
VPABSQ | F | Packed absolute value quadword
VPMAXSQ, VPMAXUQ | F | Maximum of packed signed/unsigned quadword
VPMINSQ, VPMINUQ | F | Minimum of packed signed/unsigned quadword
VPROLD, VPROLVD, VPROLQ, VPROLVQ, VPRORD, VPRORVD, VPRORQ, VPRORVQ | F | Bit rotate left or right
VPSCATTERDD, VPSCATTERDQ, VPSCATTERQD, VPSCATTERQQ | F | Scatter packed doubleword/quadword with signed doubleword and quadword indices
VSCATTERDPS, VSCATTERDPD, VSCATTERQPS, VSCATTERQPD | F | Scatter packed float32/float64 with signed doubleword and quadword indices
The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.^{[8]}
Instruction | Name | Description
VPCONFLICTD, VPCONFLICTQ | Detect conflicts within vector of packed doubleword or quadword values | Compares each element in the first source to all elements at the same or earlier positions in the second source and forms a bit vector of the results
VPLZCNTD, VPLZCNTQ | Count the number of leading zero bits for packed doubleword or quadword values | Vectorized LZCNT instruction
VPBROADCASTMB2Q, VPBROADCASTMW2D | Broadcast mask to vector register | Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector
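The conflict-detection operation can be sketched in Python (a semantic model, not an intrinsic; `vpconflict` is a hypothetical helper comparing each lane against earlier lanes):

```python
# Sketch of VPCONFLICTD-style semantics: for each lane, compare against
# all earlier lanes and record equality matches as a per-lane bit vector.

def vpconflict(src):
    return [sum(1 << j for j in range(i) if src[j] == src[i])
            for i in range(len(src))]

# lane 2 repeats lane 0; lane 3 repeats lanes 0 and 2
assert vpconflict([7, 3, 7, 7]) == [0, 0, 0b001, 0b101]
```

A loop with indexed stores can use such bit vectors to find which iterations touch the same address and must be serialized.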
AVX-512 exponential and reciprocal (AVX-512ER) instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; their relative error is at most 2^{−28}. They also contain two new exponential functions that have a relative error of at most 2^{−23}.^{[6]}
Instruction | Description
VEXP2PD, VEXP2PS | Compute approximate exponential 2^x of packed single- or double-precision floating-point values
VRCP28PD, VRCP28PS | Compute approximate reciprocals of packed single- or double-precision floating-point values
VRCP28SD, VRCP28SS | Compute approximate reciprocal of scalar single- or double-precision floating-point value
VRSQRT28PD, VRSQRT28PS | Compute approximate reciprocals of square roots of packed single- or double-precision floating-point values
VRSQRT28SD, VRSQRT28SS | Compute approximate reciprocal of square root of scalar single- or double-precision floating-point value
AVX-512 prefetch (AVX-512PF) instructions contain new prefetch operations for the scatter and gather functionality introduced in AVX2 and AVX-512. The T0 hint means prefetching into the level 1 cache, and T1 means prefetching into the level 2 cache.
Instruction | Description
VGATHERPF0DPS, VGATHERPF0QPS, VGATHERPF0DPD, VGATHERPF0QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
VGATHERPF1DPS, VGATHERPF1QPS, VGATHERPF1DPD, VGATHERPF1QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
VSCATTERPF0DPS, VSCATTERPF0QPS, VSCATTERPF0DPD, VSCATTERPF0QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
VSCATTERPF1DPS, VSCATTERPF1QPS, VSCATTERPF1DPD, VSCATTERPF1QPD | Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T1 hint with intent to write.
These two sets of instructions perform multiple iterations of processing. They are generally only found in Xeon Phi products.
Instruction | Extension set | Description
V4FMADDPS, V4FMADDSS | 4FMAPS | Packed/scalar single-precision floating-point fused multiply-add (4 iterations)
V4FNMADDPS, V4FNMADDSS | 4FMAPS | Packed/scalar single-precision floating-point fused multiply-add and negate (4 iterations)
VP4DPWSSD | 4VNNIW | Dot product of signed words with doubleword accumulation (4 iterations)
VP4DPWSSDS | 4VNNIW | Dot product of signed words with doubleword accumulation and saturation (4 iterations)
AVX-512DQ adds new doubleword and quadword instructions. AVX-512BW adds byte and word versions of the same instructions, and byte and word versions of the doubleword/quadword instructions in AVX-512F. A few instructions that get only word forms with AVX-512BW acquire byte forms with the AVX-512_VBMI extension (VPERMB, VPERMI2B, VPERMT2B, VPMULTISHIFTQB).
Two new instructions were added to the mask instruction set: KADD and KTEST (B and W forms with AVX-512DQ, D and Q with AVX-512BW). The rest of the mask instructions, which had only word forms, got byte forms with AVX-512DQ and doubleword/quadword forms with AVX-512BW. KUNPCKBW was extended to KUNPCKWD and KUNPCKDQ by AVX-512BW.
Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F, among them all of the two-input bitwise instructions and the integer extract/insert instructions.
Instructions that are completely new are covered below.
Three new floating-point operations are introduced. Since they are entirely new, not merely EVEX versions of existing instructions, they exist in both packed/SIMD and scalar versions.
The VFPCLASS instructions test whether the floating-point value is one of eight special floating-point values; which of the eight values triggers a bit in the output mask register is controlled by the immediate field. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which also controls whether the operation is performed on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source and subtract from it the source value rounded to the number of fraction bits specified in the immediate field (that is, its integer part plus that many fraction bits).
Instruction | Extension set | Description
VFPCLASSPS, VFPCLASSPD | DQ | Test types of packed single- and double-precision floating-point values.
VFPCLASSSS, VFPCLASSSD | DQ | Test types of scalar single- and double-precision floating-point values.
VRANGEPS, VRANGEPD | DQ | Range restriction calculation for packed floating-point values.
VRANGESS, VRANGESD | DQ | Range restriction calculation for scalar floating-point values.
VREDUCEPS, VREDUCEPD | DQ | Perform reduction transformation on packed floating-point values.
VREDUCESS, VREDUCESD | DQ | Perform reduction transformation on scalar floating-point values.
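The reduction transformation can be sketched in Python (a simplified model assuming the round-to-nearest-even mode; the real instruction's rounding mode and exponent handling are controlled by the immediate field):

```python
# Sketch of the VREDUCE operation: subtract from the source its value
# rounded to m fraction bits, leaving only the remaining low-order
# fraction. Round-to-nearest-even is assumed here.

def vreduce(x, m):
    rounded = round(x * 2**m) / 2**m   # keep m fraction bits
    return x - rounded

assert vreduce(1.8125, 2) == 0.0625    # 1.8125 - 1.75
assert vreduce(1.8125, 0) == -0.1875   # 1.8125 - 2.0
```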
Instruction | Extension set | Description
VPMOVM2D, VPMOVM2Q | DQ | Convert mask register to doubleword or quadword vector register.
VPMOVM2B, VPMOVM2W | BW | Convert mask register to byte or word vector register.
VPMOVD2M, VPMOVQ2M | DQ | Convert doubleword or quadword vector register to mask register.
VPMOVB2M, VPMOVW2M | BW | Convert byte or word vector register to mask register.
VPMULLQ | DQ | Multiply packed quadwords, store low result. A quadword version of VPMULLD.
AVX-512 VBMI2 extends VPCOMPRESS and VPEXPAND with byte and word variants; the shift instructions are new.
Instruction | Description
VPCOMPRESSB, VPCOMPRESSW | Store sparse packed byte/word integer values into dense memory/register
VPEXPANDB, VPEXPANDW | Load sparse packed byte/word integer values from dense memory/register
VPSHLD | Concatenate and shift packed data left logical
VPSHLDV | Concatenate and variable shift packed data left logical
VPSHRD | Concatenate and shift packed data right logical
VPSHRDV | Concatenate and variable shift packed data right logical
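The concatenate-and-shift operation can be sketched per lane in Python (a semantic model assuming 16-bit lanes, with the first source in the high half, as in the word form; `vpshld` is a hypothetical helper, not an intrinsic):

```python
# Sketch of VPSHLD-style concatenate-and-shift: the result lane is the
# high half of the 2*width-bit value (a:b) shifted left by the count.

def vpshld(a, b, count, width=16):
    full = ((a << width) | b) & ((1 << (2 * width)) - 1)
    return ((full << (count % width)) >> width) & ((1 << width) - 1)

# shifting in the top 4 bits of b after a's low 12 bits
assert vpshld(0x1234, 0xF000, 4) == 0x234F
```

This gives a funnel shift: bits shifted out of one operand are replaced by bits shifted in from the other.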
Vector Neural Network Instructions:^{[9]} AVX-512VNNI adds the EVEX-coded instructions described below. With AVX-512F, these instructions can operate on 512-bit vectors, and AVX-512VL further adds support for 128- and 256-bit vectors.
A later AVX-VNNI extension adds VEX encodings of these instructions, which can only operate on 128- or 256-bit vectors. AVX-VNNI is not part of the AVX-512 suite; it does not require AVX-512F and can be implemented independently.
Instruction | Description
VPDPBUSD | Multiply and add unsigned and signed bytes
VPDPBUSDS | Multiply and add unsigned and signed bytes with saturation
VPDPWSSD | Multiply and add signed word integers
VPDPWSSDS | Multiply and add word integers with saturation
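The byte dot-product can be sketched for a single 32-bit lane in Python (a semantic model ignoring 32-bit wraparound of the non-saturating form; `vpdpbusd_lane` is a hypothetical helper):

```python
# Sketch of one VPDPBUSD lane: four unsigned bytes from one source are
# multiplied pairwise with four signed bytes from the other, and the
# four products are summed into the doubleword accumulator.

def vpdpbusd_lane(acc, ubytes, sbytes):
    assert all(0 <= u <= 255 for u in ubytes)
    assert all(-128 <= s <= 127 for s in sbytes)
    return acc + sum(u * s for u, s in zip(ubytes, sbytes))

# 1*5 + 2*(-6) + 3*7 + 4*(-8) = -18, accumulated into 10
assert vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]) == -8
```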
Integer fused multiply-add instructions. AVX-512IFMA adds the EVEX-coded instructions described below.
A separate AVX-IFMA instruction set extension defines VEX encodings of these instructions. This extension is not part of the AVX-512 suite and can be implemented independently.
Instruction | Extension set | Description
VPMADD52LUQ | IFMA | Packed multiply of unsigned 52-bit integers and add the low 52-bit products to 64-bit accumulators
VPMADD52HUQ | IFMA | Packed multiply of unsigned 52-bit integers and add the high 52-bit products to 64-bit accumulators
Instruction | Extension set | Description
VPOPCNTD, VPOPCNTQ | VPOPCNTDQ | Return the number of bits set to 1 in doubleword/quadword
VPOPCNTB, VPOPCNTW | BITALG | Return the number of bits set to 1 in byte/word
VPSHUFBITQMB | BITALG | Shuffle bits from quadword elements using byte indexes into mask
Instruction | Extension set | Description
VP2INTERSECTD, VP2INTERSECTQ | VP2INTERSECT | Compute intersection between doublewords/quadwords to a pair of mask registers
The Galois field new instructions (GFNI) are useful for cryptography,^{[10]} as they can be used to implement Rijndael-style S-boxes such as those used in AES, Camellia, and SM4.^{[11]} These instructions may also be used for bit manipulation in networking and signal processing.^{[10]}
GFNI is a standalone instruction set extension and can be enabled separately from AVX or AVX-512. Depending on whether AVX and AVX-512F support is indicated by the CPU, GFNI support enables legacy (SSE), VEX- or EVEX-coded instructions operating on 128-, 256- or 512-bit vectors.
Instruction | Description
VGF2P8AFFINEINVQB | Galois field affine transformation inverse
VGF2P8AFFINEQB | Galois field affine transformation
VGF2P8MULB | Galois field multiply bytes
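The byte multiply can be sketched in Python (a semantic model of GF(2^8) multiplication; the reduction polynomial x^8 + x^4 + x^3 + x + 1, i.e. 0x11B, is the one used by AES):

```python
# Sketch of VGF2P8MULB per-byte semantics: carry-less multiplication of
# two bytes, reduced modulo x^8 + x^4 + x^3 + x + 1 (0x11B).

def gf2p8mul(a, b):
    r = 0
    for i in range(8):
        if (b >> i) & 1:
            r ^= a << i              # carry-less partial products
    for i in range(15, 7, -1):       # reduce modulo the polynomial
        if (r >> i) & 1:
            r ^= 0x11B << (i - 8)
    return r

assert gf2p8mul(0x02, 0x87) == 0x15   # doubling wraps through 0x11B
assert gf2p8mul(0x53, 0xCA) == 0x01   # 0x53 and 0xCA are GF(2^8) inverses
```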
VPCLMULQDQ with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. With AVX-512VL, it adds EVEX-encoded 256- and 128-bit versions. VPCLMULQDQ alone (that is, on non-AVX-512 CPUs) adds only the VEX-encoded 256-bit version. (Availability of the VEX-encoded 128-bit version is indicated by different CPUID bits: PCLMULQDQ and AVX.) The wider-than-128-bit variations of the instruction perform the same operation on each 128-bit portion of the input registers, but they do not extend it to select quadwords from different 128-bit fields (the meaning of the imm8 operand is the same: either the low or the high quadword of the 128-bit field is selected).
Instruction | Description
VPCLMULQDQ | Carry-less multiplication quadword
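Carry-less multiplication itself can be sketched in Python (a semantic model on arbitrary-width integers; the real instruction multiplies two 64-bit quadwords into a 128-bit result):

```python
# Sketch of carry-less (polynomial) multiplication as performed by
# (V)PCLMULQDQ: like schoolbook binary multiplication, but partial
# products are combined with XOR, so no carries propagate.

def clmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

assert clmul(0b11, 0b11) == 0b101   # (x+1)*(x+1) = x^2 + 1 over GF(2)
assert clmul(0x123, 1) == 0x123
```

This is the core primitive behind GHASH (AES-GCM) and fast CRC computation.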
VEX- and EVEX-encoded AES instructions. The wider-than-128-bit variations of the instructions perform the same operation on each 128-bit portion of the input registers. The VEX versions can be used without AVX-512 support.
Instruction | Description
VAESDEC | Perform one round of an AES decryption flow
VAESDECLAST | Perform last round of an AES decryption flow
VAESENC | Perform one round of an AES encryption flow
VAESENCLAST | Perform last round of an AES encryption flow
AI acceleration instructions operating on bfloat16 numbers.
Instruction | Description
VCVTNE2PS2BF16 | Convert two vectors of packed single-precision numbers into one vector of packed bfloat16 numbers
VCVTNEPS2BF16 | Convert one vector of packed single-precision numbers into one vector of packed bfloat16 numbers
VDPBF16PS | Calculate dot product of two bfloat16 pairs and accumulate the result into one packed single-precision number
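The float32-to-bfloat16 conversion can be sketched in Python (a simplified model of the round-to-nearest-even truncation; NaN quieting and denormal handling of the real instruction are ignored here):

```python
# Sketch of float32 -> bfloat16 conversion as in VCVTNEPS2BF16:
# keep the top 16 bits of the IEEE float32 encoding, rounding the
# discarded low half to nearest, ties to even.

import struct

def fp32_to_bf16(x):
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even bias
    return (bits >> 16) & 0xFFFF

def bf16_to_fp32(h):
    (x,) = struct.unpack('<f', struct.pack('<I', h << 16))
    return x

assert bf16_to_fp32(fp32_to_bf16(1.0)) == 1.0
assert fp32_to_bf16(3.140625) == 0x4049   # exactly representable in bf16
```

Because bfloat16 keeps the full float32 exponent, the conversion only drops precision, never range.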
An extension of the earlier F16C instruction set, adding comprehensive support for binary16 floating-point numbers (also known as FP16, float16 or half-precision floating-point numbers). The new instructions implement most operations that were previously available for single- and double-precision floating-point numbers and also introduce new complex-number instructions and conversion instructions. Scalar and packed operations are supported.
Unlike the single- and double-precision format instructions, the half-precision operands are neither conditionally flushed to zero (FTZ) nor conditionally treated as zero (DAZ) based on MXCSR settings. Subnormal values are processed at full speed by hardware to facilitate using the full dynamic range of FP16 numbers. Instructions that create FP32 and FP64 numbers still respect the MXCSR.FTZ bit.^{[12]}
Instruction | Description
VADDPH, VADDSH | Add packed/scalar FP16 numbers.
VSUBPH, VSUBSH | Subtract packed/scalar FP16 numbers.
VMULPH, VMULSH | Multiply packed/scalar FP16 numbers.
VDIVPH, VDIVSH | Divide packed/scalar FP16 numbers.
VSQRTPH, VSQRTSH | Compute square root of packed/scalar FP16 numbers.
VFMADD{132, 213, 231}PH, VFMADD{132, 213, 231}SH | Multiply-add packed/scalar FP16 numbers.
VFNMADD{132, 213, 231}PH, VFNMADD{132, 213, 231}SH | Negated multiply-add packed/scalar FP16 numbers.
VFMSUB{132, 213, 231}PH, VFMSUB{132, 213, 231}SH | Multiply-subtract packed/scalar FP16 numbers.
VFNMSUB{132, 213, 231}PH, VFNMSUB{132, 213, 231}SH | Negated multiply-subtract packed/scalar FP16 numbers.
VFMADDSUB{132, 213, 231}PH | Multiply-add (odd vector elements) or multiply-subtract (even vector elements) packed FP16 numbers.
VFMSUBADD{132, 213, 231}PH | Multiply-subtract (odd vector elements) or multiply-add (even vector elements) packed FP16 numbers.
VREDUCEPH, VREDUCESH | Perform reduction transformation of the packed/scalar FP16 numbers.
VRNDSCALEPH, VRNDSCALESH | Round packed/scalar FP16 numbers to a given number of fraction bits.
VSCALEFPH, VSCALEFSH | Scale packed/scalar FP16 numbers by multiplying them by a power of two.
Instruction | Description
VFMULCPH, VFMULCSH | Multiply packed/scalar complex FP16 numbers.
VFCMULCPH, VFCMULCSH | Multiply packed/scalar complex FP16 numbers. Complex conjugate form of the operation.
VFMADDCPH, VFMADDCSH | Multiply-add packed/scalar complex FP16 numbers.
VFCMADDCPH, VFCMADDCSH | Multiply-add packed/scalar complex FP16 numbers. Complex conjugate form of the operation.
Instruction | Description
VRCPPH, VRCPSH | Compute approximate reciprocal of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than 2^{−11}+2^{−14}.
VRSQRTPH, VRSQRTSH | Compute approximate reciprocal square root of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than 2^{−14}.
Instruction  Description 

VCMPPH, VCMPSH: Compare packed/scalar FP16 numbers and store the result in a mask register.
VCOMISH: Compare scalar FP16 numbers and store the result in the flags register. Signals an exception if a source operand is QNaN or SNaN.
VUCOMISH: Compare scalar FP16 numbers and store the result in the flags register. Signals an exception only if a source operand is SNaN.
VMAXPH, VMAXSH: Select the maximum of each vertical pair of the source packed/scalar FP16 numbers.
VMINPH, VMINSH: Select the minimum of each vertical pair of the source packed/scalar FP16 numbers.
VFPCLASSPH, VFPCLASSSH: Test packed/scalar FP16 numbers for special categories (NaN, infinity, negative zero, etc.) and store the result in a mask register.
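The category test of VFPCLASS can be modelled with Python floats. This is a sketch only: the real instruction selects categories via an immediate bitmask and also distinguishes QNaN from SNaN and detects denormals, which plain Python floats cannot express:

```python
import math

CATEGORY_TESTS = {
    "nan":      math.isnan,
    "pos_inf":  lambda v: v == math.inf,
    "neg_inf":  lambda v: v == -math.inf,
    "zero":     lambda v: v == 0.0,
    "neg_zero": lambda v: v == 0.0 and math.copysign(1.0, v) < 0,
}

def fpclass(x, categories):
    # Model of VFPCLASS: true if x falls in any selected category.
    return any(CATEGORY_TESTS[c](x) for c in categories)

print(fpclass(-0.0, ["neg_zero"]))                # True
print(fpclass(0.0, ["neg_zero"]))                 # False
print(fpclass(float("nan"), ["nan", "neg_inf"]))  # True
```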
Instruction  Description 

VCVTW2PH: Convert packed signed 16-bit integers to FP16 numbers.
VCVTUW2PH: Convert packed unsigned 16-bit integers to FP16 numbers.
VCVTDQ2PH: Convert packed signed 32-bit integers to FP16 numbers.
VCVTUDQ2PH: Convert packed unsigned 32-bit integers to FP16 numbers.
VCVTQQ2PH: Convert packed signed 64-bit integers to FP16 numbers.
VCVTUQQ2PH: Convert packed unsigned 64-bit integers to FP16 numbers.
VCVTPS2PHX: Convert packed FP32 numbers to FP16 numbers. Unlike VCVTPS2PH from F16C, VCVTPS2PHX uses a different encoding that also supports broadcasting.
VCVTPD2PH: Convert packed FP64 numbers to FP16 numbers.
VCVTSI2SH: Convert a scalar signed 32-bit or 64-bit integer to an FP16 number.
VCVTUSI2SH: Convert a scalar unsigned 32-bit or 64-bit integer to an FP16 number.
VCVTSS2SH: Convert a scalar FP32 number to an FP16 number.
VCVTSD2SH: Convert a scalar FP64 number to an FP16 number.
VCVTPH2W, VCVTTPH2W: Convert packed FP16 numbers to signed 16-bit integers. VCVTPH2W rounds according to the MXCSR register; VCVTTPH2W rounds toward zero.
VCVTPH2UW, VCVTTPH2UW: Convert packed FP16 numbers to unsigned 16-bit integers. VCVTPH2UW rounds according to the MXCSR register; VCVTTPH2UW rounds toward zero.
VCVTPH2DQ, VCVTTPH2DQ: Convert packed FP16 numbers to signed 32-bit integers. VCVTPH2DQ rounds according to the MXCSR register; VCVTTPH2DQ rounds toward zero.
VCVTPH2UDQ, VCVTTPH2UDQ: Convert packed FP16 numbers to unsigned 32-bit integers. VCVTPH2UDQ rounds according to the MXCSR register; VCVTTPH2UDQ rounds toward zero.
VCVTPH2QQ, VCVTTPH2QQ: Convert packed FP16 numbers to signed 64-bit integers. VCVTPH2QQ rounds according to the MXCSR register; VCVTTPH2QQ rounds toward zero.
VCVTPH2UQQ, VCVTTPH2UQQ: Convert packed FP16 numbers to unsigned 64-bit integers. VCVTPH2UQQ rounds according to the MXCSR register; VCVTTPH2UQQ rounds toward zero.
VCVTPH2PSX: Convert packed FP16 numbers to FP32 numbers. Unlike VCVTPH2PS from F16C, VCVTPH2PSX uses a different encoding that also supports broadcasting.
VCVTPH2PD: Convert packed FP16 numbers to FP64 numbers.
VCVTSH2SI, VCVTTSH2SI: Convert a scalar FP16 number to a signed 32-bit or 64-bit integer. VCVTSH2SI rounds according to the MXCSR register; VCVTTSH2SI rounds toward zero.
VCVTSH2USI, VCVTTSH2USI: Convert a scalar FP16 number to an unsigned 32-bit or 64-bit integer. VCVTSH2USI rounds according to the MXCSR register; VCVTTSH2USI rounds toward zero.
VCVTSH2SS: Convert a scalar FP16 number to an FP32 number.
VCVTSH2SD: Convert a scalar FP16 number to an FP64 number.
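The only difference between the VCVT* and VCVTT* float-to-integer forms is the rounding step, which can be modelled as follows (a sketch; the default MXCSR mode is round to nearest with ties to even, which happens to match Python's built-in round):

```python
import math

def cvt_truncate(x):
    # Model of the VCVTT* forms: always round toward zero.
    return math.trunc(x)

def cvt_nearest(x):
    # Model of the VCVT* forms under the default MXCSR rounding
    # mode (round to nearest, ties to even).
    return round(x)

print(cvt_truncate(-2.7), cvt_nearest(-2.7))  # -2 -3
print(cvt_truncate(2.5), cvt_nearest(2.5))    # 2 2  (tie rounds to even)
```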
Instruction  Description 

VGETEXPPH, VGETEXPSH: Extract the exponent components of packed/scalar FP16 numbers as FP16 numbers.
VGETMANTPH, VGETMANTSH: Extract the mantissa components of packed/scalar FP16 numbers as FP16 numbers.
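These two instructions decompose a number so that |x| = mantissa × 2^exponent, with the mantissa normalized by default to [1, 2). A Python sketch using math.frexp, which normalizes to [0.5, 1) and so needs a shift by one; the real instructions additionally return the results as FP16 values and have special-case handling for zero, NaN, and infinity:

```python
import math

def getexp_getmant(x):
    # Model of VGETEXP/VGETMANT: |x| == mant * 2**exp, mant in [1, 2).
    m, e = math.frexp(abs(x))  # m in [0.5, 1), abs(x) == m * 2**e
    return float(e - 1), m * 2.0

print(getexp_getmant(12.0))  # (3.0, 1.5)  since 12 == 1.5 * 2**3
print(getexp_getmant(1.0))   # (0.0, 1.0)
```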
Instruction  Description 

VMOVSH: Move a scalar FP16 number to/from memory or between vector registers.
VMOVW: Move a scalar FP16 number to/from memory or a general-purpose register.
Group  Legacy encoding  Instructions  AVX-512 extensions
The three legacy-encoding columns indicate whether a group has SSE/SSE2/MMX, AVX/SSE3/SSE4, and AVX2/FMA encodings; the four AVX-512 columns indicate which of the F, VL, BW, and DQ subsets provide its EVEX-encoded versions.
VADD  Yes  Yes  No  VADDPD , VADDPS , VADDSD , VADDSS

Y  Y  N  N 
VAND  VANDPD , VANDPS , VANDNPD , VANDNPS

N  Y  
VCMP  VCMPPD , VCMPPS , VCMPSD , VCMPSS

Y  N  N  
VCOM  VCOMISD , VCOMISS
 
VDIV  VDIVPD , VDIVPS , VDIVSD , VDIVSS

Y  
VCVT  VCVTDQ2PD , VCVTDQ2PS , VCVTPD2DQ , VCVTPD2PS , VCVTPH2PS , VCVTPS2PH , VCVTPS2DQ , VCVTPS2PD , VCVTSD2SI , VCVTSD2SS , VCVTSI2SD , VCVTSI2SS , VCVTSS2SD , VCVTSS2SI , VCVTTPD2DQ , VCVTTPS2DQ , VCVTTSD2SI , VCVTTSS2SI
 
VMAX  VMAXPD , VMAXPS , VMAXSD , VMAXSS
 
VMIN  VMINPD , VMINPS , VMINSD , VMINSS

N  
VMOV  VMOVAPD , VMOVAPS , VMOVD , VMOVQ , VMOVDDUP , VMOVHLPS , VMOVHPD , VMOVHPS , VMOVLHPS , VMOVLPD , VMOVLPS , VMOVNTDQA , VMOVNTDQ , VMOVNTPD , VMOVNTPS , VMOVSD , VMOVSHDUP , VMOVSLDUP , VMOVSS , VMOVUPD , VMOVUPS , VMOVDQA32 , VMOVDQA64 , VMOVDQU8 , VMOVDQU16 , VMOVDQU32 , VMOVDQU64

Y  Y  
VMUL  VMULPD , VMULPS , VMULSD , VMULSS

N  
VOR  VORPD , VORPS

N  Y  
VSQRT  VSQRTPD , VSQRTPS , VSQRTSD , VSQRTSS

Y  N  
VSUB  VSUBPD , VSUBPS , VSUBSD , VSUBSS
 
VUCOMI  VUCOMISD , VUCOMISS

N  
VUNPCK  VUNPCKHPD , VUNPCKHPS , VUNPCKLPD , VUNPCKLPS

Y  
VXOR  VXORPD , VXORPS

N  Y  
VEXTRACTPS  No  Yes  No  VEXTRACTPS

Y  N  N  
VINSERTPS  VINSERTPS
 
VPEXTR  VPEXTRB , VPEXTRW , VPEXTRD , VPEXTRQ

N  Y  Y  
VPINSR  VPINSRB , VPINSRW , VPINSRD , VPINSRQ
 
VPACK  Yes  Yes  Yes  VPACKSSWB , VPACKSSDW , VPACKUSDW , VPACKUSWB

Y  N  
VPADD  VPADDB , VPADDW , VPADDD , VPADDQ , VPADDSB , VPADDSW , VPADDUSB , VPADDUSW

Y  
VPAND  VPANDD , VPANDQ , VPANDND , VPANDNQ

N  
VPAVG  VPAVGB , VPAVGW

N  Y  
VPCMP  VPCMPEQB , VPCMPEQW , VPCMPEQD , VPCMPEQQ , VPCMPGTB , VPCMPGTW , VPCMPGTD , VPCMPGTQ

Y  
VPMAX  VPMAXSB , VPMAXSW , VPMAXSD , VPMAXSQ , VPMAXUB , VPMAXUW , VPMAXUD , VPMAXUQ
 
VPMIN  VPMINSB , VPMINSW , VPMINSD , VPMINSQ , VPMINUB , VPMINUW , VPMINUD , VPMINUQ
 
VPMOV  VPMOVSXBW , VPMOVSXBD , VPMOVSXBQ , VPMOVSXWD , VPMOVSXWQ , VPMOVSXDQ , VPMOVZXBW , VPMOVZXBD , VPMOVZXBQ , VPMOVZXWD , VPMOVZXWQ , VPMOVZXDQ
 
VPMUL  VPMULDQ , VPMULUDQ , VPMULHRSW , VPMULHUW , VPMULHW , VPMULLD , VPMULLQ , VPMULLW
 
VPOR  VPORD , VPORQ

N  
VPSUB  VPSUBB , VPSUBW , VPSUBD , VPSUBQ , VPSUBSB , VPSUBSW , VPSUBUSB , VPSUBUSW

Y  
VPUNPCK  VPUNPCKHBW , VPUNPCKHWD , VPUNPCKHDQ , VPUNPCKHQDQ , VPUNPCKLBW , VPUNPCKLWD , VPUNPCKLDQ , VPUNPCKLQDQ
 
VPXOR  VPXORD , VPXORQ

N  
VPSADBW  VPSADBW

N  Y  
VPSHUF  VPSHUFB , VPSHUFHW , VPSHUFLW , VPSHUFD , VPSLLDQ , VPSLLW , VPSLLD , VPSLLQ , VPSRAW , VPSRAD , VPSRAQ , VPSRLDQ , VPSRLW , VPSRLD , VPSRLQ , VPSLLVW , VPSLLVD , VPSLLVQ , VPSRLVW , VPSRLVD , VPSRLVQ , VSHUFPD , VSHUFPS

Y  
VEXTRACT  No  Yes  Yes  VEXTRACTF32X4 , VEXTRACTF64X2 , VEXTRACTF32X8 , VEXTRACTF64X4 , VEXTRACTI32X4 , VEXTRACTI64X2 , VEXTRACTI32X8 , VEXTRACTI64X4

N  Y  
VINSERT  VINSERTF32X4 , VINSERTF64X2 , VINSERTF32X8 , VINSERTF64X4 , VINSERTI32X4 , VINSERTI64X2 , VINSERTI32X8 , VINSERTI64X4
 
VPABS  VPABSB , VPABSW , VPABSD , VPABSQ

Y  N  
VPALIGNR  VPALIGNR

N  
VPERM  VPERMD , VPERMILPD , VPERMILPS , VPERMPD , VPERMPS , VPERMQ

Y  N  
VPMADD  VPMADDUBSW , VPMADDWD

N  Y  
VFMADD  No  No  Yes  VFMADD132PD , VFMADD213PD , VFMADD231PD , VFMADD132PS , VFMADD213PS , VFMADD231PS , VFMADD132SD , VFMADD213SD , VFMADD231SD , VFMADD132SS , VFMADD213SS , VFMADD231SS

Y  N  
VFMADDSUB  VFMADDSUB132PD , VFMADDSUB213PD , VFMADDSUB231PD , VFMADDSUB132PS , VFMADDSUB213PS , VFMADDSUB231PS
 
VFMSUBADD  VFMSUBADD132PD , VFMSUBADD213PD , VFMSUBADD231PD , VFMSUBADD132PS , VFMSUBADD213PS , VFMSUBADD231PS
 
VFMSUB  VFMSUB132PD , VFMSUB213PD , VFMSUB231PD , VFMSUB132PS , VFMSUB213PS , VFMSUB231PS , VFMSUB132SD , VFMSUB213SD , VFMSUB231SD , VFMSUB132SS , VFMSUB213SS , VFMSUB231SS
 
VFNMADD  VFNMADD132PD , VFNMADD213PD , VFNMADD231PD , VFNMADD132PS , VFNMADD213PS , VFNMADD231PS , VFNMADD132SD , VFNMADD213SD , VFNMADD231SD , VFNMADD132SS , VFNMADD213SS , VFNMADD231SS
 
VFNMSUB  VFNMSUB132PD , VFNMSUB213PD , VFNMSUB231PD , VFNMSUB132PS , VFNMSUB213PS , VFNMSUB231PS , VFNMSUB132SD , VFNMSUB213SD , VFNMSUB231SD , VFNMSUB132SS , VFNMSUB213SS , VFNMSUB231SS
 
VGATHER  VGATHERDPS , VGATHERDPD , VGATHERQPS , VGATHERQPD
 
VPGATHER  VPGATHERDD , VPGATHERDQ , VPGATHERQD , VPGATHERQQ
 
VPSRAV  VPSRAVW , VPSRAVD , VPSRAVQ

Y 
Subsets supported by each processor generation (the subsets ER, PF, 4FMAPS, and 4VNNIW appear only in the Xeon Phi line):

Knights Landing (Xeon Phi x200, 2016): F, CD, ER, PF.
Knights Mill (Xeon Phi x205, 2017): F, CD, ER, PF, 4FMAPS, 4VNNIW, VPOPCNTDQ.
Skylake-SP, Skylake-X (2017): F, CD, VL, DQ, BW.
Cannon Lake (2018): F, CD, VL, DQ, BW, IFMA, VBMI.
Cascade Lake (2019): F, CD, VL, DQ, BW, VNNI.
Cooper Lake (2020): F, CD, VL, DQ, BW, VNNI, BF16.
Ice Lake (2019): F, CD, VL, DQ, BW, IFMA, VBMI, VNNI, VPOPCNTDQ, VBMI2, BITALG, VPCLMULQDQ, GFNI, VAES.
Tiger Lake (2020): the Ice Lake set plus VP2INTERSECT.
Rocket Lake (2021): the Ice Lake set.
Alder Lake (2021): Partial.^{Note 1}
Zen 4 (2022): the Ice Lake set plus BF16.
Sapphire Rapids (2023): the Ice Lake set plus BF16 and FP16.
Zen 5 (2024): the Ice Lake set plus BF16 and VP2INTERSECT.
^Note 1 : Intel does not officially support the AVX-512 family of instructions on Alder Lake microprocessors. Intel has disabled AVX-512 in silicon (fused it off) on recent steppings of Alder Lake to prevent customers from enabling it.^{[33]} On older Alder Lake CPUs with certain legacy combinations of BIOS and microcode revisions, it was possible to execute AVX-512 instructions by disabling all the efficiency cores, which do not contain the silicon for AVX-512.^{[34]}^{[35]}^{[22]}
Intel Vectorization Advisor (starting with version 2017) supports native AVX-512 performance and vector code quality analysis (for "Core", Xeon, and Intel Xeon Phi processors). Along with the traditional hotspots profile, Advisor recommendations, and "seamless" integration of Intel compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. scatter, compress/expand, and mask utilization.^{[36]}^{[37]}
On some processors (mostly pre-Ice Lake Intel), AVX-512 instructions can cause even greater frequency throttling than their predecessors, imposing a penalty on mixed workloads. The additional downclocking is triggered by the 512-bit width of the vectors and depends on the nature of the instructions being executed; using the 128-bit or 256-bit forms of AVX-512 (AVX-512VL) does not trigger it. As a result, gcc and clang default to preferring 256-bit vectors for Intel targets (a preference that can be controlled with the -mprefer-vector-width option).^{[38]}^{[39]}^{[40]}
C/C++ compilers also automatically handle loop unrolling and avoid pipeline stalls in order to use AVX-512 most effectively, which means that a programmer who uses language intrinsics to force AVX-512 can sometimes end up with worse performance than the code the compiler generates from plainly written loops.^{[41]} In other cases, using AVX-512 intrinsics in C/C++ code can yield a performance improvement over plainly written C/C++.^{[42]}
There are many examples of AVX-512 applications, including media processing, cryptography, video games,^{[43]} neural networks,^{[44]} and even OpenJDK, which employs AVX-512 for sorting.^{[45]}
In a much-cited quote from 2020, Linus Torvalds said "I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on,"^{[46]} stating that he would prefer the transistor budget be spent on additional cores and integer performance instead, and that he "detests" floating-point benchmarks.^{[47]}
Numenta touts its "highly sparse"^{[48]} neural network technology, which it says obviates the need for GPUs, since its algorithms run on CPUs with AVX-512.^{[49]} It claims a tenfold speedup relative to the Nvidia A100, largely because its algorithms reduce the size of the neural network while maintaining accuracy, using techniques such as the Sparse Evolutionary Training (SET) algorithm^{[50]} and Foresight Pruning.^{[51]}