The FMA instruction set is an extension to the 128- and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations.[1] There are two variants, FMA3 and FMA4.
FMA3 and FMA4 instructions have almost identical functionality but are not compatible. Both contain fused multiply–add (FMA) instructions for floating-point scalar and SIMD operations, but FMA3 instructions have three operands, while FMA4 instructions have four. The FMA operation has the form d = round(a · b + c), where the round function rounds the exact value of a · b + c once so that it fits the destination register when it has more significant bits than the destination format can hold.
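The effect of the single rounding in d = round(a · b + c) can be illustrated with a minimal sketch. It does not use the hardware instructions; it assumes finite double-precision inputs and emulates the fused operation with exact rational arithmetic. The helper names fused_multiply_add and unfused_multiply_add are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate d = round(a * b + c) with a single rounding step.

    Assumption: a, b and c are finite doubles. The product and sum are
    computed exactly as rationals, and only the final conversion back to
    float rounds (to nearest, ties to even).
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


# a * b is exactly 1 - 2**-60, which needs more bits than a double holds.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(unfused_multiply_add(a, b, c))  # 0.0: the product was rounded to 1.0 first
print(fused_multiply_add(a, b, c))    # -2**-60: the small term survives
```

In this example the unfused sequence loses the term 2^-60 when the intermediate product is rounded, while the fused form keeps it because rounding happens only once, at the end.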
The four-operand form (FMA4) allows a, b, c and d to be four different registers, while the three-operand form (FMA3) requires that d be the same register as a, b or c. The three-operand form makes the code shorter and the hardware implementation slightly simpler, while the four-operand form provides more programming flexibility.
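The difference between the two operand forms can be sketched with a toy register file. This is only an illustration under stated assumptions: registers are modeled as dictionary entries holding single finite values, and the function names fma4_style and fma3_style are hypothetical, not real mnemonics.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Single-rounding a * b + c for finite inputs (emulated, not hardware)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def fma4_style(regs: dict, d: str, a: str, b: str, c: str) -> None:
    """Four-operand form: the destination d may differ from every source."""
    regs[d] = fma(regs[a], regs[b], regs[c])


def fma3_style(regs: dict, dest_and_summand: str, b: str, c: str) -> None:
    """Three-operand form: the destination is also one of the sources.

    Here the shared register is the summand; it is overwritten.
    """
    regs[dest_and_summand] = fma(regs[b], regs[c], regs[dest_and_summand])


# Four operands: all three inputs are still available afterwards.
regs4 = {"r0": 0.0, "r1": 2.0, "r2": 3.0, "r3": 4.0}
fma4_style(regs4, "r0", "r1", "r2", "r3")

# Three operands: r3 is both summand and destination, so its old value is lost
# unless it was copied elsewhere beforehand.
regs3 = {"r1": 2.0, "r2": 3.0, "r3": 4.0}
fma3_style(regs3, "r3", "r1", "r2")

print(regs4)  # r0 holds 10.0, r1..r3 unchanged
print(regs3)  # r3 holds 10.0, the previous 4.0 is gone
```

The extra copy that the three-operand form may need when an input must be preserved is the flexibility cost mentioned above; the benefit is one fewer operand to encode.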
See XOP instruction set for more discussion of compatibility issues between Intel and AMD.
Supported instructions include:
Mnemonic   Operation
VFMADD     result = + a · b + c
VFNMADD    result = − a · b + c
VFMSUB     result = + a · b − c
VFNMSUB    result = − a · b − c
VFMADDSUB  result = a · b + c for i = 1, 3, ...; result = a · b − c for i = 0, 2, ...
VFMSUBADD  result = a · b − c for i = 1, 3, ...; result = a · b + c for i = 0, 2, ...
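The six operations listed above can be sketched on plain Python lists, where the list index i stands for the element position within a packed register. This is a hedged emulation, not the hardware behaviour in every detail: it assumes finite inputs, does not model signed zeros, NaN or infinities, and the function names simply mirror the mnemonics for readability.

```python
from fractions import Fraction


def _fused(x: float, y: float, z: float, product_sign: int, addend_sign: int) -> float:
    """Compute (product_sign * x * y) + (addend_sign * z) with one rounding."""
    exact = product_sign * Fraction(x) * Fraction(y) + addend_sign * Fraction(z)
    return float(exact)


def vfmadd(a, b, c):   # result = + a * b + c
    return [_fused(x, y, z, +1, +1) for x, y, z in zip(a, b, c)]


def vfnmadd(a, b, c):  # result = - a * b + c
    return [_fused(x, y, z, -1, +1) for x, y, z in zip(a, b, c)]


def vfmsub(a, b, c):   # result = + a * b - c
    return [_fused(x, y, z, +1, -1) for x, y, z in zip(a, b, c)]


def vfnmsub(a, b, c):  # result = - a * b - c
    return [_fused(x, y, z, -1, -1) for x, y, z in zip(a, b, c)]


def vfmaddsub(a, b, c):  # add for odd i, subtract for even i
    return [_fused(x, y, z, +1, +1 if i % 2 else -1)
            for i, (x, y, z) in enumerate(zip(a, b, c))]


def vfmsubadd(a, b, c):  # subtract for odd i, add for even i
    return [_fused(x, y, z, +1, -1 if i % 2 else +1)
            for i, (x, y, z) in enumerate(zip(a, b, c))]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 10.0, 10.0, 10.0]
c = [1.0, 1.0, 1.0, 1.0]

print(vfmadd(a, b, c))     # [11.0, 21.0, 31.0, 41.0]
print(vfnmadd(a, b, c))    # [-9.0, -19.0, -29.0, -39.0]
print(vfmaddsub(a, b, c))  # [9.0, 21.0, 29.0, 41.0]
print(vfmsubadd(a, b, c))  # [11.0, 19.0, 31.0, 39.0]
```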

Note that VFNMADD computes result = − a · b + c, not result = − (a · b + c).
The explicit order of operands is encoded in the mnemonic using the numbers "132", "213", and "231":
Postfix 1  Operation      Possible memory operand  Overwrites
132        a = a · c + b  c (factor)               a (other factor)
213        a = b · a + c  c (summand)              a (factor)
231        a = b · c + a  c (factor)               a (summand)

A second postfix encodes the operand format (packed or scalar) and the precision (single or double):
Postfix 2  Precision  Size        Postfix 2  Precision  Size
SS         Single     32 bit      SD         Double     64 bit
PSx        Single     4× 32 bit   PDx        Double     2× 64 bit
PSy        Single     8× 32 bit   PDy        Double     4× 64 bit
PSz        Single     16× 32 bit  PDz        Double     8× 64 bit
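The naming scheme described above (operation, operand order, then format and precision) can be sketched as a small decoder. This is an illustrative helper only: the function decode_fma3_mnemonic and the dictionary layout are assumptions made for this example, and the x, y and z suffixes are taken to mean 128-, 256- and 512-bit vectors, as implied by the element counts in the table.

```python
import re

# Operand orders as given in the "Postfix 1" table.
ORDERS = {
    "132": "a = a * c + b",
    "213": "a = b * a + c",
    "231": "a = b * c + a",
}
ELEMENT_BITS = {"S": 32, "D": 64}               # single / double precision
VECTOR_BITS = {"x": 128, "y": 256, "z": 512}    # inferred from the size column

_PATTERN = re.compile(
    r"(VFMADDSUB|VFMSUBADD|VFN?MADD|VFN?MSUB)"  # operation
    r"(132|213|231)"                            # postfix 1: operand order
    r"(SS|SD|PS[xyz]|PD[xyz])"                  # postfix 2: format and precision
)


def decode_fma3_mnemonic(mnemonic: str) -> dict:
    """Split a mnemonic such as 'VFMADD132PDy' into its parts."""
    match = _PATTERN.fullmatch(mnemonic)
    if match is None:
        raise ValueError(f"not a recognised FMA3 mnemonic: {mnemonic!r}")
    base, order, fmt = match.groups()
    packed = fmt[0] == "P"
    element_bits = ELEMENT_BITS[fmt[1]]
    lanes = VECTOR_BITS[fmt[2]] // element_bits if packed else 1
    return {
        "base": base,
        "order": ORDERS[order],
        "packed": packed,
        "element_bits": element_bits,
        "lanes": lanes,
    }


print(decode_fma3_mnemonic("VFMADD132PDy"))
print(decode_fma3_mnemonic("VFNMSUB231SS"))
```

For "VFMADD132PDy" the sketch reports a packed operation on four 64-bit elements with order a = a * c + b, which agrees with the PDy row of the table.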
For VFMADD, this results in the following instructions:
Encoding                  Mnemonic      Operands            Operation
VEX.256.66.0F38.W1 98 /r  VFMADD132PDy  ymm, ymm, ymm/m256  a = a · c + b
VEX.256.66.0F38.W0 98 /r  VFMADD132PSy  ymm, ymm, ymm/m256  a = a · c + b
VEX.128.66.0F38.W1 98 /r  VFMADD132PDx  xmm, xmm, xmm/m128  a = a · c + b
VEX.128.66.0F38.W0 98 /r  VFMADD132PSx  xmm, xmm, xmm/m128  a = a · c + b
VEX.LIG.66.0F38.W1 99 /r  VFMADD132SD   xmm, xmm, xmm/m64   a = a · c + b
VEX.LIG.66.0F38.W0 99 /r  VFMADD132SS   xmm, xmm, xmm/m32   a = a · c + b
VEX.256.66.0F38.W1 A8 /r  VFMADD213PDy  ymm, ymm, ymm/m256  a = b · a + c
VEX.256.66.0F38.W0 A8 /r  VFMADD213PSy  ymm, ymm, ymm/m256  a = b · a + c
VEX.128.66.0F38.W1 A8 /r  VFMADD213PDx  xmm, xmm, xmm/m128  a = b · a + c
VEX.128.66.0F38.W0 A8 /r  VFMADD213PSx  xmm, xmm, xmm/m128  a = b · a + c
VEX.LIG.66.0F38.W1 A9 /r  VFMADD213SD   xmm, xmm, xmm/m64   a = b · a + c
VEX.LIG.66.0F38.W0 A9 /r  VFMADD213SS   xmm, xmm, xmm/m32   a = b · a + c
VEX.256.66.0F38.W1 B8 /r  VFMADD231PDy  ymm, ymm, ymm/m256  a = b · c + a
VEX.256.66.0F38.W0 B8 /r  VFMADD231PSy  ymm, ymm, ymm/m256  a = b · c + a
VEX.128.66.0F38.W1 B8 /r  VFMADD231PDx  xmm, xmm, xmm/m128  a = b · c + a
VEX.128.66.0F38.W0 B8 /r  VFMADD231PSx  xmm, xmm, xmm/m128  a = b · c + a
VEX.LIG.66.0F38.W1 B9 /r  VFMADD231SD   xmm, xmm, xmm/m64   a = b · c + a
VEX.LIG.66.0F38.W0 B9 /r  VFMADD231SS   xmm, xmm, xmm/m32   a = b · c + a
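The VFMADD encodings listed above follow a regular pattern: the operand order selects the opcode byte (98, A8 or B8 for packed forms, one higher for scalar forms), W1 or W0 selects double or single precision, and the vector length field is 256, 128 or LIG. The sketch below reproduces only that pattern as read from the table; the function name vfmadd_encoding is hypothetical, and nothing beyond the 18 listed rows is assumed to be covered.

```python
# Packed opcode byte per operand order, as read from the encoding table.
PACKED_OPCODE = {"132": 0x98, "213": 0xA8, "231": 0xB8}
VECTOR_LENGTH = {"x": "128", "y": "256"}  # only the lengths that appear in the table


def vfmadd_encoding(order: str, fmt: str) -> str:
    """Rebuild the encoding string for VFMADD<order><fmt>.

    order is '132', '213' or '231'; fmt is one of
    'PDy', 'PSy', 'PDx', 'PSx', 'SD', 'SS'.
    """
    packed = fmt[0] == "P"
    w_bit = 1 if fmt[1] == "D" else 0            # W1 = double, W0 = single
    length = VECTOR_LENGTH[fmt[2]] if packed else "LIG"
    opcode = PACKED_OPCODE[order] + (0 if packed else 1)  # scalar forms use the next byte
    return f"VEX.{length}.66.0F38.W{w_bit} {opcode:02X} /r"


for order in ("132", "213", "231"):
    for fmt in ("PDy", "PSy", "PDx", "PSx", "SD", "SS"):
        print(f"{vfmadd_encoding(order, fmt):26}VFMADD{order}{fmt}")
```

Running the loop prints the same 18 encoding and mnemonic pairs as the table, in the same order.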
Mnemonic (AT&T)  Operands                      Operation
VFMADDPDx        xmm, xmm, xmm/m128, xmm/m128  a = b · c + d
VFMADDPDy        ymm, ymm, ymm/m256, ymm/m256  a = b · c + d
VFMADDPSx        xmm, xmm, xmm/m128, xmm/m128  a = b · c + d
VFMADDPSy        ymm, ymm, ymm/m256, ymm/m256  a = b · c + d
VFMADDSD         xmm, xmm, xmm/m64, xmm/m64    a = b · c + d
VFMADDSS         xmm, xmm, xmm/m32, xmm/m32    a = b · c + d
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time. The history can be summarized as follows:
Different compilers provide different levels of support for FMA: