In computing, minifloats are floating-point values represented with very few bits. Predictably, they are not well suited for general-purpose numerical calculations. They are used for special purposes such as computer graphics and machine learning.
Additionally, they are frequently encountered as a pedagogical tool in computer-science courses to demonstrate the properties and structures of floating-point arithmetic and IEEE 754 numbers.
Minifloats with 16 bits are half-precision numbers (as opposed to single and double precision). There are also minifloats with 8 bits or even fewer.^{[2]}
Minifloats can be designed following the principles of the IEEE 754 standard. In this case they must obey the (not explicitly written) rules for the boundary between subnormal and normal numbers and must have special patterns for infinity and NaN. Normalized numbers are stored with a biased exponent. The 2008 revision of the standard, IEEE 754-2008, introduced a 16-bit binary minifloat format.
A minifloat is usually described using a tuple of four numbers, (S, E, M, B): S is the number of sign bits, E the number of exponent bits, M the number of mantissa (significand) bits, and B the exponent bias.
A minifloat format denoted by (S, E, M, B) is, therefore, S + E + M bits long. The (S, E, M, B) notation can be converted to a (B, P, L, U) format as (2, M + 1, B + 1, 2^{S} − B) (with IEEE use of exponents).
sign  exponent  significand
0     0 0 0 0   0 0 0
A minifloat in 1 byte (8 bits) with 1 sign bit, 4 exponent bits and 3 significand bits (in short, a 1.4.3 minifloat) is demonstrated here. The exponent bias is defined as 7, centering the values around 1 to match other IEEE 754 floats,^{[3]}^{[4]} so (for most values) the actual multiplier for exponent x is 2^{x−7}. All IEEE 754 principles apply.^{[5]}
Numbers in a different base are marked as ..._{base}, for example, 101_{2} = 5. The bit patterns have spaces to visualize their parts.
Zero is represented by a zero exponent with a zero mantissa. A zero exponent means zero is a subnormal number with a leading "0." prefix, and with a zero mantissa all bits after the binary point are zero, so this value is interpreted as 0. Floating-point numbers use a signed zero, so −0 is also available and is equal to positive 0.
0 0000 000 = 0
1 0000 000 = −0
The significand is extended with "0." and the exponent value is treated as 1, the same as for the least normalized number:
0 0000 001 = 0.001_{2} × 2^{1−7} = 0.125 × 2^{−6} = 0.001953125 (least subnormal number)
...
0 0000 111 = 0.111_{2} × 2^{1−7} = 0.875 × 2^{−6} = 0.013671875 (greatest subnormal number)
The significand is extended with "1.":
0 0001 000 = 1.000_{2} × 2^{1−7} = 1 × 2^{−6} = 0.015625 (least normalized number)
0 0001 001 = 1.001_{2} × 2^{1−7} = 1.125 × 2^{−6} = 0.017578125
...
0 0111 000 = 1.000_{2} × 2^{7−7} = 1 × 2^{0} = 1
0 0111 001 = 1.001_{2} × 2^{7−7} = 1.125 × 2^{0} = 1.125 (least value above 1)
...
0 1110 000 = 1.000_{2} × 2^{14−7} = 1.000 × 2^{7} = 128
0 1110 001 = 1.001_{2} × 2^{14−7} = 1.125 × 2^{7} = 144
...
0 1110 110 = 1.110_{2} × 2^{14−7} = 1.750 × 2^{7} = 224
0 1110 111 = 1.111_{2} × 2^{14−7} = 1.875 × 2^{7} = 240 (greatest normalized number)
Infinity values have the highest exponent, with the mantissa set to zero. The sign bit can be either positive or negative.
0 1111 000 = +infinity
1 1111 000 = −infinity
NaN values have the highest exponent, with a nonzero mantissa. With a 1-bit sign and a 3-bit mantissa, this format has 2 × (2^{3} − 1) = 14 NaN bit patterns.
s 1111 mmm = NaN (if mmm ≠ 000)
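The decoding rules above (zero, subnormals, normals, infinity and NaN) can be collected into a short Python sketch; the function name decode_143 is my own:

```python
def decode_143(byte):
    """Decode an 8-bit 1.4.3 minifloat (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 3) & 0b1111
    mantissa = byte & 0b111
    if exponent == 0b1111:                 # highest exponent: infinity or NaN
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:                      # subnormal: "0." prefix, exponent treated as 1
        return sign * (mantissa / 8) * 2.0 ** (1 - 7)
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)   # normal: "1." prefix
```

For example, decode_143(0b0_0000_001) gives the least subnormal number 0.001953125, and decode_143(0b0_1110_111) gives the greatest normalized number 240.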
This is a chart of all possible values for this example 8bit float.
… 000  … 001  … 010  … 011  … 100  … 101  … 110  … 111  

0 0000 …  0  0.001953125  0.00390625  0.005859375  0.0078125  0.009765625  0.01171875  0.013671875 
0 0001 …  0.015625  0.017578125  0.01953125  0.021484375  0.0234375  0.025390625  0.02734375  0.029296875 
0 0010 …  0.03125  0.03515625  0.0390625  0.04296875  0.046875  0.05078125  0.0546875  0.05859375 
0 0011 …  0.0625  0.0703125  0.078125  0.0859375  0.09375  0.1015625  0.109375  0.1171875 
0 0100 …  0.125  0.140625  0.15625  0.171875  0.1875  0.203125  0.21875  0.234375 
0 0101 …  0.25  0.28125  0.3125  0.34375  0.375  0.40625  0.4375  0.46875 
0 0110 …  0.5  0.5625  0.625  0.6875  0.75  0.8125  0.875  0.9375 
0 0111 …  1  1.125  1.25  1.375  1.5  1.625  1.75  1.875 
0 1000 …  2  2.25  2.5  2.75  3  3.25  3.5  3.75 
0 1001 …  4  4.5  5  5.5  6  6.5  7  7.5 
0 1010 …  8  9  10  11  12  13  14  15 
0 1011 …  16  18  20  22  24  26  28  30 
0 1100 …  32  36  40  44  48  52  56  60 
0 1101 …  64  72  80  88  96  104  112  120 
0 1110 …  128  144  160  176  192  208  224  240 
0 1111 …  Inf  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
1 0000 …  −0  −0.001953125  −0.00390625  −0.005859375  −0.0078125  −0.009765625  −0.01171875  −0.013671875 
1 0001 …  −0.015625  −0.017578125  −0.01953125  −0.021484375  −0.0234375  −0.025390625  −0.02734375  −0.029296875 
1 0010 …  −0.03125  −0.03515625  −0.0390625  −0.04296875  −0.046875  −0.05078125  −0.0546875  −0.05859375 
1 0011 …  −0.0625  −0.0703125  −0.078125  −0.0859375  −0.09375  −0.1015625  −0.109375  −0.1171875 
1 0100 …  −0.125  −0.140625  −0.15625  −0.171875  −0.1875  −0.203125  −0.21875  −0.234375 
1 0101 …  −0.25  −0.28125  −0.3125  −0.34375  −0.375  −0.40625  −0.4375  −0.46875 
1 0110 …  −0.5  −0.5625  −0.625  −0.6875  −0.75  −0.8125  −0.875  −0.9375 
1 0111 …  −1  −1.125  −1.25  −1.375  −1.5  −1.625  −1.75  −1.875 
1 1000 …  −2  −2.25  −2.5  −2.75  −3  −3.25  −3.5  −3.75 
1 1001 …  −4  −4.5  −5  −5.5  −6  −6.5  −7  −7.5 
1 1010 …  −8  −9  −10  −11  −12  −13  −14  −15 
1 1011 …  −16  −18  −20  −22  −24  −26  −28  −30 
1 1100 …  −32  −36  −40  −44  −48  −52  −56  −60 
1 1101 …  −64  −72  −80  −88  −96  −104  −112  −120 
1 1110 …  −128  −144  −160  −176  −192  −208  −224  −240 
1 1111 …  −Inf  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
There are only 242 different non-NaN values (if +0 and −0 are regarded as different), because 14 of the 256 bit patterns represent NaNs.
Tables like the above can be generated for any combination of SEMB values using a script in Python or in GDScript.
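A minimal Python sketch of such a script might look as follows; the function and parameter names are my own, and the defaults reproduce the 1.4.3 example above:

```python
def minifloat_value(bits, E, M, bias):
    """Interpret an integer bit pattern as a 1-sign, E-exponent, M-mantissa minifloat."""
    sign = -1.0 if (bits >> (E + M)) & 1 else 1.0
    exp = (bits >> M) & ((1 << E) - 1)
    mant = bits & ((1 << M) - 1)
    if exp == (1 << E) - 1:                       # all-ones exponent: Inf or NaN
        return sign * float("inf") if mant == 0 else float("nan")
    if exp == 0:                                  # subnormal: 0.mant × 2^(1 − bias)
        return sign * mant * 2.0 ** (1 - bias - M)
    return sign * (1 + mant / (1 << M)) * 2.0 ** (exp - bias)

def print_table(E=4, M=3, bias=7):
    """Print one row per sign/exponent prefix, one column per mantissa value."""
    for prefix in range(2 << E):                  # sign and exponent bits together
        row = [minifloat_value((prefix << M) | m, E, M, bias) for m in range(1 << M)]
        print(format(prefix, f"0{E + 1}b"), row)

print_table()
```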
At these small sizes, other bias values may be interesting. For instance, a bias of −2 makes the bit patterns 0 to 16 represent the integers 0 to 16, at the cost that no non-integer values can be represented.
0 0000 000 = 0.000_{2} × 2^{1−(−2)} = 0.0 × 2^{3} = 0 (subnormal number)
0 0000 001 = 0.001_{2} × 2^{1−(−2)} = 0.125 × 2^{3} = 1 (subnormal number)
0 0000 111 = 0.111_{2} × 2^{1−(−2)} = 0.875 × 2^{3} = 7 (subnormal number)
0 0001 000 = 1.000_{2} × 2^{1−(−2)} = 1.000 × 2^{3} = 8 (normalized number)
0 0001 111 = 1.111_{2} × 2^{1−(−2)} = 1.875 × 2^{3} = 15 (normalized number)
0 0010 000 = 1.000_{2} × 2^{2−(−2)} = 1.000 × 2^{4} = 16 (normalized number)
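That the low bit patterns coincide with the integers under a bias of −2 can be checked directly; a small sketch with a hypothetical helper (decode_143_biased is my own name) that parameterizes the bias:

```python
def decode_143_biased(byte, bias):
    """Decode a 1.4.3 minifloat with a configurable exponent bias."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0b1111
    mant = byte & 0b111
    if exp == 0b1111:                              # Inf / NaN
        return sign * float("inf") if mant == 0 else float("nan")
    if exp == 0:                                   # subnormal
        return sign * (mant / 8) * 2.0 ** (1 - bias)
    return sign * (1 + mant / 8) * 2.0 ** (exp - bias)

# With bias −2, the bit patterns 0–16 decode to the integers 0–16:
assert all(decode_143_biased(n, -2) == n for n in range(17))
```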
The graphic demonstrates the addition of even smaller (1.3.2.3) minifloats with 6 bits. This floating-point system follows the rules of IEEE 754 exactly. NaN as an operand always produces a NaN result. Inf − Inf and (−Inf) + Inf result in NaN too (green area). Inf can be increased or decreased by finite values without change. Sums of finite operands can give an infinite result (e.g. 14.0 + 3.0 = +Inf is the cyan area; −Inf is the magenta area). The range of the finite operands is filled with the curves x + y = c, where c is always one of the representable float values (blue and red for positive and negative results respectively).
The other arithmetic operations can be illustrated similarly.
The Radeon R300 and R420 GPUs used an "fp24" floating-point format with 7 bits of exponent and 16 bits (+1 implicit) of mantissa.^{[6]} "Full Precision" in Direct3D 9.0 is a proprietary 24-bit floating-point format. Microsoft's D3D9 (Shader Model 2.0) graphics API initially supported both FP24 (as in ATI's R300 chip) and FP32 (as in Nvidia's NV30 chip) as "Full Precision", as well as FP16 as "Partial Precision", for vertex and pixel shader calculations performed by the graphics hardware.
Khronos defines 10-bit and 11-bit float formats for use with Vulkan. Both formats have no sign bit and a 5-bit exponent. The 10-bit format has a 5-bit mantissa, and the 11-bit format has a 6-bit mantissa.^{[7]}^{[8]}
IEEE SA Working Group P3109 is currently working on a standard for 8-bit minifloats optimized for machine learning. The current draft defines not one format, but a family of 7 different formats, named "binary8pP", where P is a number from 1 to 7. These floats are designed to be compact and efficient, but do not follow the same semantics as other IEEE floats, and are missing features such as negative zero and multiple NaN values. Infinity is encoded with both the exponent and significand all ones, unlike other IEEE floats, where the exponent is all ones and the significand is all zeroes.^{[9]}
The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with a 1-bit sign, 2-bit exponent, and 1-bit mantissa.^{[10]} In the table below, the columns have different values for the sign and mantissa bits, and the rows are different values for the exponent bits.
0 … 0  0 … 1  1 … 0  1 … 1  

… 00 …  0  0.5  −0  −0.5 
… 01 …  1  1.5  −1  −1.5 
… 10 …  2  3  −2  −3 
… 11 …  Inf  NaN  −Inf  NaN 
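The 4-bit table can be reproduced with a short Python sketch (decode_121 is my own name); with 2 exponent bits the bias works out to 1, so the subnormal row holds 0 and 0.5 and the normal rows start at 1:

```python
def decode_121(nibble):
    """Decode a 4-bit minifloat: 1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1."""
    sign = -1.0 if (nibble >> 3) & 1 else 1.0
    exp = (nibble >> 1) & 0b11
    mant = nibble & 1
    if exp == 0b11:                       # all-ones exponent: Inf or NaN
        return sign * float("inf") if mant == 0 else float("nan")
    if exp == 0:                          # subnormal: 0.mant × 2^(1 − 1)
        return sign * (mant / 2)
    return sign * (1 + mant / 2) * 2.0 ** (exp - 1)
```

Enumerating decode_121(n) for n = 0 to 5 yields 0, 0.5, 1, 1.5, 2, 3, matching the positive half of the table, with Inf and NaN at n = 6 and 7.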
If normalized numbers are not required, the size can be reduced to 3 bits by shrinking the exponent to 1 bit.
0 … 0  0 … 1  1 … 0  1 … 1  

… 0 …  0  1  −0  −1 
… 1 …  Inf  NaN  −Inf  NaN 
In situations where the sign bit can be excluded, each of the above examples can be reduced by one further bit, keeping only the left half of the above tables. A 2-bit float with a 1-bit exponent and a 1-bit mantissa would have only the values 0, 1, Inf and NaN.
If the mantissa is allowed to be 0 bits wide, a 1-bit float format would have a 1-bit exponent, and the only two values would be 0 and Inf. The exponent must be at least 1 bit wide, or else the format no longer makes sense as a float (it would just be a signed number).
4-bit floating-point numbers — without the four special IEEE values — have found use in accelerating large language models.^{[11]}
Minifloats are also commonly used in embedded devices,^{[citation needed]} especially on microcontrollers where floating-point arithmetic must be emulated in software. To speed up the computation, the mantissa typically occupies exactly half of the bits, so the register boundary automatically separates the parts without shifting.