Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions.
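The conversion error can be illustrated with a minimal Python sketch. It assumes the standard-library `decimal` module as the decimal floating-point implementation and the built-in `float` type as the binary one; the variable names are illustrative only.

```python
from decimal import Decimal

# Binary floating point: 0.1, 0.2 and 0.3 have no exact base-2 representation,
# so each literal is rounded on input and the comparison fails.
binary_sum = 0.1 + 0.2
binary_matches = (binary_sum == 0.3)

# Decimal floating point: the same decimal fractions are stored exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_matches = (decimal_sum == Decimal("0.3"))

print(binary_matches, decimal_matches)
```

The decimal operands are built from strings; constructing them from `float` values would carry the binary conversion error into the decimal result.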
The advantage of decimal floating-point representation over decimal fixed-point and integer representation is that it supports a much wider range of values. For example, while a fixed-point representation that allocates 8 decimal digits and 2 decimal places can represent the numbers 123456.78, 8765.43, 123.00, and so on, a floating-point representation with 8 decimal digits could also represent 1.2345678, 1234567.8, 0.000012345678, 12345678000000000, and so on. This wider range can dramatically slow the accumulation of rounding errors during successive calculations; for example, the Kahan summation algorithm can be used in floating point to add many numbers with no asymptotic accumulation of rounding error.
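A minimal sketch of Kahan (compensated) summation follows. It uses Python's binary `float` type purely for illustration, since the technique is independent of the base; the function names `naive_sum` and `kahan_sum` and the test data are assumptions of this sketch. An explicit loop is used for the naive sum because the built-in `sum` may itself apply compensation in some Python versions.

```python
def naive_sum(values):
    """Add the values left to right with no error compensation."""
    total = 0.0
    for value in values:
        total += value
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error of each addition forward."""
    total = 0.0
    compensation = 0.0                    # running estimate of the lost low-order part
    for value in values:
        y = value - compensation          # apply the correction to the next term
        t = total + y                     # low-order digits of y may be lost here
        compensation = (t - total) - y    # recover what was lost (negated)
        total = t
    return total


# One large term followed by many terms too small to change it individually.
values = [1.0] + [1e-16] * 10_000
naive = naive_sum(values)
compensated = kahan_sum(values)
print(naive, compensated)
```

In this data set every small term is absorbed by the naive loop, whereas the compensated loop accumulates them and ends close to 1 + 10^{−12}.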
Early mechanical uses of decimal floating point are evident in the abacus, slide rule, the Smallwood calculator, and some other calculators that support entries in scientific notation. In the case of the mechanical calculators, the exponent is often treated as side information that is accounted for separately.
The IBM 650 computer supported an 8-digit decimal floating-point format in 1953.^{[1]} The otherwise binary Wang VS machine supported a 64-bit decimal floating-point format in 1977.^{[2]} The floating-point support library for the Motorola 68040 processor provided a 96-bit decimal floating-point storage format in 1990.^{[2]}
Some computer languages have implementations of decimal floating-point arithmetic, including PL/I, C#, Java with BigDecimal, Emacs with calc, and Python's decimal module. In 1987, the IEEE released IEEE 854, a standard for computing with decimal floating point, which lacked a specification for how floating-point data should be encoded for interchange with other systems. This was subsequently addressed in IEEE 754-2008, which standardized the encoding of decimal floating-point data, albeit with two different alternative methods.
IBM POWER6 and newer POWER processors include DFP in hardware, as does the IBM System z9^{[3]} (and later zSeries machines). SilMinds offers SilAx, a configurable vector DFP coprocessor.^{[4]} IEEE 754-2008 defines this in more detail. Fujitsu also has 64-bit SPARC processors with DFP in hardware.^{[5]}^{[2]}
Microsoft C#, or .NET, uses System.Decimal.^{[6]}
The IEEE 754-2008 standard defines 32-, 64- and 128-bit decimal floating-point representations. Like the binary floating-point formats, the number is divided into a sign, an exponent, and a significand. Unlike binary floating-point, numbers are not necessarily normalized; values with few significant digits have multiple possible representations: 1×10^{2}=0.1×10^{3}=0.01×10^{4}, etc. When the significand is zero, the exponent can be any value at all.
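The existence of several representations for one value can be observed with Python's `decimal` module, used here only as an illustration. That module stores the coefficient as an integer, so this sketch writes the same value as 1×10^{2}, 10×10^{1} and 100×10^{0}; the variable names are illustrative.

```python
from decimal import Decimal

# Three members of the same "cohort": equal in value, different in
# coefficient and exponent.
a = Decimal("1E+2")
b = Decimal("10E+1")
c = Decimal("100")

same_value = (a == b == c)
representations = [x.as_tuple() for x in (a, b, c)]   # (sign, digits, exponent)

# A zero coefficient keeps whatever exponent it was given.
zero_a = Decimal("0E+5")
zero_b = Decimal("0E-3")

print(same_value, representations)
print(zero_a == zero_b, zero_a.as_tuple().exponent, zero_b.as_tuple().exponent)
```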
| Format | decimal32 | decimal64 | decimal128 | decimal(32k) |
| --- | --- | --- | --- | --- |
| Sign field (bits) | 1 | 1 | 1 | 1 |
| Combination field (bits) | 5 | 5 | 5 | 5 |
| Exponent continuation field (bits) | 6 | 8 | 12 | w = 2×k + 4 |
| Coefficient continuation field (bits) | 20 | 50 | 110 | t = 30×k−10 |
| Total size (bits) | 32 | 64 | 128 | 32×k |
| Coefficient size (decimal digits) | 7 | 16 | 34 | p = 3×t/10+1 = 9×k−2 |
| Exponent range | 192 | 768 | 12288 | 3×2^{w} = 48×4^{k} |
| Emax (largest value is 9.99...×10^{Emax}) | 96 | 384 | 6144 | Emax = 3×2^{w−1} |
| Emin (smallest normalized value is 1.00...×10^{Emin}) | −95 | −383 | −6143 | Emin = 1−Emax |
| Etiny (smallest nonzero value is 1×10^{Etiny}) | −101 | −398 | −6176 | Etiny = 2−p−Emax |
The exponent ranges were chosen so that the range available to normalized values is approximately symmetrical. Since this cannot be done exactly with an even number of possible exponent values, the extra value was given to Emax.
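The general formulas in the decimal(32k) column can be checked against the fixed-width columns with a short Python sketch. The function name `decimal_format_parameters` and the dictionary keys are illustrative assumptions, not part of the standard.

```python
def decimal_format_parameters(k):
    """Derive the parameters of a decimal format that is 32*k bits wide."""
    w = 2 * k + 4                # exponent continuation field (bits)
    t = 30 * k - 10              # coefficient continuation field (bits)
    p = 3 * t // 10 + 1          # coefficient size (decimal digits)
    emax = 3 * 2 ** (w - 1)
    return {
        "total_bits": 1 + 5 + w + t,     # sign + combination + continuations
        "digits": p,
        "exponent_range": 3 * 2 ** w,
        "emax": emax,
        "emin": 1 - emax,
        "etiny": 2 - p - emax,
    }


for k in (1, 2, 4):              # decimal32, decimal64, decimal128
    print(32 * k, decimal_format_parameters(k))
```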
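Two different representations are defined: one with a binary integer significand field, which encodes the significand as a single binary integer between 0 and 10^{p}−1, and one with a densely packed decimal significand field, which encodes the decimal digits more directly. Both alternatives provide exactly the same range of representable values.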
The most significant two bits of the exponent are limited to the range 0 to 2, and the most significant 4 bits of the significand are limited to the range 0 to 9. The 30 possible combinations are encoded in a 5-bit field, along with special forms for infinity and NaN.
If the most significant 4 bits of the significand are between 0 and 7, the encoded value begins as follows:
s 00mmm xxx   Exponent begins with 00, significand with 0mmm
s 01mmm xxx   Exponent begins with 01, significand with 0mmm
s 10mmm xxx   Exponent begins with 10, significand with 0mmm
If the leading 4 bits of the significand are binary 1000 or 1001 (decimal 8 or 9), the number begins as follows:
s 1100m xxx   Exponent begins with 00, significand with 100m
s 1101m xxx   Exponent begins with 01, significand with 100m
s 1110m xxx   Exponent begins with 10, significand with 100m
The leading bit (s in the above) is a sign bit, and the following bits (xxx in the above) encode the additional exponent bits and the remainder of the most significant digit, but the details vary depending on the encoding alternative used.
The final combinations are used for infinities and NaNs, and are the same for both alternative encodings:
s 11110 x   ±Infinity (see Extended real number line)
s 11111 0   quiet NaN (sign bit ignored)
s 11111 1   signaling NaN (sign bit ignored)
In the latter cases, all other bits of the encoding are ignored. Thus, it is possible to initialize an array to NaNs by filling it with a single byte value.
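The case analysis for the 5-bit field can be sketched as a small Python function. The name `decode_combination` and the tuple it returns are assumptions made for this illustration; the function takes the five bits as an integer from 0 to 31.

```python
def decode_combination(comb):
    """Split the 5-bit combination field into (kind, exponent MSBs, leading value).

    For finite numbers, the leading value is the 4-bit quantity 0..9 that
    starts the significand; for infinities and NaNs both fields are None.
    """
    if comb == 0b11110:
        return ("infinity", None, None)
    if comb == 0b11111:
        return ("nan", None, None)
    if comb >> 3 != 0b11:
        # Forms 00mmm, 01mmm, 10mmm: exponent bits first, then 0mmm.
        return ("finite", comb >> 3, comb & 0b111)
    # Forms 1100m, 1101m, 1110m: exponent bits follow the "11" marker,
    # and the leading value is 100m (8 or 9).
    return ("finite", (comb >> 1) & 0b11, 0b1000 | (comb & 1))


finite = [decode_combination(c) for c in range(32)
          if decode_combination(c)[0] == "finite"]
print(len(finite))   # number of (exponent MSBs, leading value) pairs
```

Enumerating all 32 field values yields the 30 distinct finite combinations described above, plus one pattern each for infinity and NaN.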
The binary integer significand format uses a binary significand from 0 to 10^{p}−1. For example, the Decimal32 significand can be up to 10^{7}−1 = 9999999 = 98967F_{16} = 100110001001011001111111_{2}. While the encoding can represent larger significands, they are illegal and the standard requires implementations to treat them as 0, if encountered on input.
As described above, the encoding varies depending on whether the most significant 4 bits of the significand are in the range 0 to 7 (0000_{2} to 0111_{2}), or higher (1000_{2} or 1001_{2}).
If the 2 bits after the sign bit are "00", "01", or "10", then the exponent field consists of the 8 bits following the sign bit (the 2 bits mentioned plus 6 bits of "exponent continuation field"), and the significand is the remaining 23 bits, with an implicit leading 0 bit, shown here in parentheses:
s 00eeeeee   (0)ttt tttttttttt tttttttttt
s 01eeeeee   (0)ttt tttttttttt tttttttttt
s 10eeeeee   (0)ttt tttttttttt tttttttttt
This includes subnormal numbers where the leading significand digit is 0.
If the 2 bits after the sign bit are "11", then the 8-bit exponent field is shifted 2 bits to the right (after both the sign bit and the "11" bits thereafter), and the represented significand is in the remaining 21 bits. In this case there is an implicit (that is, not stored) leading 3-bit sequence "100" in the true significand:
s 1100eeeeee   (100)t tttttttttt tttttttttt
s 1101eeeeee   (100)t tttttttttt tttttttttt
s 1110eeeeee   (100)t tttttttttt tttttttttt
The "11" 2-bit sequence after the sign bit indicates that there is an implicit "100" 3-bit prefix to the significand.
Note that the leading bits of the significand field do not encode the most significant decimal digit; they are simply part of a larger pure-binary number. For example, a significand of 8000000 is encoded as binary 011110100001001000000000, with the leading 4 bits encoding 7; the first significand which requires a 24th bit (and thus the second encoding form) is 2^{23} = 8388608.
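In the above cases, the value represented is (−1)^{sign} × 10^{exponent−101} × significand, where 101 is the Decimal32 exponent bias (the encoded exponent 0 corresponds to Etiny = −101).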
Decimal64 and Decimal128 operate analogously, but with larger exponent continuation and significand fields. For Decimal128, the second encoding form is actually never used; the largest valid significand of 10^{34}−1 = 1ED09BEAD87C0378D8E63FFFFFFFF_{16} can be represented in 113 bits.
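A Decimal32 decoder for the binary significand encoding can be sketched in Python as follows. The function name `decode_bid32` and the constants are assumptions of this sketch; the exponent bias of 101 is inferred from Etiny = −101, and the 32-bit pattern is passed as an unsigned integer.

```python
from decimal import Decimal

BIAS = 101               # assumed Decimal32 bias: encoded exponent 0 means 10**-101
MAX_COEFF = 10**7 - 1    # largest legal 7-digit significand


def decode_bid32(word):
    """Decode a 32-bit pattern in the binary significand encoding."""
    sign = (word >> 31) & 1
    comb = (word >> 26) & 0b11111
    if comb == 0b11110:
        return Decimal("-Infinity" if sign else "Infinity")
    if comb == 0b11111:
        # The bit after the combination field selects quiet or signaling NaN.
        return Decimal("sNaN" if (word >> 25) & 1 else "NaN")
    if (word >> 29) & 0b11 != 0b11:
        # First form: 8 exponent bits, then 23 significand bits (implicit 0).
        exponent = (word >> 23) & 0xFF
        significand = word & 0x7FFFFF
    else:
        # Second form: "11", 8 exponent bits, then 21 bits after implicit 100.
        exponent = (word >> 21) & 0xFF
        significand = (0b100 << 21) | (word & 0x1FFFFF)
    if significand > MAX_COEFF:
        significand = 0          # illegal significands are read as zero
    digits = tuple(int(d) for d in str(significand))
    return Decimal((sign, digits, exponent - BIAS))


print(decode_bid32(0x32800001))   # encoded exponent 101, significand 1
print(decode_bid32(0x77F8967F))   # second form, significand 9999999
```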
In the densely packed decimal version, the significand is stored as a series of decimal digits. The leading digit is between 0 and 9 (3 or 4 binary bits), and the rest of the significand uses the densely packed decimal (DPD) encoding.
The leading 2 bits of the exponent and the leading digit (3 or 4 bits) of the significand are combined into the five bits that follow the sign bit. This is followed by a fixed-offset exponent continuation field.
Finally, the significand continuation field is made of 2, 5, or 11 10-bit declets, each encoding 3 decimal digits.^{[7]}
If the first two bits after the sign bit are "00", "01", or "10", then those are the leading bits of the exponent, and the three bits after that are interpreted as the leading decimal digit (0 to 7):^{[8]}
  Comb.    Exponent      Significand
s 00 TTT   (00)eeeeee    (0TTT)[tttttttttt][tttttttttt]
s 01 TTT   (01)eeeeee    (0TTT)[tttttttttt][tttttttttt]
s 10 TTT   (10)eeeeee    (0TTT)[tttttttttt][tttttttttt]
If the first two bits after the sign bit are "11", then the second two bits are the leading bits of the exponent, and the last bit is prefixed with "100" to form the leading decimal digit (8 or 9):
  Comb.    Exponent      Significand
s 1100 T   (00)eeeeee    (100T)[tttttttttt][tttttttttt]
s 1101 T   (01)eeeeee    (100T)[tttttttttt][tttttttttt]
s 1110 T   (10)eeeeee    (100T)[tttttttttt][tttttttttt]
The remaining two combinations (11110 and 11111) of the 5-bit field are used to represent ±infinity and NaNs, respectively.
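The densely packed decimal layout can be sketched as a Decimal32 decoder in Python. The declet rules below follow the commonly published DPD decoding table, which is not reproduced in this section and is therefore an assumption of the sketch, as are the function names `decode_declet` and `decode_dpd32` and the exponent bias of 101 (inferred from Etiny = −101).

```python
from decimal import Decimal


def decode_declet(declet):
    """Decode one 10-bit DPD declet into an integer from 0 to 999."""
    p, q, r, s, t, u, v, w, x, y = ((declet >> i) & 1 for i in range(9, -1, -1))

    def small(a, b, c):          # digit 0..7 built from three bits
        return 4 * a + 2 * b + c

    def large(c):                # digit 8..9 built from one bit
        return 8 + c

    if v == 0:
        d2, d1, d0 = small(p, q, r), small(s, t, u), small(w, x, y)
    elif (w, x) == (0, 0):
        d2, d1, d0 = small(p, q, r), small(s, t, u), large(y)
    elif (w, x) == (0, 1):
        d2, d1, d0 = small(p, q, r), large(u), small(s, t, y)
    elif (w, x) == (1, 0):
        d2, d1, d0 = large(r), small(s, t, u), small(p, q, y)
    elif (s, t) == (0, 0):
        d2, d1, d0 = large(r), large(u), small(p, q, y)
    elif (s, t) == (0, 1):
        d2, d1, d0 = large(r), small(p, q, u), large(y)
    elif (s, t) == (1, 0):
        d2, d1, d0 = small(p, q, r), large(u), large(y)
    else:
        d2, d1, d0 = large(r), large(u), large(y)
    return 100 * d2 + 10 * d1 + d0


def decode_dpd32(word):
    """Decode a 32-bit pattern in the densely packed decimal encoding."""
    sign = (word >> 31) & 1
    comb = (word >> 26) & 0b11111
    if comb == 0b11110:
        return Decimal("-Infinity" if sign else "Infinity")
    if comb == 0b11111:
        return Decimal("sNaN" if (word >> 25) & 1 else "NaN")
    if comb >> 3 != 0b11:
        exp_msbs, leading = comb >> 3, comb & 0b111
    else:
        exp_msbs, leading = (comb >> 1) & 0b11, 8 + (comb & 1)
    exponent = (exp_msbs << 6) | ((word >> 20) & 0x3F)
    coefficient = (leading * 1000 + decode_declet((word >> 10) & 0x3FF)) * 1000
    coefficient += decode_declet(word & 0x3FF)
    digits = tuple(int(d) for d in str(coefficient))
    return Decimal((sign, digits, exponent - 101))


print(decode_dpd32(0x22500001), decode_dpd32(0x77F3FCFF))
```

Because 1024 bit patterns encode only 1000 digit triples, a few declets are redundant encodings; the sketch decodes them like any other pattern.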
The usual rule for performing floating-point arithmetic is that the exact mathematical value is calculated,^{[9]} and the result is then rounded to the nearest representable value in the specified precision. This is in fact the behavior mandated for IEEE-compliant computer hardware, under normal rounding behavior and in the absence of exceptional conditions.
For ease of presentation and understanding, 7-digit precision will be used in the examples. The fundamental principles are the same in any precision.
A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by 3 digits. We proceed with the usual addition method:
The following example is decimal, which simply means the base is 10.
123456.7 = 1.234567 × 10^{5}
101.7654 = 1.017654 × 10^{2} = 0.001017654 × 10^{5}
Hence:
123456.7 + 101.7654 = (1.234567 × 10^{5}) + (1.017654 × 10^{2})
                    = (1.234567 × 10^{5}) + (0.001017654 × 10^{5})
                    = 10^{5} × (1.234567 + 0.001017654)
                    = 10^{5} × 1.235584654
This is nothing other than converting to scientific notation. In detail:
  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to 7 digits and then normalized if necessary. The final result is:
e=5; s=1.235585 (final sum: 123558.5)
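The same addition can be reproduced with Python's `decimal` module by limiting a context to 7 significant digits. This is a sketch under the assumption that round-half-even is the "normal rounding behavior" referred to above; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

seven_digits = Context(prec=7, rounding=ROUND_HALF_EVEN)

x = Decimal("123456.7")
y = Decimal("101.7654")

exact_sum = x + y                       # default context (28 digits): no rounding needed
rounded_sum = seven_digits.add(x, y)    # exact sum rounded to 7 digits

print(exact_sum, rounded_sum)
```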
Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two nonzero numbers may be equal to one of them:
  e=5;  s=1.234567
+ e=−3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
--------------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding/normalization)
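This absorption of a small operand can be checked with a 7-digit context in Python's `decimal` module. The sketch relies on the module's default round-half-even mode, and the names are illustrative.

```python
from decimal import Context, Decimal

seven_digits = Context(prec=7)

big = Decimal("123456.7")          # e=5;  s=1.234567
small = Decimal("0.009876543")     # e=-3; s=9.876543

absorbed_sum = seven_digits.add(big, small)
unchanged = (absorbed_sum == big)  # the small operand left no trace

print(absorbed_sum, unchanged)
```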
Another problem of loss of significance occurs when approximations to two nearly equal numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are approximations to the rationals 123457.1467 and 123456.659.
  e=5;  s=1.234571
− e=5;  s=1.234567
----------------
  e=5;  s=0.000004
  e=−1; s=4.000000 (after rounding and normalization)
The floating-point difference is computed exactly because the numbers are close; the Sterbenz lemma guarantees this, even in case of underflow when gradual underflow is supported. Despite this, the difference of the original numbers is e = −1; s = 4.877000, which differs by more than 20% from the difference e = −1; s = 4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost.^{[10]}^{[11]} This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis; see also Accuracy problems.
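The cancellation can be reproduced in Python's `decimal` module by first rounding the two rationals to 7 digits and then subtracting. This is an illustrative sketch; the variable names are assumptions, and the relative error is measured against the difference of the approximations.

```python
from decimal import Context, Decimal

seven_digits = Context(prec=7)

# Round the original values to 7 significant digits on input.
a = seven_digits.create_decimal("123457.1467")   # becomes 123457.1
b = seven_digits.create_decimal("123456.659")    # becomes 123456.7

approx_diff = seven_digits.subtract(a, b)        # exact for these close operands
exact_diff = Decimal("123457.1467") - Decimal("123456.659")

relative_error = (exact_diff - approx_diff) / approx_diff
print(approx_diff, exact_diff, relative_error)
```

The subtraction itself introduces no error here; the discrepancy comes entirely from the rounding of the inputs.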
To multiply, the significands are multiplied, while the exponents are added, and the result is rounded and normalized.
  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)
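The multiplication can likewise be checked with a 7-digit context in Python's `decimal` module. This is a sketch with illustrative names; the module performs the rounding and normalization in a single step.

```python
from decimal import Context, Decimal

seven_digits = Context(prec=7)

x = Decimal("4734.612")      # e=3; s=4.734612
y = Decimal("541724.2")      # e=5; s=5.417242

exact_product = x * y                          # fits in the default 28 digits
rounded_product = seven_digits.multiply(x, y)  # rounded to 7 digits

print(exact_product, rounded_product)
```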
Division is done similarly, but is more complicated.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex.
Further information: Booth's multiplication algorithm and Division algorithm 