Casting (metalworking)

Double-precision floating-point format is a computer number format that occupies 8 bytes (64 bits) in computer memory and represents a wide dynamic range of values by using floating point.

Computers with 32-bit storage locations use two memory locations to store a 64-bit double-precision number (a single storage location can hold a single-precision number). Double-precision floating-point format usually refers to binary64, as specified by the IEEE 754 standard, not to the 64-bit decimal format decimal64.

Template:Floating-point

IEEE 754 double-precision binary floating-point format: binary64

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. As with single-precision floating-point format, it lacks precision on integer numbers when compared with an integer format of the same size. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)

This gives from 15–17 significant decimal digits precision. If a decimal string with at most 15 significant digits is converted to IEEE 754 double precision representation and then converted back to a string with the same number of significant digits, then the final string should match the original; and if an IEEE 754 double precision is converted to a decimal string with at least 17 significant digits and then converted back to double, then the final number must match the original.^[1]

The format is written with the significand having an implicit integer bit of value 1 (except for special datums, see the exponent encoding below). With the 52 bits of the fraction significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log₁₀(2) ≈ 15.955). The bits are laid out as follows:

The real value assumed by a given 64-bit double-precision datum with a given biased exponent e and a 52-bit fraction is

(-1)^{\text{sign}}(1.b_{51}b_{50}...b_{0})_{2}\times 2^{e-1023}

or

(-1)^{\text{sign}}(1+\sum _{i=1}^{52}b_{52-i}2^{-i})\times 2^{e-1023}

Between 2^{52=4,503,599,627,370,496 and 2^{53=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2^{53 to 2^{54, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2^{51 to 2^{52, the spacing is 0.5, etc.}}}}}}

The spacing as a fraction of the numbers in the range from 2^{n to 2^{n+1 is 2^{n−52.
The maximum relative rounding error when rounding a number to the nearest representable one (the machine epsilon) is therefore 2^−53.}}}

Exponent encoding

The double-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 1023; also known as exponent bias in the IEEE 754 standard. Examples of such representations would be:

E_min (1) = −1022
E (50) = −973
E_max (2046) = 1023

Thus, as defined by the offset-binary representation, in order to get the true exponent the exponent bias of 1023 has to be subtracted from the written exponent.

The exponents 000₁₆ and 7ff₁₆ have a special meaning:

000₁₆ is used to represent a signed zero (if M=0) and subnormals (if M≠0); and
7ff₁₆ is used to represent ∞ (if M=0) and NaNs (if M≠0),

where M is the fraction mantissa. All bit patterns are valid encoding.

Except for the above exceptions, the entire double-precision number is described by:

$(-1)^{\text{sign}}\times 2^{{\text{exponent}}-{\text{exponent bias}}}\times 1.{\text{mantissa}}$

In the case of subnormals(E=0) the double-precision number is described by:

$(-1)^{\text{sign}}\times 2^{1-{\text{exponent bias}}}\times 0.{\text{mantissa}}$

Double-precision examples

3ff0 0000 0000 0000₁₆   = 1
3ff0 0000 0000 0001₁₆   ≈ 1.0000000000000002, the smallest number > 1
3ff0 0000 0000 0002₁₆   ≈ 1.0000000000000004
4000 0000 0000 0000₁₆   = 2
c000 0000 0000 0000₁₆   = –2

0000 0000 0000 0001₁₆   = 2^−1022−52 = 2⁻¹⁰⁷⁴
                        ≈ 4.9406564584124654 × 10⁻³²⁴ (Min subnormal positive double)
000f ffff ffff ffff₁₆   = 2⁻¹⁰²² − 2^−1022−52 
                        ≈ 2.2250738585072009 × 10⁻³⁰⁸ (Max subnormal double)
0010 0000 0000 0000₁₆   = 2⁻¹⁰²² 
                        ≈ 2.2250738585072014 × 10⁻³⁰⁸ (Min normal positive double)
7fef ffff ffff ffff₁₆   = (1 + (1 − 2⁻⁵²)) × 2¹⁰²³ 
                        ≈ 1.7976931348623157 × 10³⁰⁸ (Max Double)

0000 0000 0000 0000₁₆   = 0
8000 0000 0000 0000₁₆   = –0

7ff0 0000 0000 0000₁₆   = ∞
fff0 0000 0000 0000₁₆   = −∞

3fd5 5555 5555 5555₁₆   ≈ 1/3

(1/3 rounds down instead of up like single precision, because of the odd number of bits in the significand.)

In more detail:

Given the hexadecimal representation 3FD5 5555 5555 5555₁₆,
  Sign = 0
  Exponent = 3FD₁₆ = 1021
  Exponent Bias = 1023 (constant value; see above)
  Significand = 5 5555 5555 5555₁₆
  Value = 2^{(Exponent − Exponent Bias)} × 1.Significand – Note the Significand must not be converted to decimal here
        = 2⁻² × (15 5555 5555 5555₁₆ × 2⁻⁵²)
        = 2⁻⁵⁴ × 15 5555 5555 5555₁₆
        = 0.333333333333333314829616256247390992939472198486328125
        ≈ 1/3

Execution speed with double-precision arithmetic

Using floating-point variables and mathematical functions (sin(), cos(), atan2(), log(), exp(), sqrt() are the most popular ones) of double precision as opposed to single precision comes at execution cost: the operations with double precision are usually slower. On average, on a PC of year 2012 build, calculations with double precision are 1.1–1.6 times slower than with single precision.Potter or Ceramic Artist Truman Bedell from Rexton, has interests which include ceramics, best property developers in singapore developers in singapore and scrabble. Was especially enthused after visiting Alejandro de Humboldt National Park.

One area of computing where this is a particular issue is for parallel code running on GPUs. For example when using NVIDIA's CUDA platform, calculations with double precision are 3 to 8 times slower than float.^[2]

Notes and references

43 year old Petroleum Engineer Harry from Deep River, usually spends time with hobbies and interests like renting movies, property developers in singapore new condominium and vehicle racing. Constantly enjoys going to destinations like Camino Real de Tierra Adentro.

[whyieee-1] Template:Cite web

[2] ttp://www.tomshardware.com/reviews/geforce-gtx-titan-gk110-review,3438-3.html

[1]

[2]

Casting (metalworking)

Contents

IEEE 754 double-precision binary floating-point format: binary64

Exponent encoding

Double-precision examples

Execution speed with double-precision arithmetic

See also

Notes and references

Navigation menu

Casting (metalworking)

IEEE 754 double-precision binary floating-point format: binary64

Exponent encoding

Double-precision examples

Execution speed with double-precision arithmetic

See also

Notes and references

Navigation menu

Search