Abstract
Presented here are algorithms for converting between (decimal) scientific-notation and (binary) IEEE-754 double-precision floating-point numbers. These algorithms are much simpler than those previously published. The values are stable under repeated conversions between the formats. The scientific representations generated have only the minimum number of mantissa digits needed to convert back to the original binary values.Introduction
In both How to Read Floating-Point Numbers Accurately[1990] and Printing floating-point numbers quickly and accurately[1996] the key issue is that successive rounding operations do not have the same effect as a single rounding operation.The algorithms from both papers are iterative and complicated. The read and write algorithms presented here do at most 2 and 4 bignum divisions, respectively. Over the range of IEEE-754 double-precision numbers, the largest intermediate bignum is 339 decimal digits (1126 bits). This is not large for bignums, being orders of magnitude smaller than the precisions which get speed benefits from FFT multiplication.
Bignums
Both reading and writing of floating-point numbers can involve division of numbers larger than can be stored in the floating-point registers, causing rounding at undesired points of the conversion. Bignums (arbitrary precision integers) can perform division of large integers without rounding. What is needed is a bignum division-with-rounding operator, calledround-quotient
here. It can be implementated in Scheme as
follows:
(define (round-quotient num den) (define quo (quotient num den)) (if ((if (even? quo) > >=) (* (abs (remainder num den)) 2) (abs den)) (+ quo (if (eqv? (negative? num) (negative? den)) 1 -1)) quo))If the remainder is more than half of the denominator, then it rounds up; if it is less, then it rounds down; if it is equal, then it rounds to even. These are the same rounding rules as IEEE Standard for Binary Floating-Point Arithmetic (754-1985) and the Scheme procedure
round
.
The algorithms presented here use the integer-length function from Common-Lisp and the SLIB Scheme Library:
— Function: integer-length n
The algorithms also use Returns the number of bits neccessary to represent n.For positive argument n,
Example:
(integer-length #b10101010) ⇒ 8 (integer-length 0) ⇒ 0 (integer-length #b1111) ⇒ 4
integer-length
returns
⌈log2 n⌉.
ldexp
(from C). Here is Scheme
code for ldexp
:
;; ldexp() from C. (define (ldexp bmant bexp) (define DBL_MIN_10_EXP -1074) (if (> DBL_MIN_10_EXP (* -2 (abs bexp))) (* (* 1. bmant (expt 2 (quotient bexp 2))) (expt 2 (- bexp (quotient bexp 2)))) (* 1. bmant (expt 2 bexp))))The conditional in
ldexp
is in order to work
with bexp values where 2bexp is out
of floating-point range.
Reading
The algorithm to convert a positive integer mantissa and integer exponent (of 10) to a floating-point number is straightforward.;; dbl-mant-dig is the number of bits in the mantissa field of the ;; floating-point format. (define dbl-mant-dig 53) ;; Given positive integer mantissa MANT and exponent (of 10) POINT, ;; find the closest binary floating-point number. ;; bex is the binary shift for quo=mant/scl; bex is negative. (define (mantexp->dbl mant point) (if (>= point 0) (exact->inexact (* mant (expt 10 point))) (let* ((scl (expt 10 (- point))) (bex (- (integer-length mant) (integer-length scl) dbl-mant-dig)) (num (* mant (expt 2 (- bex)))) (quo (round-quotient num scl))) (cond ((> (integer-length quo) dbl-mant-dig) ;too many bits of quotient (set! bex (+ 1 bex)) (set! quo (round-quotient num (* scl 2))))) (ldexp (exact->inexact quo) bex))))When point is non-negative, the mantissa is multiplied by 10point and the product is rounded to fit in
dbl-mant-dig
bits by exact->inexact
.
With a negative point, the mantissa will be multiplied by a power of 2, then divided by scl=10−point. Over the floating-point range, the longest a rounded quotient of a n bit number and a m bit power-of-10 can be is 1+n−m bits; the shortest is n−m bits.
2−bex is the power-of-two multiplier. The initial value of bex corresponds to the n−m case above. If the round-quotient is more than
dbl-mant-dig
bits long, then
call round-quotient
with double the denominator,
2⋅scl. In either case, the final step is to convert
to floating-point using ldexp
.
Writing
The algorithm for writing a floating-point number is more complicated because it must generate the shortest decimal mantissa which reads as the original floating-point input.The approximate number of decimal digits needed to represent a binary number is obtained by multiplying
dbl-mant-dig
by log210. Try to convert using the floor of this number
and test to see whether it would read as the original number. If
so, then return; if not, then try with one more decimal digit.
(define (exact-ceiling x) (inexact->exact (ceiling x))) ;; Given positive integer mantissa and exponent (of 2) E2, find the ;; closest integer and exponent (of 10). (define (dbl->string mant e2) (define llog2 (/ (log 2) (log 10))) (define f (ldexp mant e2)) (define quo 0) (define point 0) (if (> e2 0) (let ((num (* mant (expt 2 e2)))) (set! point (max 0 (exact-ceiling (- (/ (log num) (log 10)) (* dbl-mant-dig llog2))))) (let ((den (expt 10 point))) (set! quo (round-quotient num den)) (cond ((not (= (mantexp->dbl quo point) f)) (set! point (+ -1 point)) (set! quo (round-quotient num (quotient den 10))))))) (let ((den (expt 2 (- e2)))) (set! point (exact-ceiling (* e2 llog2))) (let ((num (* mant (expt 10 (- point))))) (set! quo (round-quotient num den)) (cond ((and (positive? f) (not (= (mantexp->dbl quo point) f))) (set! point (+ -1 point)) (set! quo (round-quotient (* num 10) den))))))) (let* ((dman (number->string quo)) (lman (string-length dman))) (do ((idx (+ -1 lman) (+ -1 idx))) ((or (zero? idx) (not (eqv? #\0 (string-ref dman idx)))) (string-append "." (substring dman 0 (+ 1 idx)) "e" (number->string (+ point lman)))))))When e2 is positive, num is bound to the product of mant and 2e2. point is set to an upper-bound of the number of decimal digits of num in excess of the floating-point mantissa's precision. The
round-quotient
of num and
10point produces the integer quo.
If mantexp->dbl
of quo
and point is not equal to the original floating-point
value f, then the round-quotient
is computed
again with the divisor divided by 10 yielding one more digit of
precision.
When e2 is negative, den is bound to 2−e2 and point is set to the negation of an upper-bound of the number of decimal digits in 2e2. num is bound to the product of mant and 10point. The
round-quotient
of num and den produces the
integer quo. If mantexp->dbl
of quo and point is not equal to the original
floating-point value f, then
the round-quotient
is computed again
with num multiplied by 10 yielding one more digit of
precision.
The last part of
dbl->string
produces a string using
Scheme's number->string
to convert the integer
mantissa and exponent components to decimal strings. Mantissa
trailing zeros are eliminated by scanning the dman string
in reverse for non-zero digits.
References:
-
[754]
IEEE 754-1985,
IEEE Standard for Binary Floating-Point Arithmetic. -
[1996]
Robert G. Burger and R. Kent Dybvig.
Printing floating-point numbers quickly and accurately.
Proceedings of the ACM SIGPLAN '96 Conference on Programming Language Design and Implementation, pages 108-116. -
[1990]
William Clinger.
How to read floating point numbers accurately.
Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 92-101.
Proceedings published as SIGPLAN Notices 25(6), June 1990.