Abstract
Presented here are algorithms for converting between (decimal) scientificnotation and (binary) IEEE754 doubleprecision floatingpoint numbers. These algorithms are much simpler than those previously published. The values are stable under repeated conversions between the formats. The scientific representations generated have only the minimum number of mantissa digits needed to convert back to the original binary values.Introduction
In both How to Read FloatingPoint Numbers Accurately[1990] and Printing floatingpoint numbers quickly and accurately[1996] the key issue is that successive rounding operations do not have the same effect as a single rounding operation.The algorithms from both papers are iterative and complicated. The read and write algorithms presented here do at most 2 and 4 bignum divisions, respectively. Over the range of IEEE754 doubleprecision numbers, the largest intermediate bignum is 339 decimal digits (1126 bits). This is not large for bignums, being orders of magnitude smaller than the precisions which get speed benefits from FFT multiplication.
Bignums
Both reading and writing of floatingpoint numbers can involve division of numbers larger than can be stored in the floatingpoint registers, causing rounding at undesired points of the conversion. Bignums (arbitrary precision integers) can perform division of large integers without rounding. What is needed is a bignum divisionwithrounding operator, calledroundquotient
here. It can be implementated in Scheme as
follows:
(define (roundquotient num den) (define quo (quotient num den)) (if ((if (even? quo) > >=) (* (abs (remainder num den)) 2) (abs den)) (+ quo (if (eqv? (negative? num) (negative? den)) 1 1)) quo))If the remainder is more than half of the denominator, then it rounds up; if it is less, then it rounds down; if it is equal, then it rounds to even. These are the same rounding rules as IEEE Standard for Binary FloatingPoint Arithmetic (7541985) and the Scheme procedure
round
.
The algorithms presented here use the integerlength function from CommonLisp and the SLIB Scheme Library:
— Function: integerlength n
The algorithms also use Returns the number of bits neccessary to represent n.For positive argument n,
Example:
(integerlength #b10101010) ⇒ 8 (integerlength 0) ⇒ 0 (integerlength #b1111) ⇒ 4
integerlength
returns
⌈log_{2} n⌉.
ldexp
(from C). Here is Scheme
code for ldexp
:
;; ldexp() from C. (define (ldexp bmant bexp) (define DBL_MIN_10_EXP 1074) (if (> DBL_MIN_10_EXP (* 2 (abs bexp))) (* (* 1. bmant (expt 2 (quotient bexp 2))) (expt 2 ( bexp (quotient bexp 2)))) (* 1. bmant (expt 2 bexp))))The conditional in
ldexp
is in order to work
with bexp values where 2^{bexp} is out
of floatingpoint range.
Reading
The algorithm to convert a positive integer mantissa and integer exponent (of 10) to a floatingpoint number is straightforward.;; dblmantdig is the number of bits in the mantissa field of the ;; floatingpoint format. (define dblmantdig 53) ;; Given positive integer mantissa MANT and exponent (of 10) POINT, ;; find the closest binary floatingpoint number. ;; bex is the binary shift for quo=mant/scl; bex is negative. (define (mantexp>dbl mant point) (if (>= point 0) (exact>inexact (* mant (expt 10 point))) (let* ((scl (expt 10 ( point))) (bex ( (integerlength mant) (integerlength scl) dblmantdig)) (num (* mant (expt 2 ( bex)))) (quo (roundquotient num scl))) (cond ((> (integerlength quo) dblmantdig) ;too many bits of quotient (set! bex (+ 1 bex)) (set! quo (roundquotient num (* scl 2))))) (ldexp (exact>inexact quo) bex))))When point is nonnegative, the mantissa is multiplied by 10^{point} and the product is rounded to fit in
dblmantdig
bits by exact>inexact
.
With a negative point, the mantissa will be multiplied by a power of 2, then divided by scl=10^{−point}. Over the floatingpoint range, the longest a rounded quotient of a n bit number and a m bit powerof10 can be is 1+n−m bits; the shortest is n−m bits.
2^{−bex} is the poweroftwo multiplier. The initial value of bex corresponds to the n−m case above. If the roundquotient is more than
dblmantdig
bits long, then
call roundquotient
with double the denominator,
2⋅scl. In either case, the final step is to convert
to floatingpoint using ldexp
.
Writing
The algorithm for writing a floatingpoint number is more complicated because it must generate the shortest decimal mantissa which reads as the original floatingpoint input.The approximate number of decimal digits needed to represent a binary number is obtained by multiplying
dblmantdig
by log_{2}10. Try to convert using the floor of this number
and test to see whether it would read as the original number. If
so, then return; if not, then try with one more decimal digit.
(define (exactceiling x) (inexact>exact (ceiling x))) ;; Given positive integer mantissa and exponent (of 2) E2, find the ;; closest integer and exponent (of 10). (define (dbl>string mant e2) (define llog2 (/ (log 2) (log 10))) (define f (ldexp mant e2)) (define quo 0) (define point 0) (if (> e2 0) (let ((num (* mant (expt 2 e2)))) (set! point (max 0 (exactceiling ( (/ (log num) (log 10)) (* dblmantdig llog2))))) (let ((den (expt 10 point))) (set! quo (roundquotient num den)) (cond ((not (= (mantexp>dbl quo point) f)) (set! point (+ 1 point)) (set! quo (roundquotient num (quotient den 10))))))) (let ((den (expt 2 ( e2)))) (set! point (exactceiling (* e2 llog2))) (let ((num (* mant (expt 10 ( point))))) (set! quo (roundquotient num den)) (cond ((and (positive? f) (not (= (mantexp>dbl quo point) f))) (set! point (+ 1 point)) (set! quo (roundquotient (* num 10) den))))))) (let* ((dman (number>string quo)) (lman (stringlength dman))) (do ((idx (+ 1 lman) (+ 1 idx))) ((or (zero? idx) (not (eqv? #\0 (stringref dman idx)))) (stringappend "." (substring dman 0 (+ 1 idx)) "e" (number>string (+ point lman)))))))When e2 is positive, num is bound to the product of mant and 2^{e2}. point is set to an upperbound of the number of decimal digits of num in excess of the floatingpoint mantissa's precision. The
roundquotient
of num and
10^{point} produces the integer quo.
If mantexp>dbl
of quo
and point is not equal to the original floatingpoint
value f, then the roundquotient
is computed
again with the divisor divided by 10 yielding one more digit of
precision.
When e2 is negative, den is bound to 2^{−e2} and point is set to the negation of an upperbound of the number of decimal digits in 2^{e2}. num is bound to the product of mant and 10^{point}. The
roundquotient
of num and den produces the
integer quo. If mantexp>dbl
of quo and point is not equal to the original
floatingpoint value f, then
the roundquotient
is computed again
with num multiplied by 10 yielding one more digit of
precision.
The last part of
dbl>string
produces a string using
Scheme's number>string
to convert the integer
mantissa and exponent components to decimal strings. Mantissa
trailing zeros are eliminated by scanning the dman string
in reverse for nonzero digits.
References:

[754]
IEEE 7541985,
IEEE Standard for Binary FloatingPoint Arithmetic. 
[1996]
Robert G. Burger and R. Kent Dybvig.
Printing floatingpoint numbers quickly and accurately.
Proceedings of the ACM SIGPLAN '96 Conference on Programming Language Design and Implementation, pages 108116. 
[1990]
William Clinger.
How to read floating point numbers accurately.
Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 92101.
Proceedings published as SIGPLAN Notices 25(6), June 1990.
No comments:
Post a Comment