Governance by those who do the work.

Saturday, October 5, 2013

Easy Accurate Reading and Writing of Floating-Point Numbers

http://people.csail.mit.edu/jaffer/III/EZFPRW

Abstract

Presented here are algorithms for converting between (decimal) scientific-notation and (binary) IEEE-754 double-precision floating-point numbers. These algorithms are much simpler than those previously published. The values are stable under repeated conversions between the formats. The scientific representations generated have only the minimum number of mantissa digits needed to convert back to the original binary values.

Introduction

In both How to Read Floating-Point Numbers Accurately [1990] and Printing Floating-Point Numbers Quickly and Accurately [1996], the key issue is that successive rounding operations do not have the same effect as a single rounding operation.
The algorithms from both papers are iterative and complicated. The read and write algorithms presented here do at most 2 and 4 bignum divisions, respectively. Over the range of IEEE-754 double-precision numbers, the largest intermediate bignum is 339 decimal digits (1126 bits). This is not large for bignums, being orders of magnitude smaller than the precisions which get speed benefits from FFT multiplication.

Bignums

Both reading and writing of floating-point numbers can involve division of numbers larger than can be stored in the floating-point registers, causing rounding at undesired points of the conversion. Bignums (arbitrary precision integers) can perform division of large integers without rounding. What is needed is a bignum division-with-rounding operator, called round-quotient here. It can be implemented in Scheme as follows:
;; Divide NUM by DEN, rounding the quotient to the nearest integer;
;; ties round to even.
(define (round-quotient num den)
  (define quo (quotient num den))
  (if ((if (even? quo) > >=) (* (abs (remainder num den)) 2) (abs den))
      (+ quo (if (eqv? (negative? num) (negative? den)) 1 -1))
      quo))
If the remainder is more than half of the denominator, then it rounds up; if it is less, then it rounds down; if it is equal, then it rounds to even. These are the same rounding rules as IEEE Standard for Binary Floating-Point Arithmetic (754-1985) and the Scheme procedure round.
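As a quick check of these rules, here are a few illustrative calls (not from the original article) using the round-quotient defined above:
        (round-quotient 13 4)  ⇒ 3    ; 3.25 rounds down
        (round-quotient 15 4)  ⇒ 4    ; 3.75 rounds up
        (round-quotient 5 2)   ⇒ 2    ; 2.5 ties to even
        (round-quotient 7 2)   ⇒ 4    ; 3.5 ties to even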
The algorithms presented here use the integer-length function from Common-Lisp and the SLIB Scheme Library:
— Function: integer-length n
Returns the number of bits necessary to represent n.
Example:
        (integer-length #b10101010)
        ⇒ 8
        (integer-length 0)
        ⇒ 0
        (integer-length #b1111)
        ⇒ 4

For positive argument n, integer-length returns ⌈log2(n+1)⌉, i.e. ⌊log2 n⌋ + 1.
The algorithms also use ldexp (from C). Here is Scheme code for ldexp:
;; ldexp() from C.
(define (ldexp bmant bexp)
  (define DBL_MIN_10_EXP -1074)
  (if (> DBL_MIN_10_EXP (* -2 (abs bexp)))
      (* (* 1. bmant (expt 2 (quotient bexp 2)))
         (expt 2 (- bexp (quotient bexp 2))))
      (* 1. bmant (expt 2 bexp))))
The conditional in ldexp handles bexp values for which 2^bexp is out of floating-point range by applying the power of two in two roughly equal halves.
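As a rough sanity check (illustrative calls, not from the original article; assuming exact integer arithmetic and IEEE-754 doubles with gradual underflow):
        (ldexp 1 1023)    ; ⇒ 2^1023  (≈ 8.99e307, near the top of the double range)
        (ldexp 1 -1074)   ; ⇒ 2^-1074 (≈ 4.94e-324, the smallest positive subnormal)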

Reading

The algorithm to convert a positive integer mantissa and integer exponent (of 10) to a floating-point number is straightforward.
;; dbl-mant-dig is the number of bits in the mantissa field of the
;; floating-point format.
(define dbl-mant-dig 53)

;; Given positive integer mantissa MANT and exponent (of 10) POINT,
;; find the closest binary floating-point number.
;; bex is the binary shift for quo=mant/scl; bex is negative.
(define (mantexp->dbl mant point)
  (if (>= point 0)
      (exact->inexact (* mant (expt 10 point)))
      (let* ((scl (expt 10 (- point)))
             (bex (- (integer-length mant) (integer-length scl) dbl-mant-dig))
             (num (* mant (expt 2 (- bex))))
             (quo (round-quotient num scl)))
        (cond ((> (integer-length quo) dbl-mant-dig) ; too many bits of quotient
               (set! bex (+ 1 bex))
               (set! quo (round-quotient num (* scl 2)))))
        (ldexp (exact->inexact quo) bex))))
When point is non-negative, the mantissa is multiplied by 10^point and the product is rounded to fit in dbl-mant-dig bits by exact->inexact.
With a negative point, the mantissa will be multiplied by a power of 2, then divided by scl = 10^(-point). Over the floating-point range, the longest a rounded quotient of an n-bit number and an m-bit power of 10 can be is 1+n−m bits; the shortest is n−m bits.
2^bex is the power-of-two multiplier. The initial value of bex corresponds to the n−m case above. If the round-quotient is more than dbl-mant-dig bits long, then call round-quotient with double the denominator, 2⋅scl. In either case, the final step is to convert to floating-point using ldexp.
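For instance (illustrative calls, not from the original article), reading 1×10^-1 should return the double nearest to 1/10:
        (mantexp->dbl 17 2)   ; ⇒ 1700.  (point non-negative: exact product, then exact->inexact)
        (mantexp->dbl 1 -1)   ; ⇒ 0.1    (the double 7205759403792794 ⋅ 2^-56)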

Writing

The algorithm for writing a floating-point number is more complicated because it must generate the shortest decimal mantissa which reads as the original floating-point input.
The approximate number of decimal digits needed to represent a binary number is obtained by multiplying dbl-mant-dig by log10 2 ≈ 0.30103 (llog2 in the code below). Try converting using the floor of this number and test whether the result reads back as the original number. If so, then return; if not, then try with one more decimal digit.
(define (exact-ceiling x) (inexact->exact (ceiling x)))

;; Given positive integer mantissa MANT and exponent (of 2) E2, find the
;; closest integer mantissa and exponent (of 10) and format them as a string.
(define (dbl->string mant e2)
  (define llog2 (/ (log 2) (log 10)))
  (define f (ldexp mant e2))
  (define quo 0)
  (define point 0)
  (if (> e2 0)
      (let ((num (* mant (expt 2 e2))))
        (set! point
              (max 0 (exact-ceiling
                      (- (/ (log num) (log 10)) (* dbl-mant-dig llog2)))))
        (let ((den (expt 10 point)))
          (set! quo (round-quotient num den))
          (cond ((not (= (mantexp->dbl quo point) f))
                 (set! point (+ -1 point))
                 (set! quo (round-quotient num (quotient den 10)))))))
      (let ((den (expt 2 (- e2))))
        (set! point (exact-ceiling (* e2 llog2)))
        (let ((num (* mant (expt 10 (- point)))))
          (set! quo (round-quotient num den))
          (cond ((and (positive? f) (not (= (mantexp->dbl quo point) f)))
                 (set! point (+ -1 point))
                 (set! quo (round-quotient (* num 10) den)))))))
  (let* ((dman (number->string quo)) (lman (string-length dman)))
    (do ((idx (+ -1 lman) (+ -1 idx)))
        ((or (zero? idx) (not (eqv? #\0 (string-ref dman idx))))
         (string-append "." (substring dman 0 (+ 1 idx))
                        "e" (number->string (+ point lman)))))))
When e2 is positive, num is bound to the product of mant and 2^e2. point is set to an upper bound on the number of decimal digits of num in excess of the floating-point mantissa's precision. The round-quotient of num and 10^point produces the integer quo. If mantexp->dbl of quo and point is not equal to the original floating-point value f, then the round-quotient is computed again with the divisor divided by 10, yielding one more digit of precision.
When e2 is negative, den is bound to 2^(-e2) and point is set to ⌈e2⋅llog2⌉, roughly the negation of the number of decimal digits in 2^(-e2). num is bound to the product of mant and 10^(-point). The round-quotient of num and den produces the integer quo. If mantexp->dbl of quo and point is not equal to the original floating-point value f, then the round-quotient is computed again with num multiplied by 10, yielding one more digit of precision.
The last part of dbl->string produces a string using Scheme's number->string to convert the integer mantissa and exponent components to decimal strings. Mantissa trailing zeros are eliminated by scanning the dman string in reverse for non-zero digits.
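As an illustration (not from the original article): the double nearest 1/10 is 3602879701896397 ⋅ 2^-55, and handing that mantissa and exponent to dbl->string reproduces the one-digit decimal mantissa; a small power of two round-trips as well:
        (dbl->string 3602879701896397 -55)  ; ⇒ ".1e0"
        (dbl->string 1 1)                   ; ⇒ ".2e1"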

References:

[1990] William D. Clinger. How to Read Floating-Point Numbers Accurately. Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation (PLDI).
[1996] Robert G. Burger and R. Kent Dybvig. Printing Floating-Point Numbers Quickly and Accurately. Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation (PLDI).

Saturday, June 29, 2013

Hash Entropy

http://people.csail.mit.edu/jaffer/III/Hash-Entropy
A browser cookie contains a 128 bit "unique" identifier. 128 bits is an inconvenient size, being longer than native integer size on nearly all CPUs. The 128 bit number can be split into two 64-bit numbers; are both needed to correlate user records?
In my sample of 103 million cookie identifiers, there were only 2 collisions when grouping by the left half of the identifier. For statistical calculations, this 0.000002% error rate is negligible.
The expected number of collisions among k samples out of n possible codes is:
k ⋅ ( 1 − ( 1 − 1/n )^(k−1) )
The expected number of matching birthdays among 20 people is 1.016. How many collisions are expected from 103 million samples of randomly distributed 64 bit numbers?
In order to keep the numbers within double-precision range, use the fact that ln(1+ε) ≈ ε for ε near 0:
k ⋅ ( 1 − ( 1 − 1/n )^(k−1) ) ≈ k ⋅ ( 1 − e^((1−k)/n) )
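This calculation can be sketched in Scheme as follows (the procedure name expected-collisions is mine, not from the article):
;; Expected number of collisions among K samples drawn uniformly at random
;; from N codes, using the approximation (1 - 1/n)^(k-1) ≈ e^((1-k)/n).
(define (expected-collisions k n)
  (* k (- 1 (exp (/ (- 1 k) n)))))

(expected-collisions 20. 365.)             ; ⇒ about 1.01 (the exact formula gives 1.016)
(expected-collisions 103e6 (expt 2. 64))   ; ⇒ about 575e-6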
The expected number of collisions for 103×10^6 random 64-bit numbers is 575.115×10^−6. So the cookie identifiers are not uniformly distributed. Another way of saying this is that the distribution of cookies has less entropy than a 64-bit random variable. How much entropy does the distribution have?
Solving the expected-collisions formula for n will give an estimate of the entropy (c is the number of collisions):
n ≈ ( 1 − k ) / ln( 1 − c/k )
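Continuing the sketch (again with names of my own choosing), the entropy estimate from the observed collision count:
;; Estimate the effective number of codes N from K samples and C collisions.
(define (estimated-codes k c)
  (/ (- 1 k) (log (- 1 (/ c k)))))

(estimated-codes 103e6 2)                    ; ⇒ about 5.3e15
(/ (log (estimated-codes 103e6 2)) (log 2))  ; ⇒ about 52.2 (bits)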
Two collisions among 103×10^6 cookie-IDs indicate around 52.2 bits of entropy (log2 of 5.3×10^15). Looking at a sample of cookie-IDs, notice that the digits under the "M" are all "4" and the digits under "N" are always "8", "9", "a", or "b".
              M    N
3001457e-e07a-4b20-ba16-f3f8624da98f
73f9436b-2868-4325-a62c-bf5a4f38dfdf
1c4df5b6-210b-4aba-9660-bdb129ea6f31
34c3fcb1-bf16-4a96-a7e9-ec377bface30
a1ddb5fc-4bb8-4321-a1bf-753668387197
3edd32e0-5bad-476d-bc1c-8cd8c80b88f8
2ee886c8-d877-480c-b109-455cd3c1b099
5c76c27d-7caf-4adf-8aec-65fee5ce281a
93af502b-61db-4596-bcfe-188c352296b5
ed981151-6179-45cf-8e0d-e4293caf62a0
54ea634b-ebfc-4371-97a7-5b1123264e4b
9a3cce97-2a4b-4a56-8fd9-85d9f07abce4
0bfdbd6f-5fd8-4c74-ab7e-1ccb1a81fd70
84ac5d8a-c64f-48a3-8d11-653aea56e2b4
f5e3afb2-e87b-45e1-9b1c-a49d3e106124
b5f832f5-0771-4554-8a00-da142aab1832
ca025994-9412-4311-bb62-c7e6458a94dc
cf1fc6a6-b3d4-47a5-8aa5-063120d5b8c4
83add10c-b266-4995-8803-805d32c58147
d85c3ec5-3d77-4767-b87b-a42a20dc9f74
b6104d07-b29e-4b07-b3c8-65efd08bd003
21111ab5-83ab-4af8-a39f-1824a4055a9a
78346096-3202-48d8-9ed8-797a0f233804
1d124d3d-8ac2-405b-9a8a-afb70dac4720
e187bd94-c091-4fb5-a3c4-2a50e2fe4e35
8dca4524-cb3d-4a40-ae4d-881976a6173f
94e458b1-04c8-4495-b3fb-4cbbc4c37561
7d0819df-4a08-4553-b10b-379433b09e35
74332f07-3183-479b-a285-04c1a9e61bc4
4ac04ba2-950e-4315-bb14-eb3369dd1591
f8a20163-1db1-47e4-a6b5-d8fa4519eb79
ba778464-0bf4-4575-93e9-ac5b41923e3a
6dda046c-651d-4d9c-9482-4d3c1247a14a
This turns out to be the version-4 format of RFC 4122, A Universally Unique IDentifier (UUID) URN Namespace. The left half of the UUID should have 60 bits of entropy if the random or pseudo-random source is uniformly distributed. The left half having only 52 bits of entropy indicates that the source quality could be improved.
Note that a linearly increasing sequence would have no collisions at all, even though, as a sequence, it has low entropy. Having the expected hash entropy is a test which a pseudo-random-number generator should pass, but PRNG sequences must also be "random"; so low collision counts are a necessary, but not sufficient, condition for PRNG quality.
This collision test's insensitivity to sequence makes it useful for testing hash functions. A good hash function will have hash entropy near the number of codes it can output.

Sarah V. Jaffer has a much simpler way to compute the number of collisions. k is much smaller than n. The probability of one choice hitting another is k/n. There are k chances to hit another choice; so the expected number of collisions is roughly k2/n. In the cookie calculation, Sarah's formula matches within one part-per-million.
How are these formulas related? Because ln(1+ε)≈ε, it follows that eε−1≈ε (for small ε). Thus the expected number of collisions is:
k ⋅ ( 1 − e^((1−k)/n) ) ≈ k ⋅ (k−1) / n
Realizing that k ≫ 1 leads to Sarah's formula k^2/n.
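A quick numeric check (same k and n as above, an illustrative call rather than code from the article) shows Sarah's approximation giving essentially the same value as the exponential formula computed earlier:
(/ (* 103e6 103e6) (expt 2. 64))   ; ⇒ about 575.115e-6, matching k⋅(1 − e^((1−k)/n))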

Wednesday, January 9, 2013

Competing for Energy Savings

MassSAVE (formerly the Massachusetts Residential Conservation Services Program) has been providing conservation services for Massachusetts residential energy consumers since 1980.  It is funded by Massachusetts' gas and electric utilities and energy efficiency providers.

Given that energy-audits and most of the related services are free to Massachusetts households yearly, why are annual participation rates only a few percent?

The largest obstacles to greater participation may be related to everyday suspicions about offers for free services:

* "If it sounds to good to be true, it probably is."

* "If it will save me money, why are they giving it away?"

Countering these suspicions is an uphill battle, as most of the "free" offers people encounter come with hidden agendas.  If a nominal fee were charged, then it would still give people cause for inaction, thinking "an energy-audit probably won't be worth the fee".

Another commonly heard objection to getting an energy-audit is the belief that one will not live in her current residence long enough to justify effort or expense.  This article suggests that we look for motivations beyond the energy savings themselves.

People are less critical about participating in activities organized around civic or charitable purposes.  Even residents without children are often willing to support local school fundraisers.

Unlike energy-audits, those fundraisers involve familiar goods or services like baked goods and car-washes.  To motivate people to participate in an unfamiliar process, it makes sense to frame the activity as a competition, a popular meme in contemporary life.  Making the competition one between a few towns (on energy-audit participation rate) raises the chances of winning from the hit-by-lightning odds of lotteries or singing competitions to plausible, achievable levels.

A prize to benefit a town will motivate civic-minded citizens to participate.  If the prize is a green energy source (such as a solar array), then even climate-change deniers must acknowledge that the prize will save the town (and its taxpayers) money; such savings are valued beyond their true economic worth in contemporary American society.

For a prize benefiting the town, the town's or school's infrastructure for communicating with residents could be harnessed to promote participation.

As for the prize itself, the publicity generated by the competition should be very attractive to green-energy companies, who may be convinced to supply the prize at reduced or no cost.

Because energy-audits can be done annually, these competitions could be repeated each year (perhaps without the previous winning towns).

The energy-audit participation rate competition is an all-winners proposition:

* Energy-audits become widely used and accepted;

* a town wins a green-energy source which lowers its energy costs;

* all participants save money on energy costs;

* electric utilities avoid the need to construct new generation capacity; and

* fossil-carbon pollution is reduced.

I am talking with NextStepLiving about setting up an energy-audit competition northwest of Boston.