Governance by those who do the work.

Wednesday, December 30, 2009

Integers-as-Bits versus Bit-Arrays

Integers As Bits (SRFI-60) is motivated this way:
Treating integers as two's-complement strings of bits is an arcane but important domain of computer science. It is used for:
  • hashing;
  • Galois-field[2] calculations of error-detecting and error-correcting codes;
  • cryptography and ciphers;
  • pseudo-random number generation;
  • register-transfer-level modeling of digital logic designs;
  • index arithmetic of Fast-Fourier transforms;
  • packing and unpacking numbers in persistant data structures;
  • space-filling curves with applications to dimension reduction and sparse multi-dimensional database indexes; and
  • generating initial values for root-finders and transcendental function algorithms.
Many people's first impulse is to conflate integers with bit-vectors. But this is not useful when we realize that functional operations on immutable integers-as-bits are less inefficient than side-effecting operations on large bit-vectors.

Bit-vectors can be used to construct very fast indexes for database tables having millions of records. To keep the working-set small, creation of megabit arrays is done explicitly; the bit-vector procedures accumulate their results into one of the operands. Functional operations on bignums would be impractical for such an application.

Let

N
be the largest number of significant bits in bignum inputs to a bitwise computation; and
P
be the number of binary bitwise operations (LOGAND, LOGOR, ...) in a computation.
Consider the computation to find the disjunction of P bit-vectors (a OR b OR c ...). Using (bignum) integers-as-bits, P bignums will be allocated and P-1 bignums will be discarded in the course of the computation. Thus it consumes O(PN) space and time.

Accumulating the result of multiple disjunctions in one bit-vector consumes only O(N) storage and O(PN) time.

The worst case for bit-vector accumulation is a full binary tree of 2q bit-vectors with P=2q-1 binary operations. By reusing temporary bit-vectors, storage can be limited to O(Nq)=O(NlogP) and O(PN) time.

A clever optimizer could be written to convert functional bitwise expressions into accumulating bit-vector forms, but I haven't found functional bitwise operators to be easier to work with than imperative bit-vector operators for database indexing.

In addition to the generic Array and Array Mapping procedures of SLIB, the following procedures in SCM were written to support the use of bit-vectors as database indexes. This technique works best for indexes having only a few indexed values (with many records per value). One bit-vector is devoted to each indexed value for a field; a 1-bit indicates that the corresponding record has that value in that field. Indexes for values which are sparse can instead be encoded by vectors of integers, each integer indicating the record having that value in that field. Indexes coded in these ways allow fast calculation of complicated queries.

5.4.3 Bit Vectors

Bit vectors can be written and read as a sequence of 0s and 1s prefixed by #*.

#1At(#f #f #f #t #f #t #f) => #*0001010

Some of these operations will eventually be generalized to other uniform-arrays.

Function: bit-count bool bv
Returns the number of occurrences of bool in bv.

Function: bit-position bool bv k
Returns the minimum index of an occurrence of bool in bv which is at least k. If no bool occurs within the specified range #f is returned.

Function: bit-invert! bv
Modifies bv by replacing each element with its negation.

Function: bit-set*! bv uve bool
If uve is a bit-vector, then bv and uve must be of the same length. If bool is #t, then uve is OR'ed into bv; If bool is #f, the inversion of uve is AND'ed into bv.

If uve is a unsigned integer vector, then all the elements of uve must be between 0 and the LENGTH of bv. The bits of bv corresponding to the indexes in uve are set to bool.

The return value is unspecified.

Function: bit-count* bv uve bool
Returns
(bit-count (bit-set*! (if bool bv (bit-invert! bv)) uve #t) #t).

bv is not modified.

No comments:

Post a Comment