Governance by those who do the work.

Wednesday, December 30, 2009

Integers-as-Bits versus Bit-Arrays

Integers As Bits (SRFI-60) is motivated this way:
Treating integers as two's-complement strings of bits is an arcane but important domain of computer science. It is used for:
  • hashing;
  • Galois-field[2] calculations of error-detecting and error-correcting codes;
  • cryptography and ciphers;
  • pseudo-random number generation;
  • register-transfer-level modeling of digital logic designs;
  • index arithmetic of Fast-Fourier transforms;
  • packing and unpacking numbers in persistant data structures;
  • space-filling curves with applications to dimension reduction and sparse multi-dimensional database indexes; and
  • generating initial values for root-finders and transcendental function algorithms.
Many people's first impulse is to conflate integers with bit-vectors. But this is not useful when we realize that functional operations on immutable integers-as-bits are less inefficient than side-effecting operations on large bit-vectors.

Bit-vectors can be used to construct very fast indexes for database tables having millions of records. To keep the working-set small, creation of megabit arrays is done explicitly; the bit-vector procedures accumulate their results into one of the operands. Functional operations on bignums would be impractical for such an application.

Let

N
be the largest number of significant bits in bignum inputs to a bitwise computation; and
P
be the number of binary bitwise operations (LOGAND, LOGOR, ...) in a computation.
Consider the computation to find the disjunction of P bit-vectors (a OR b OR c ...). Using (bignum) integers-as-bits, P bignums will be allocated and P-1 bignums will be discarded in the course of the computation. Thus it consumes O(PN) space and time.

Accumulating the result of multiple disjunctions in one bit-vector consumes only O(N) storage and O(PN) time.

The worst case for bit-vector accumulation is a full binary tree of 2q bit-vectors with P=2q-1 binary operations. By reusing temporary bit-vectors, storage can be limited to O(Nq)=O(NlogP) and O(PN) time.

A clever optimizer could be written to convert functional bitwise expressions into accumulating bit-vector forms, but I haven't found functional bitwise operators to be easier to work with than imperative bit-vector operators for database indexing.

In addition to the generic Array and Array Mapping procedures of SLIB, the following procedures in SCM were written to support the use of bit-vectors as database indexes. This technique works best for indexes having only a few indexed values (with many records per value). One bit-vector is devoted to each indexed value for a field; a 1-bit indicates that the corresponding record has that value in that field. Indexes for values which are sparse can instead be encoded by vectors of integers, each integer indicating the record having that value in that field. Indexes coded in these ways allow fast calculation of complicated queries.

5.4.3 Bit Vectors

Bit vectors can be written and read as a sequence of 0s and 1s prefixed by #*.

#1At(#f #f #f #t #f #t #f) => #*0001010

Some of these operations will eventually be generalized to other uniform-arrays.

Function: bit-count bool bv
Returns the number of occurrences of bool in bv.

Function: bit-position bool bv k
Returns the minimum index of an occurrence of bool in bv which is at least k. If no bool occurs within the specified range #f is returned.

Function: bit-invert! bv
Modifies bv by replacing each element with its negation.

Function: bit-set*! bv uve bool
If uve is a bit-vector, then bv and uve must be of the same length. If bool is #t, then uve is OR'ed into bv; If bool is #f, the inversion of uve is AND'ed into bv.

If uve is a unsigned integer vector, then all the elements of uve must be between 0 and the LENGTH of bv. The bits of bv corresponding to the indexes in uve are set to bool.

The return value is unspecified.

Function: bit-count* bv uve bool
Returns
(bit-count (bit-set*! (if bool bv (bit-invert! bv)) uve #t) #t).

bv is not modified.

Tuesday, December 15, 2009

Schlep Toolchains

Schlep Toolchains documents and distributes the hyper-cross-platform technology used to create the WB B-tree database libraries and new implementation of Water for C, C#, and Java from source written in a subset of Scheme.

One might assume that in order for Schlep-dialect code to map to these three languages [C, C#, and Java], the Schlep-dialect must manifest the worst limitations of each language; but translation can actually rectify some of these limitations.

If I had to pick one practical feature of Scheme which elevates it above other languages, it would be internal definitions. The Algorithmic Language Scheme allows procedure definitions (using define, letrec, and named-let) inside of other procedures, which none of C, C#, or Java does. Internal definitions allow calls of internal procedures with a small number of arguments to replace the common alternatives:

  • duplication of code, or
  • top-level definitions and calls passing large numbers of arguments.
C and C# have a `goto' statement, enabling Schlep to emulate calling of internal-procedures in the tail position using some variable assignments (sometimes including temporary variable binding to emulate simultaneous assignment) followed by a goto statement. The restriction to the tail-position does not allow internal recursion other than tail-recursion; but this facilitates use of internal procedures in many situations which would otherwise force less desirable practices.

Java lacks a `goto' statement. Tail-called internal procedures are instead implemented using while (true), continue, and break with labels. The resulting Java code is not as readable as the original Scheme-dialect; but that loss in clarity is balanced by greater expressive power.

Schlep-example gives an example of a procedure with an internal procedure and how Schlep translates this procedure into C, C#, and Java.