Governance by those who do the work.

Wednesday, March 23, 2011

Design Pattern for Multi-Format Extract-Transform-Load

Presented is a design-pattern for single-pass extract-transform-load (ETL) processing of line-oriented files which automatically detects the line format, but which does not require any domain-specific code devoted to deciding the format of the input file.

The Color-Name Dictionaries web-page analyzes and compares more than 3 dozen color dictionaries from book and online sources. The dictionary sources are in 19 (plain-text and HTML) formats. The largest is the "NBS/ISCC Dictionary of Color Names" with 5411 color names; and there are several with over 1300 names. Some color dictionaries are updated or added every year.

The dictionary files share these characteristics:

  • Each color-name entry appears on a single line.
  • All the intended color-name entries in a file have the same
  • A Preamble and comments may be mixed with the color-name

There are format errors, inconsistencies, duplicates, and collisions (where a color name is repeated with different color coordinates) in some of the files. In spite of this, file->color-dictionary and url->color-dictionary in the SLIB color-database module are able to automatically detect the format and extract and clean the data and insert it into relational tables. This is accomplished in an easy-to-extend fashion using closures. This code is written in Scheme, but the same idea should work in any language with closures.

The main ETL loop has port bound to an input-stream reading from the dictionary source file:

        (do ((line (read-line port) (read-line port)))
            ((eof-object? line)
             (display "Inserted ") (display *idx*) (display " colors") (newline)
          (let ((colin (parse-rgb-line line)))
            (cond ((equal? "" line))
                  ((not colin) (write-line line))
                  ((numbered-gray? (cadr colin)))
                    (lambda (name)
                      (let ((oclin (color-table:row-retrieve name)))
                         ((and oclin (equal? (car colin) (cadr oclin))))
                         ((not oclin)
                          (set! *idx* (+ 1 *idx*))
                           (list name (car colin) *idx*)))
                         (else (slib:warn 'collision colin oclin)))))
                    (cdr colin))))))

The do loop iterates over the lines read from port, calling parse-rgb-line on each. The call to parse-rgb-line returns false unless the line was parsed as a valid color-dictionary line. If it returns false, line is printed to the log. If it returns a data row, each name in the row is checked for an existing table entry; if it matches, nothing is done; if it doesn't match, a warning is generated; if there is no entry, then it is created.

    (define (parse-rgb-line line)
         (lambda (method)
           (or ans
               (let ((try (method line)))
                 (cond (try (set! ans try)
                            (display "**** Using method ")
                            (display method-id) (newline)
                            (set! parse-rgb-line method))))))
          (lambda (line) ...)
          (lambda (line) ...)

The construction of parse-rgb-line is a bit unusual. The for-each procedure calls its first argument (lambda (method) ...) with each function (lambda (line) ...) in the list which is its second argument. The first-argument procedure returns immediately if ans is not false. Otherwise, it calls method with the input line. If method returns a data row, then ans is set to it, and no other methods will be called. The test of ans can be eliminated through the use of an escape continuation.

Most importantly, when method returns a data row, it sets parse-rgb-line to method, so that after the current invocation of parse-rgb-line returns (or escapes), the method which worked will be called directly. Thus parse-rgb-line is a self-replacing function!

If parse-rgb-line is defined internally to another procedure, the containing procedure can be reentrant (as SLIB is).

Here is a typical (lambda (line) ...) function:

          (lambda (line)
            (case (sscanf line " %24[a-zA-Z0-9_ ] %d %d %d %e %e %e %s"
                          name r g b ri gi bi junk)
               (set! method-id 'm7)
               (list (check-match line
                                  (color:sRGB r g b)
                                  (floats->rgb ri gi bi))
                     (color-name:canonicalize name)))
              (else #f)))

sscanf is called with 8 variables, but this function will succeed only if exactly 7 are scanned; this is to detect extra stuff on the line. method-id is for reporting which method succeeded. check-match is called to warn if the RGB integer coordinates don't match the floating-point color coordinates.

If a format needs to ignore junk at the end of lines, then it should come after any functions (in the list) whose formats it might spoof. This way, the more specific format takes precedence.


A design-pattern employing a self-replacing function provides single-pass ETL processing of line-oriented files which automatically detects the line format.

1 comment:

glathoud said...

Interesting idea. I'd say your idea takes Peter Michaux's lazy function definition pattern to the next level.