Governance by those who do the work.

Wednesday, March 23, 2011

Design Pattern for Multi-Format Extract-Transform-Load

Presented is a design-pattern for single-pass extract-transform-load (ETL) processing of line-oriented files which automatically detects the line format, but which does not require any domain-specific code devoted to deciding the format of the input file.


The Color-Name Dictionaries web-page analyzes and compares more than 3 dozen color dictionaries from book and online sources. The dictionary sources come in 19 different plain-text and HTML formats. The largest is the "NBS/ISCC Dictionary of Color Names" with 5411 color names, and several others have over 1300 names. Some color dictionaries are updated or added every year.

The dictionary files share these characteristics:

  • Each color-name entry appears on a single line.
  • All the intended color-name entries in a file have the same
    format.
  • A preamble and comments may be mixed with the color-name
    entries.

There are format errors, inconsistencies, duplicates, and collisions (where a color name is repeated with different color coordinates) in some of the files. In spite of this, file->color-dictionary and url->color-dictionary in the SLIB color-database module are able to automatically detect the format and extract and clean the data and insert it into relational tables. This is accomplished in an easy-to-extend fashion using closures. This code is written in Scheme, but the same idea should work in any language with closures.

The main ETL loop has port bound to an input-stream reading from the dictionary source file:

        (do ((line (read-line port) (read-line port)))
            ((eof-object? line)
             (display "Inserted ") (display *idx*) (display " colors") (newline)
             *rcs-header*)
          (let ((colin (parse-rgb-line line)))
            (cond ((equal? "" line))
                  ((not colin) (write-line line))
                  ((numbered-gray? (cadr colin)))
                  (else
                   (for-each
                    (lambda (name)
                      (let ((oclin (color-table:row-retrieve name)))
                        (cond
                         ((and oclin (equal? (car colin) (cadr oclin))))
                         ((not oclin)
                          (set! *idx* (+ 1 *idx*))
                          (color-table:row-insert
                           (list name (car colin) *idx*)))
                         (else (slib:warn 'collision colin oclin)))))
                    (cdr colin))))))

The do loop iterates over the lines read from port, calling parse-rgb-line on each. The call to parse-rgb-line returns false unless the line was parsed as a valid color-dictionary line. Empty lines are skipped; any other line for which it returns false is printed to the log. If it returns a data row (a color followed by one or more names), a row whose first name satisfies numbered-gray? is skipped; otherwise each name in the row is checked for an existing table entry. If there is no entry, one is created; if the existing entry has the same color, nothing is done; if the colors differ, a collision warning is generated.
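As the same idea carries over to other languages, here is a minimal Python sketch of the loop's insert, duplicate, and collision handling. It is not the SLIB code: the dict standing in for the relational table, the names load_colors and toy_parse, and the "R G B name" toy format are all assumptions made for illustration, and the numbered-gray? filter is left out.

```python
def load_colors(lines, parse_line, table=None, log=print):
    """Insert parsed color rows into `table`; return (table, inserted_count)."""
    table = {} if table is None else table
    inserted = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            continue                      # blank lines are skipped silently
        row = parse_line(line)
        if row is None:
            log(line)                     # unparsed line: echo it to the log
            continue
        color, names = row[0], row[1:]
        for name in names:
            existing = table.get(name)
            if existing is None:
                inserted += 1
                table[name] = (color, inserted)   # new entry with its index
            elif existing[0] == color:
                pass                      # exact duplicate: nothing to do
            else:
                log(f"collision: {name} {color} vs {existing[0]}")
    return table, inserted


def toy_parse(line):
    """Hypothetical format 'R G B name'; returns [color, name] or None."""
    parts = line.split(None, 3)
    if len(parts) != 4 or not all(p.isdigit() for p in parts[:3]):
        return None
    return [tuple(int(p) for p in parts[:3]), parts[3].strip().lower()]
```

Calling load_colors(open(path), toy_parse) would process a file in one pass; passing log=some_list.append collects the unparsed lines and collision messages instead of printing them.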

    (define (parse-rgb-line line)
     ...
        (for-each
         (lambda (method)
           (or ans
               (let ((try (method line)))
                 (cond (try (set! ans try)
                            (display "**** Using method ")
                            (display method-id) (newline)
                            (set! parse-rgb-line method))))))
         (list
          (lambda (line) ...)
          (lambda (line) ...)
          ...
         )))

The construction of parse-rgb-line is a bit unusual. The for-each procedure calls its first argument (lambda (method) ...) with each function (lambda (line) ...) in the list which is its second argument. The first-argument procedure returns immediately if ans is not false. Otherwise, it calls method with the input line. If method returns a data row, then ans is set to it, and no other methods will be called. The test of ans can be eliminated through the use of an escape continuation.

Most importantly, when method returns a data row, it sets parse-rgb-line to method, so that after the current invocation of parse-rgb-line returns (or escapes), the method which worked will be called directly. Thus parse-rgb-line is a self-replacing function!

If parse-rgb-line is defined internally to another procedure, each call of the containing procedure gets its own parse-rgb-line binding to replace, so the containing procedure can be reentrant (as SLIB is).
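The self-replacing idea can be sketched in Python with a closure variable. This is an illustration under assumptions, not a translation of the SLIB source: make_line_parser, active, and report are hypothetical names, and an early return takes the place of the escape continuation mentioned above.

```python
def make_line_parser(methods, report=print):
    """Return parse(line), which locks onto the first method that works.

    `methods` is a list of (method_id, function) pairs; each function
    returns a parsed row, or None if the line is not in its format.
    """
    def detect(line):
        nonlocal active
        for method_id, method in methods:
            row = method(line)
            if row is not None:
                report(f"**** Using method {method_id}")
                active = method      # replace the detector with the winner
                return row           # early return: no later method is tried
        return None                  # no format recognized this line yet

    active = detect                  # start out in detection mode

    def parse(line):
        return active(line)          # calls detect, or the locked-in method

    return parse
```

Because active lives in the closure created by each call to make_line_parser, two files can be parsed at the same time with independent parsers; this is the reentrancy point made above.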

Here is a typical (lambda (line) ...) function:

          (lambda (line)
            (case (sscanf line " %24[a-zA-Z0-9_ ] %d %d %d %e %e %e %s"
                          name r g b ri gi bi junk)
              ((7)
               (set! method-id 'm7)
               (list (check-match line
                                  (color:sRGB r g b)
                                  (floats->rgb ri gi bi))
                     (color-name:canonicalize name)))
              (else #f)))

sscanf is called with 8 variables, but this function succeeds only if exactly 7 are scanned; if the eighth variable (junk) is also assigned, there is extra text on the line and the format is rejected. method-id is for reporting which method succeeded. check-match is called to warn if the RGB integer coordinates don't match the floating-point color coordinates.

If a format needs to ignore junk at the end of lines, then it should come after any functions (in the list) whose formats it might spoof. This way, the more specific format takes precedence.
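A rough Python analogue of the ordering rule, using regular expressions in place of sscanf. Both formats here are hypothetical (loosely modeled on the example above), as are all the function and variable names; check-match is omitted, and the floats are only tested for being well formed.

```python
import re

# Hypothetical line formats, loosely modeled on the sscanf example.
_NAME = r"\s*([A-Za-z_][A-Za-z_ ]*?)"
_INTS = r"\s+(\d+)\s+(\d+)\s+(\d+)"
_FLOATS = r"\s+([-+.\deE]+)" * 3

# Specific: name, three integers, three floats, and nothing else.
SEVEN_FIELDS = re.compile(_NAME + _INTS + _FLOATS + r"\s*")
# General: name and three integers; anything after them is ignored.
FOUR_FIELDS_THEN_JUNK = re.compile(_NAME + _INTS + r"(?:\s.*)?")


def parse_seven(line):
    m = SEVEN_FIELDS.fullmatch(line)   # fullmatch rejects trailing junk
    if m is None:
        return None
    try:
        [float(x) for x in m.group(5, 6, 7)]   # floats must be well formed
    except ValueError:
        return None
    return [tuple(int(x) for x in m.group(2, 3, 4)), m.group(1).lower()]


def parse_four_ignoring_junk(line):
    m = FOUR_FIELDS_THEN_JUNK.fullmatch(line)
    if m is None:
        return None
    return [tuple(int(x) for x in m.group(2, 3, 4)), m.group(1).lower()]


# The specific format is listed first so the junk-tolerant one cannot spoof it.
METHODS = [("m7", parse_seven), ("m4", parse_four_ignoring_junk)]


def first_match(line, methods=METHODS):
    """Return (method_id, row) for the first method accepting the line."""
    for method_id, method in methods:
        row = method(line)
        if row is not None:
            return method_id, row
    return None
```

A seven-field line is accepted by both functions, so with the list reversed the junk-tolerant format would win and the floating-point coordinates would be silently discarded as junk.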

Conclusion

A design-pattern employing a self-replacing function provides single-pass ETL processing of line-oriented files which automatically detects the line format.

Tuesday, March 15, 2011

Names for Hammer Dulcimer in English Books 1800-2000


Google Labs has a nice webapp for creating graphs of word- and phrase-frequency versus time for the millions of books they have scanned.

Here is a graph of the frequency of various names for hammer-dulcimer from 1800 to the present. Click on the graph to try out the webapp for your own ideas. One caution: it is case-sensitive.

Friday, March 4, 2011

YouTube Hangs my Ubuntu-10.04 HP-dv7

At 7:17pm on Mar 2 2011, Ubuntu-10.04.1 running on an HP dv7 (x86_64) ceased to be able to play YouTube videos. Both Chromium and Firefox get wedged. Hulu and other Flash services play without problems. But after a couple of seconds of playing a YouTube video, the screen goes black with white streaks. /var/log/kern.log goes into a loop:
Mar  2 19:17:37 dv7 kernel: [ 8232.470145] [drm:radeon_fence_wait] *ERROR* fence(ffff8800b8044980:0x000097BA) 510ms timeout going to reset GPU
Mar  2 19:17:37 dv7 kernel: [ 8232.470157] radeon 0000:01:00.0: GPU softreset 
Mar  2 19:17:37 dv7 kernel: [ 8232.470164] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xE53024A4
Mar  2 19:17:37 dv7 kernel: [ 8232.470171] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00110002
Mar  2 19:17:37 dv7 kernel: [ 8232.470177] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  2 19:17:37 dv7 kernel: [ 8232.470192] radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar  2 19:17:37 dv7 kernel: [ 8232.470249] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar  2 19:17:37 dv7 kernel: [ 8232.470314] radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000402
Mar  2 19:17:37 dv7 kernel: [ 8232.519205] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003028
Mar  2 19:17:37 dv7 kernel: [ 8232.519211] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000002
Mar  2 19:17:37 dv7 kernel: [ 8232.519217] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  2 19:17:37 dv7 kernel: [ 8232.523318] [drm:radeon_fence_wait] *ERROR* fence(ffff8800b8044980:0x000097BA) 570ms timeout
Mar  2 19:17:37 dv7 kernel: [ 8232.523320] [drm:radeon_fence_wait] *ERROR* last signaled fence(0x000097BA)
Mar  2 19:17:39 dv7 kernel: [ 8235.040219] [drm:radeon_fence_wait] *ERROR* fence(ffff8800c4203ac0:0x000097C3) 510ms timeout going to reset GPU
Mar  2 19:17:39 dv7 kernel: [ 8235.040231] radeon 0000:01:00.0: GPU softreset 
Mar  2 19:17:39 dv7 kernel: [ 8235.040238] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0xA0003028
Mar  2 19:17:39 dv7 kernel: [ 8235.040245] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000002
Mar  2 19:17:39 dv7 kernel: [ 8235.040251] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  2 19:17:39 dv7 kernel: [ 8235.040263] radeon 0000:01:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
Mar  2 19:17:39 dv7 kernel: [ 8235.040319] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
Mar  2 19:17:39 dv7 kernel: [ 8235.040384] radeon 0000:01:00.0:   R_000E60_SRBM_SOFT_RESET=0x00000402
Mar  2 19:17:39 dv7 kernel: [ 8235.089275] radeon 0000:01:00.0:   R_008010_GRBM_STATUS=0x00003028
Mar  2 19:17:39 dv7 kernel: [ 8235.089280] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2=0x00000002
Mar  2 19:17:39 dv7 kernel: [ 8235.089286] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS=0x200000C0
Mar  2 19:17:39 dv7 kernel: [ 8235.091966] [drm:radeon_fence_wait] *ERROR* fence(ffff8800c4203ac0:0x000097C3) 570ms timeout
Mar  2 19:17:39 dv7 kernel: [ 8235.091968] [drm:radeon_fence_wait] *ERROR* last signaled fence(0x000097C3)
Mar  2 19:17:42 dv7 kernel: [ 8237.611578] [drm:radeon_fence_wait] *ERROR* fence(ffff8800c4203f00:0x000097CA) 510ms timeout going to reset GPU
Mar  2 19:17:42 dv7 kernel: [ 8237.611591] radeon 0000:01:00.0: GPU softreset 
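
To see how regular the loop in the log above is, the kernel timestamps of the "going to reset GPU" lines can be compared. The following Python sketch does that; the regular expression is written against the lines shown here and is an assumption about the message layout, and the function names are made up for this example.

```python
import re

# Matches the bracketed kernel uptime stamp on radeon_fence_wait lines that
# announce a GPU reset, as they appear in the excerpt above.
RESET = re.compile(
    r"\[\s*(\d+\.\d+)\] \[drm:radeon_fence_wait\] \*ERROR\* "
    r".*timeout going to reset GPU"
)


def reset_times(lines):
    """Return the kernel timestamps (seconds) of each GPU-reset message."""
    return [float(m.group(1)) for m in map(RESET.search, lines) if m]


def reset_intervals(lines):
    """Return the gaps in seconds between consecutive GPU-reset messages."""
    times = reset_times(lines)
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]
```

Run over the excerpt above (for example with the lines of /var/log/kern.log), the three reset messages are each about 2.57 seconds apart.
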
Reinstalling Flash didn't help. Eight hours earlier, I had updated to 2.6.32-29-generic, but booting with the previous kernel version didn't help either. Installing the "linux-firmware-nonfree" drivers made no difference.
Start-Date: 2011-03-02  10:42:14
Install: linux-headers-2.6.32-29-generic (2.6.32-29.58), linux-headers-2.6.32-29 (2.6.32-29.58), linux-image-2.6.32-29-generic (2.6.32-29.58)
Upgrade: linux-generic (2.6.32.28.32, 2.6.32.29.35), linux-headers-generic (2.6.32.28.32, 2.6.32.29.35), linux-image-generic (2.6.32.28.32, 2.6.32.29.35), linux-libc-dev (2.6.32-28.55, 2.6.32-29.58)
End-Date: 2011-03-02  10:43:29
What has changed to cause this?
Others with the same problem are posting to
http://ubuntuforums.org/showthread.php?t=1699550

Fix Pink / Red YouTube Videos Bug Using Flash 10.2 In Ubuntu
"Possible fix 3" solved the problem by disabling "hardware acceleration" in Flash. So whose bug is this? Is the bug in Flash 10.2 or in something it calls? In any case, plugins shouldn't be able to crash X11, so X11 is behaving badly. But the errors show up in "kern.log"; does the kernel deserve some of the blame?