glod (grokking lots of data)
Taming copious amounts of data has become daily routine for many people in various disciplines. This toolbox (while still trying to find its niche) focusses on preparing the data for further processing using other tools or frameworks.
The toolset consists of various little command line utilities.
Rationale
The glod suite serves as an umbrella for tools that were/are actually needed in a production environment but which are yet too trivial or small to justify a full-fledged repository.
This is the primary reason for the seemingly odd coverage of problems, and it is also why tools may appear and disappear willy-nilly.
All tools deliberately ignore system-wide or user-specific localisation settings (locales)! This (and of course speed) sets glod apart from tools like PRETO, JPreText or OpenRefine.
Moreover, most of the tools deliberately sacrifice portability for speed on the actual production platform (which is 64bit AVX2 Intel). This goes as far as using every trick in the book, e.g. Cilk, nested functions, assembler-backed co-routines, automatic CPU dispatch, and more. The downside, obviously, is that underfeatured compilers (yes, clang, looking at you) won’t be able to build half the tools.
glep
A multi-pattern grep: report which patterns match in which files. All patterns are searched for in parallel across all of the specified files.
Matching files and patterns are printed to stdout (separated by tabs):
$ cat pats
"virus"
"glucose"
"pound"
"shelf"
$ glep -f pats testfile1 testfile2
virus testfile1
shelf testfile2
pound testfile2
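Since the report is plain tab-separated text it composes nicely with the usual UNIX tools. A small sketch (using only standard coreutils, not part of glep itself) counting how many patterns matched per file:
$ glep -f pats testfile1 testfile2 | cut -f2 | sort | uniq -c
1 testfile1
2 testfile2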
terms
A fast text file tokeniser. Output terms occurring in the specified files, one term per line, and different files separated by a form feed.
A term (by our definition) is a sequence of alphanumerical characters that may be interspersed with (but not prefixed or suffixed by) punctuation characters.
$ terms testfile1
New
virus
found
Output of the terms utility can be fed into other tools that follow the bag-of-words approach. For instance, to get a frequency vector in no time:
$ cat testfile1 | terms | sort | uniq -c
1 New
1 found
1 virus
Or to assign a numeric mapping:
$ cat testfile1 | terms | sort -u | nl
1 New
2 found
3 virus
The terms utility is meant for bulk operations on corpora of UTF-8 encoded text files without language labels or other forms of preclustering.
System-wide or local i18n settings are explicitly ignored! This might lead to complications when mixing glod tools with other preprocessing tools.
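To illustrate one such complication: locale-aware tools such as sort(1) collate mixed-case input differently depending on the active locale, so it is advisable to pin the locale (e.g. LC_ALL=C) when post-processing terms output. A minimal sketch, assuming an en_US.UTF-8 locale is installed:
$ printf 'a\nB\n' | LC_ALL=C sort
B
a
$ printf 'a\nB\n' | LC_ALL=en_US.UTF-8 sort
a
B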
enum
Enumerate terms from stdin. This tool reads strings, one per line, and assigns each distinct string an integer, much like an SQL SERIAL column. Consider
$ cat testfile
this
is
this
test
$
and now enumerating the lines
$ enum < testfile
1
2
1
3
$
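One natural combination (a sketch, not a built-in workflow) is to pipe terms output through enum, turning a document into a stream of integer token ids in order of first appearance. With testfile1 from above (whose terms are New, virus and found):
$ terms testfile1 | enum
1
2
3
$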
uncol
Turn columnised text into tab-separated form again, i.e. undo the
columnisation (as produced for instance with column(1)
from
util-linux).
$ cat testfile
INSTR EXCH ISIN
WIGA XFRA DE000A11QCU2
TTY XFRA US8919061098
$ uncol < testfile
INSTR EXCH ISIN
WIGA XFRA DE000A11QCU2
TTY XFRA US8919061098
$
Or, to demonstrate more clearly, use a different output delimiter:
$ uncol --output-delimiter ';' < testfile
INSTR;EXCH;ISIN
WIGA;XFRA;DE000A11QCU2
TTY;XFRA;US8919061098
$
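With the data in tab-separated form, column extraction with standard tools becomes trivial. A sketch, assuming uncol's default output delimiter is the tab (which is also cut(1)'s default field delimiter), picking the ISIN column:
$ uncol < testfile | cut -f3
ISIN
DE000A11QCU2
US8919061098
$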