sample
Produce a sample of lines from files. The sample size is either fixed or proportional to the size of the file. Additionally, the header and footer can be included in the sample.
Red tape
- no dependencies other than a POSIX system and a C99 compiler.
- licensed under BSD3c
Resources
- homepage: http://www.fresse.org/sample/
- github: https://github.com/hroptatyr/sample
- issues: https://github.com/hroptatyr/sample/issues
- download: https://github.com/hroptatyr/sample/releases
Features
- proportional sampling of streams and files
- header and footer can be included in the sample
- reservoir sampling (fixed sample size) of streams and files
- stable reservoir sampling (i.e. the order is preserved)
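To illustrate the last two features, here is a minimal Python sketch of stable reservoir sampling. It is not how sample is implemented (the tool itself is written in C); the function name `stable_reservoir_sample` and its parameters are made up for this illustration. It uses the textbook Algorithm R and then sorts the survivors by line number to restore the input order.

```python
import random

def stable_reservoir_sample(lines, k, rng=None):
    """Pick k lines uniformly at random and return them in input order.

    Hypothetical helper for illustration only, not part of sample.
    """
    rng = rng or random.Random()
    reservoir = []  # holds (line_number, line) pairs, at most k of them
    for i, line in enumerate(lines):
        if i < k:
            # The first k lines fill the reservoir unconditionally.
            reservoir.append((i, line))
        else:
            # Line i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = (i, line)
    # Sorting by line number makes the sample "stable".
    reservoir.sort(key=lambda pair: pair[0])
    return [line for _, line in reservoir]
```

Only the k reservoir slots are kept in memory, so the input can be an arbitrarily long stream.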
Motivation
Practically ubiquitous, there’s shuf -n of GNU coreutils, a tool that, in principle, solves the problem at hand. However, shuf buffers all input and is therefore useless for files that don’t fit in memory.
So, looking for alternatives, one may come across paulgb’s subsample or earino’s fast_sample. They usually do the trick and everyone seems to agree (judging by GitHub stars). However, both tools have shortcomings: first, they try to make sense of the line data semantically, and second, they are slow!
The first issue is such a major problem that their bug trackers are full of reports. subsample needs lines to be UTF-8 strings and fast_sample wants CSV files whose correctness is checked along the way. This project’s tool, sample, on the other hand, does not care about the line’s content; all it needs are the line breaks at the end.
The speed issue is addressed by
- using the most appropriate programming language for the problem
- using radix sort
- using the PCG family to obtain randomness
- oversampling
Examples
To get 10 random words from the words file:
$ sample -n 10 -H 0 /usr/share/dict/words
...
benzopyrene
calamondins
cephalothorax
copulate
garbology's
Kewadin
Peter's
reassembly
Vienna's
Wagnerism's
...
The -H 0 option produces 0 lines of header output; the default is 5 lines.
For proportional sampling use -r|--rate:
$ wc -l /usr/share/dict/words
305089
$ sample -r 1% /usr/share/dict/words | wc -l
3080
which is close to the expected result (1% of 305089 is about 3051 lines), bearing in mind that by default the header and footer of the file are printed as well.
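One simple way to see why the line count is only close to 1% is to model proportional sampling as an independent coin flip per line. The Python sketch below does exactly that. It is an assumption made for illustration, not a description of what sample does internally, and the name `proportional_sample` is hypothetical.

```python
import random

def proportional_sample(lines, rate, rng=None):
    """Yield each line independently with probability `rate` (0.0 to 1.0).

    Hypothetical helper for illustration only, not part of sample.
    """
    rng = rng or random.Random()
    for line in lines:
        # rng.random() is uniform on [0, 1), so this test succeeds
        # with probability `rate`.
        if rng.random() < rate:
            yield line

# Usage: sample roughly 1% of 305089 numbered lines.
picked = list(proportional_sample(range(305089), 0.01, random.Random(0)))
print(len(picked))  # varies with the seed, but stays near 3051
```

Under this model the sample size is itself random, so two runs over the same file generally produce slightly different counts.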
Sampling with a rate of 0 prints only the header and footer, which replaces awkward scripts that use multios, head and tail to produce the same result.
$ sample -r 0 /usr/share/dict/words
A
AA
AAA
Aachen
aah
...
Zyuganov
Zyuganov's
zyzzyva
zyzzyvas
ZZZ
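For comparison, collecting the header and footer of a stream in a single pass can be sketched in Python as follows. This is only an illustration of the idea under the assumption of 5 header and 5 footer lines, as in the output above; `head_and_tail` is a hypothetical name and not part of sample.

```python
from collections import deque
from itertools import islice

def head_and_tail(lines, n_head=5, n_tail=5):
    """Return the first n_head and last n_tail lines of an iterable.

    Hypothetical helper for illustration only, not part of sample.
    """
    it = iter(lines)
    # Consume the header first.
    head = list(islice(it, n_head))
    # A bounded deque keeps only the most recent n_tail remaining lines.
    tail = deque(it, maxlen=n_tail)
    return head, list(tail)

head, tail = head_and_tail(range(100))
print(head)  # [0, 1, 2, 3, 4]
print(tail)  # [95, 96, 97, 98, 99]
```

Because the footer is taken from what remains after the header, short inputs never report the same line twice.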
Similar projects
In no particular order and without any claim to completeness:
- subsample: https://github.com/paulgb/subsample
- fast_sample: https://github.com/earino/fast_sample