Random Sampling From a File
See how the Linux shuf command draws a random sample of lines from a file, either with or without replacement.
I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling.
Given just a file name, shuf randomly permutes the lines of the file.
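One quick way to convince yourself that shuf only reorders lines is to compare the sorted input with the sorted output. This is a minimal sketch; the temporary file and its contents are made up for illustration.

```shell
# Create a small throwaway file (hypothetical contents).
tmp=$(mktemp)
printf '%s\n' alpha beta gamma delta > "$tmp"

# Shuffle the lines; the order varies from run to run.
shuffled=$(shuf "$tmp")
echo "$shuffled"

# Sorting both sides removes the ordering, so the two should match.
sorted_in=$(sort "$tmp")
sorted_out=$(printf '%s\n' "$shuffled" | sort)
[ "$sorted_in" = "$sorted_out" ] && echo "same lines, shuffled order"

rm -f "$tmp"
```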
With the option -n, you can specify how many lines to return. So it’s doing sampling without replacement. For example...
shuf -n 10 foo.txt
... would select 10 lines from foo.txt.
Actually, it would select at most 10 lines. You can’t select 10 lines without replacement from a file with fewer than 10 lines. If you ask for more lines than the file contains, the -n option is effectively ignored and you get every line, in random order.
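The "at most" behavior is easy to check by counting output lines. Here is a small sketch using a hypothetical three-line temporary file:

```shell
tmp=$(mktemp)
printf '%s\n' alpha beta gamma > "$tmp"

# Ask for 2 lines: we get exactly 2, and they are distinct.
two=$(shuf -n 2 "$tmp" | wc -l)
distinct=$(shuf -n 2 "$tmp" | sort -u | wc -l)

# Ask for 10 lines from a 3-line file: we only get the 3 that exist.
capped=$(shuf -n 10 "$tmp" | wc -l)

echo "requested 2, got $two ($distinct distinct); requested 10, got $capped"
rm -f "$tmp"
```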
You can also sample with replacement using the -r option. In that case, you can select more lines than are in the file since lines may be reused. For example, you could run ...
shuf -r -n 10 foo.txt
... to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has. For example, when I ran the command above on a file containing
alpha
beta
gamma
I got the output:
beta
gamma
gamma
beta
alpha
alpha
gamma
gamma
beta
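To try this on your own machine, you can count the output lines and tally how often each line was drawn. This sketch recreates the same three-line file under a temporary name; the tallies will differ from run to run.

```shell
tmp=$(mktemp)
printf '%s\n' alpha beta gamma > "$tmp"

# Draw 10 lines with replacement from a 3-line file.
sample=$(shuf -r -n 10 "$tmp")
count=$(printf '%s\n' "$sample" | wc -l)
echo "drew $count lines"

# Tally how many times each line was drawn.
printf '%s\n' "$sample" | sort | uniq -c

# At most 3 distinct values can appear, so some lines must repeat.
distinct=$(printf '%s\n' "$sample" | sort -u | wc -l)
echo "distinct lines: $distinct"
rm -f "$tmp"
```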
I don’t know how shuf seeds its random generator. Maybe from the system time. But if you run it twice you will get different results. Probably.
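If you need repeatable samples, the GNU version of shuf accepts a --random-source=FILE option that reads its random bytes from the given file. The sketch below assumes GNU coreutils; the seed file is a hypothetical temporary file, and replaying saved bytes like this is meant for reproducibility, not for anything security-sensitive.

```shell
tmp=$(mktemp)
seed=$(mktemp)   # hypothetical seed file holding saved random bytes
printf '%s\n' alpha beta gamma > "$tmp"

# Save some random bytes once; reusing them replays the same shuffle.
head -c 1024 /dev/urandom > "$seed"

first=$(shuf --random-source="$seed" "$tmp")
second=$(shuf --random-source="$seed" "$tmp")

[ "$first" = "$second" ] && echo "same order both times"
rm -f "$tmp" "$seed"
```

Keeping the seed file around lets you regenerate the same sample later; deleting it and creating a new one gives a fresh shuffle.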
Published at DZone with permission of John Cook, DZone MVB. See the original article here.