I spend a great deal of my day at the command-line. Therefore, it tends to be slightly annoying when I have to go into R or python and read in a file just to get a summary of a single column of data to make sure that something is not wrong with my analyses. I like to err on the cautious side and check the data that I am working with fairly frequently. So coming up with some sets of shell commands that might be able to compute summary statistics from text-based data-tables would make things go a little more seamlessly.
I wrote two versions of a function to get summary statistics on a particular column of data (really simple ones), one using R and one using mawk. Mawk is a variant on the ever-popular awk command-line utility that has been optimized for a subset of commands and is much faster in most cases.
Using mawk :
For a comparison of the methods, I ran both functions on a 231 MB file representing phased genotypes from Chromosome 22 of the 1000 Genomes Consortium Phase 3 data. I used the functions
mawk_stats to calculate summary statistics of the phenotype across individuals, which I simulated using R.
The comparison of the two functions (along with a version of
mawk_stats that uses awk) can be shown by the below screenshot.
It is clear that using
awk gains us a little speed over the version in R, but that could be attributable to the fact that we do not compute the median or quartiles. However, when we move to using
mawk we see a huge speedup (2 seconds vs. 19 seconds).
An epilogue to this brief adventure is that while the version of
mawk_stats is quite a bit faster than the R counterpart, it is missing some information. Future directions would be to implement Hoare’s Selection Algorithm using mawk to get the median efficiently and then calculate the quartiles as well. Recursion in (m)awk is a little bit tricky and I will likely devote an entire note to this implementation in the (somewhat) near future.