Delete multiple columns using awk or sed
Matthew Martinez
I have a database with 6037 space-separated columns and 450 rows like the one below:
1807 1452 1598 1 6.655713 A B A B ... 0
1808 1452 1763 1 9.362033 0 0 A B ... A
1809 1452 1527 2 6.728534 A B A A ... B
1810 1452 1367 2 9.4055 A B A A B ... A
... ... ... ... ... ... ... ... ... ...
1812 1452 1258 1 6.363032 0 0 A B ... BI want to get a new database with only the first 676 columns.
Preferably, some form that uses awk or sed command.
3 Answers
If the column delimiter in your file is a single character, e.g. a space, cut can do that easily:
cut -d' ' -f-676 <in >outThis prints only the space-separated columns from the first to the 676th.
If you need e.g. every whitespace character to count as a delimiter, a sed solution is:
sed -r 's/\s+\S+//677g' <in >outThis replaces every column (= at least one whitespace character followed by at least one non-whitespace character) beginning with the 677th with nothing. Using character groups you can specify any set of delimiters you need, e.g. for “4”, “#” and “K”:
sed -r 's/[4#K]+[^4#K]+//677g' <in >outFor a reasonable awk approach kindly refer to steeldriver’s answer, but here is another one looping over the columns and only printing them (separated by FS) if their number is <= 676:
awk '{for (i=1;i<=676;i++) {printf (i==1?"":FS)$i}; print ""}' <in >outFor a character group you have to specify the output field separator for the output, e.g. for [4#K] and "sep":
awk -F'[4#K]' '{for (i=1;i<=676;i++) {printf (i==1?"":"sep")$i}; print ""}' <in >out 0 For a single-character delimiter (such as space or comma) I would recommend using the cut command over either awk or sed.
However since you asked about awk specifically, I think a reasonable way to do it would be to decrement the field count:
awk -v last=676 '{NF = last} 1' datafileTested in GNU Awk (gawk) and mawk.
You could use
mlr --nidx --fs ' ' --repifs cat inputFile.csv | cut -d ' ' -f-2In this way with mlr () you manage field separators (if you have more than one spaces, they become one per field), and with cut you extract (in my example) the first two fields.
From
1807 1452 1598 1 6.655713 A B A B
1808 1452 1763 1 9.362033 0 0 A B
1809 1452 1527 2 6.728534 A B A A
1810 1452 1367 2 9.4055 A B A A Bto
1807 1452
1808 1452
1809 1452
1810 1452Some notes about Miller options:
--nidxis to set the format; this is a generic index-numbered table (the first field is 1, the second is 2, ecc..);--fsto set the separator (here is a space);--repifsmeans that multiple successive occurrences of the field separator count as onecatpasses input records directly to output.