Joining two text files with multiple keys
Matthew Harrington
I have two sets of data. and I want to join them by two key values (storm_ID, Cell_ID).
The first dataset looks like:
Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51The second dataset looks like:
Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11Now I want an output which looks something like this:
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11I tried join command in Linux to perform this task and failed badly. Join command skipped the rows which didn't match in the database. I can use Matlab but the size of the data is more than 100 GB which is making it very difficult for this task. Can someone please guide me on this one please. Can I use SQL or python to complete this task.
2 Answers
Using awk and sort:
awk -F, -v OFS=, '{x = $1 "," $2} FNR == NR {a[x] = $3; b[x] = 0; next} {b[x] = $3} !a[x] {a[x] = 0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n-F, -v OFS=,- set the input and output to be separated by,{x = $1 "," $2}save the first two fields separated by,, since the combination is be the common index.FNR == NR {a[x] = $3; b[x] = 0; next}-FNRis the record number per-file, andNRis the overall record number across files. These are equal for the first file, so this block is run only for the first file. Here, I save the third column of the the first file in the arraya, and initialize corresponding entry inbto 0. And then I skip to the next record.{b[x] = $3} !a[x] {a[x] = 0}- these two are run for the second file, saving the third column inb, and if there's no corresponding entry ina, set it to 0.END {for (i in a) print i, a[i], b[i]}, at the end of both files, print every record obtained so far
Since looping over arrays in awk gives a random order, we need to sort the output at the end with sort -n:
$ awk -F, -v OFS=, '{x = $1 "," $2} FNR == NR {a[x] = $3; b[x] = 0; next} {b[x] = $3} !a[x] {a[x] = 0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11 2 You may be able do it with join if you merge the first two fields into a single key by replacing the comma separator with a character that you can be sure won't appear in your data. Note that since join demands that the data are sorted on the join field, this will only work if doing so does not change the lexical order of the data.
For example:
$ join -t, -a1 -a2 -e0 -o0,1.2,2.2 <(sed 's/,/;/' file1) <(sed 's/,/;/' file2) | sed 's/;/,/'
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11 3