Velvet Star Monitor

Standout celebrity highlights with iconic style.

general

Joining two text files with multiple keys

Writer Matthew Harrington

I have two sets of data. and I want to join them by two key values (storm_ID, Cell_ID).

The first dataset looks like:

Storm_ID,Cell_ID,Wind_speed
2,10236258,27
2,10236300,58
2,10236301,25
3,10240400,51

The second dataset looks like:

Storm_ID,Cell_ID,Storm_surge
2,10236299,0.27
2,10236300,0.27
2,10236301,0.35
2,10240400,0.35
2,10240401,0.81
4,10240402,0.11

Now I want an output which looks something like this:

Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11

I tried join command in Linux to perform this task and failed badly. Join command skipped the rows which didn't match in the database. I can use Matlab but the size of the data is more than 100 GB which is making it very difficult for this task. Can someone please guide me on this one please. Can I use SQL or python to complete this task.

2 Answers

Using awk and sort:

awk -F, -v OFS=, '{x = $1 "," $2} FNR == NR {a[x] = $3; b[x] = 0; next} {b[x] = $3} !a[x] {a[x] = 0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n
  • -F, -v OFS=, - set the input and output to be separated by ,
  • {x = $1 "," $2} save the first two fields separated by ,, since the combination is be the common index.
  • FNR == NR {a[x] = $3; b[x] = 0; next} - FNR is the record number per-file, and NR is the overall record number across files. These are equal for the first file, so this block is run only for the first file. Here, I save the third column of the the first file in the array a, and initialize corresponding entry in b to 0. And then I skip to the next record.
  • {b[x] = $3} !a[x] {a[x] = 0} - these two are run for the second file, saving the third column in b, and if there's no corresponding entry in a, set it to 0.
  • END {for (i in a) print i, a[i], b[i]}, at the end of both files, print every record obtained so far

Since looping over arrays in awk gives a random order, we need to sort the output at the end with sort -n:

$ awk -F, -v OFS=, '{x = $1 "," $2} FNR == NR {a[x] = $3; b[x] = 0; next} {b[x] = $3} !a[x] {a[x] = 0} END {for (i in a) print i, a[i], b[i]}' f1 f2 | sort -n
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
2

You may be able do it with join if you merge the first two fields into a single key by replacing the comma separator with a character that you can be sure won't appear in your data. Note that since join demands that the data are sorted on the join field, this will only work if doing so does not change the lexical order of the data.

For example:

$ join -t, -a1 -a2 -e0 -o0,1.2,2.2 <(sed 's/,/;/' file1) <(sed 's/,/;/' file2) | sed 's/;/,/'
Storm_ID,Cell_ID,Wind_speed,Storm_surge
2,10236258,27,0
2,10236299,0,0.27
2,10236300,58,0.27
2,10236301,25,0.35
2,10240400,0,0.35
2,10240401,0,0.81
3,10240400,51,0
4,10240402,0,0.11
3

Your Answer

Sign up or log in

Sign up using Google Sign up using Facebook Sign up using Email and Password

Post as a guest

By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy