advanced awk/gawk

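The scripts below rely on awk's built-in variables (FNR, NR, and OFS). gawk supports all of them, and they are also part of POSIX awk, but very old awk implementations may not, so make sure you are running gawk or another modern awk.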

Pair two files based on a key column in each file

BED file of gene locations (file1.txt)

chr1    11869    14409    DDX11L10
chr1    14363    29570    WASH5P
chr1    34554    36081    FAM138F
chr1    34554    36081    FAM138B
chr1    34554    36081    FAM138A
chr1    69055    70108    OR4F5

List of genes whose chromosome, start, and stop you want to look up (file2.txt)

WASH5P    patient_1234    A    T
FAM138B    patient_2345    A    C
OR4F5    Patient_3456    G    A

Awk code
$ awk 'FNR==NR { a[$4]=$0;next } ($1 in a) { OFS = "\t" ; print a[$1],$2,$3,$4 }' file1.txt file2.txt

chr1    14363    29570    WASH5P    patient_1234    A    T
chr1    34554    36081    FAM138B    patient_2345    A    C
chr1    69055    70108    OR4F5    Patient_3456    G    A


FNR is the record (row/line) number within the file awk is currently reading; it resets to 1 at the start of each input file. NR is the total number of records awk has read so far across all input files. FNR==NR is a comparison, not an assignment: the two are equal only while awk is reading the first input file, so the action in the {} immediately after FNR==NR runs only for records from file1.txt. Because that action ends with next, the remaining pattern and action are reached only for records from file2.txt.
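To see the difference between NR and FNR, the following minimal sketch prints both counters for every record of two small throwaway files. The file names first.txt and second.txt and the temporary directory are made up for this illustration and are not part of the example above.

```shell
# Create two tiny files in a temporary directory (hypothetical names).
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# For every record print the file name, NR (running total) and FNR (per-file count).
counts=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR }' first.txt second.txt)
printf '%s\n' "$counts"

rm -rf "$tmpdir"
```

FNR and NR agree on the lines from first.txt and diverge as soon as awk starts on second.txt, which is why FNR==NR selects only the first file.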

{ a[$4]=$0;next }
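This part builds an associative array (hash table) called "a". The 4th field ($4) of each record in file1.txt, the gene name, is the index, and the entire record ($0) is the value. next tells awk to skip the rest of the program and move on to the next record, so awk adds one entry per record until it reaches the end of file1.txt. If a gene name appears more than once in file1.txt, the last record with that name overwrites the earlier ones.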
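One way to check what ends up in the hash table is to build it the same way and dump it in an END block. This is only a sketch: it writes a two-line copy of the BED data to a temporary file, and pipes the result through sort because the order of a for (k in a) loop is not defined.

```shell
# Write two of the BED records to a temporary file1.txt (tab separated).
tmpdir=$(mktemp -d)
printf 'chr1\t14363\t29570\tWASH5P\nchr1\t69055\t70108\tOR4F5\n' > "$tmpdir/file1.txt"

# Build the array as in the one-liner, then print each index and its value.
table=$(awk '{ a[$4] = $0 } END { for (k in a) print k " => " a[k] }' "$tmpdir/file1.txt" | sort)
printf '%s\n' "$table"

rm -rf "$tmpdir"
```

Each output line shows a gene name on the left and the full record stored for it on the right.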

($1 in a)
This is a test that asks whether the first field ($1) of the record being read, the gene name in file2.txt, exists as an index in the hash table "a". If it does, awk performs the action in the {} immediately after it; records whose gene is not in "a" are skipped.
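The in operator only checks whether an index exists. Referring to an element directly, for example a[$1], creates an empty entry if the index was missing. The sketch below illustrates this with a one-entry array; the index NOTAGENE is a made-up name used only for the demonstration.

```shell
result=$(awk 'BEGIN {
    a["WASH5P"] = "chr1\t14363\t29570\tWASH5P"

    # The in test asks whether the index exists and adds nothing to the array.
    if (!("NOTAGENE" in a)) print "not in a"
    n = 0; for (k in a) n++
    print "entries after in test: " n

    # Referencing the element creates an empty entry for that index.
    x = a["NOTAGENE"]
    n = 0; for (k in a) n++
    print "entries after reference: " n
}')
printf '%s\n' "$result"
```

The entry count stays at 1 after the in test and becomes 2 after the direct reference, which is one reason to prefer ($1 in a) for membership checks.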

{ OFS = "\t" ; print a[$1],$2,$3,$4 }
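The first statement sets OFS, the output field separator, to a tab, so the comma-separated items in print are joined by tabs. The print statement then outputs the value stored in the hash under the index taken from the first field of file2.txt (that is, the full record from file1.txt), followed by fields 2, 3, and 4 from file2.txt.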
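The whole example can be reproduced end to end with the sketch below. It assumes tab-separated input, writes shortened copies of file1.txt and file2.txt to a temporary directory, and, as a variation on the one-liner above, sets OFS once in a BEGIN block instead of on every matching record. The output is the same either way.

```shell
tmpdir=$(mktemp -d)

# file1.txt: gene locations; file2.txt: genes to look up (both tab separated).
printf 'chr1\t14363\t29570\tWASH5P\nchr1\t34554\t36081\tFAM138B\nchr1\t69055\t70108\tOR4F5\n' > "$tmpdir/file1.txt"
printf 'WASH5P\tpatient_1234\tA\tT\nOR4F5\tPatient_3456\tG\tA\n' > "$tmpdir/file2.txt"

joined=$(awk 'BEGIN { OFS = "\t" }                      # set the output separator once
     FNR == NR { a[$4] = $0; next }                     # first file: store record by gene name
     ($1 in a) { print a[$1], $2, $3, $4 }              # second file: print matches
     ' "$tmpdir/file1.txt" "$tmpdir/file2.txt")
printf '%s\n' "$joined"

rm -rf "$tmpdir"
```

FAM138B is in file1.txt but not in this shortened file2.txt, so it does not appear in the output; only genes listed in the second file are printed.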