Tutorial: Postprocessing with TAWK

This tutorial presents tawk functionality through various scenarios. Tawk works just like awk, but provides access to the columns via their names. In addition, it provides access to helper functions, such as host() or port(). Custom functions can be added to the t2custom folder, from where they are automatically loaded.
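For example, instead of referencing columns by position as in plain awk ($1, $2, ...), columns can be referenced by name. A minimal sketch, using column names that appear throughout this tutorial:

tawk '{ print $srcIP, $dstIP, $l4Proto }' file_flows.txt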

Prerequisites

This tutorial assumes a working knowledge of awk.

Dependencies

gawk version 4.1 or later is required.

  • Kali/Ubuntu: sudo apt-get install gawk
  • Arch: sudo pacman -S gawk
  • Fedora/Red Hat: sudo yum install gawk
  • Gentoo: sudo emerge gawk
  • OpenSUSE: sudo zypper install gawk
  • Mac OS X: brew install gawk (Homebrew package manager)

Installation

The recommended way to install tawk is to install t2_aliases as documented in README.md:

  • Append the following line to ~/.bashrc:
if [ -f "$T2HOME/scripts/t2_aliases" ]; then
    . $T2HOME/scripts/t2_aliases             # Note the leading `.'
fi
  • Make sure to replace $T2HOME with the actual path, e.g., $HOME/int_tranalyzer/trunk.
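Then reload the shell configuration and check that tawk is available (assuming a bash shell):

source ~/.bashrc
tawk -h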

Documentation (Man Pages)

The man pages for tawk and t2nfdump (more on that later) can be installed by running: ./install.sh man. Once installed, they can be consulted by running man tawk and man t2nfdump respectively.

General Introduction

Command line options

First, run tawk -h to list the available command line options:

> tawk -h
Usage:
    tawk [OPTION...] 'program' file_flows.txt
    tawk [OPTION...] -I file_flows.txt 'program'

Optional arguments:
    -I file             Alternative way to specify the input file
    -s char             First character for the row listing the columns name
    -F fs               Use 'fs' for the input field separator
    -n                  Load nfdump functions
    -e                  Load examples
    -X xerfile          Specify the .xer file to use with -k and -x options
    -x outfile          Run the fextractor on the extracted data
    -k                  Run Wireshark on the extracted data
    -t                  Do not validate column names
    -H                  Do not output the header (column names)
    -c[=u]              Output command line as a comment
                        (use -c=u for UTC instead of localtime)

Help and documentation arguments:
    -l[=n], --list[=n]  List column names and numbers
    -g[=n], --func[=n]  List available functions

    -d fname            Display function 'fname' documentation
    -V vname[=value]    Display variable 'vname' documentation

    -D                  Display tawk PDF documentation

    -?, -h, --help      Show help options and exit

-s Option

The -s option can be used to specify the starting character(s) of the row containing the column names (default: "%"). If several rows start with the specified character(s), the last one is used for the column names. To change this behaviour, the line number can be specified as well. For example, if rows 1 to 5 start with "#" and row 3 contains the column names, specify the separator as follows: tawk -s '#NR==3'. If the row with the column names does not start with a special character, use -s '' or -s 'NR==2'.
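For example, given a hypothetical file where rows 1 to 5 start with "#" and row 3 holds the column names (colA is a placeholder column name):

> tawk -s '#NR==3' '{ print $colA }' file.txt

And for a file whose second row lists the column names without any leading special character:

> tawk -s 'NR==2' '{ print $colA }' file.txt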

What features (columns) are available?

> tawk -l file_flows.txt

What functions are available?

> tawk -g file_flows.txt

Print the 5-tuple of every flow

> tawk '{ print tuple5() }' file_flows.txt

Aggregate the source and destination IPs

> tawk '{ aggr($srcIP); aggr($dstIP) }' file_flows.txt

Ignore all flows between private IPs

> tawk 'not(privip($srcIP) && privip($dstIP))' file_flows.txt

Print the 2-tuple of all flows where a column whose name starts with "dns" matches "facebook"

> tawk 'wildcard("^dns.*") ~ /facebook/ { print tuple2() }' file_flows.txt

Replace the protocol number by its string representation, e.g., 6 -> TCP

> tawk '{ $l4Proto = proto2str($l4Proto); print }' file_flows.txt

Replace the Unix timestamps used for timeFirst and timeLast by their values in UTC

> tawk '{ $timeFirst = utc($timeFirst); $timeLast = utc($timeLast); print }' file_flows.txt

Replace the Unix timestamps used for timeFirst and timeLast by their values in localtime

> tawk '{ $timeFirst = localtime($timeFirst); $timeLast = localtime($timeLast); print }' file_flows.txt

Aggregate the source and destination IPs of UDP A flows only, along with the number of bytes sent/received, and output the top 10 results without the header

> tawk -H '
    # A flows only
    udp() && !bitsallset($flowStat, 1) {
        aggr($srcIP, $numBytesSnt, 10)
        aggr($dstIP, $numBytesRcvd, 10)
    }
    ' file_flows.txt

Inspect the flow with flow index 1234 in the flow file

> tawk 'flow(1234)' file_flows.txt

Follow a specific flow, e.g., the flow with flow index 1234, in the packet file

> tawk 'flow(1234)' file_packets.txt

Inspect the packet number 1234 in the packet file

> tawk 'packet(1234)' file_packets.txt

Extract all flows whose HTTP Host: header matches google using Wireshark field names

> tawk 'shark("http.host") ~ /google/' file_flows.txt

Extract the DNS query field from all flows where at least one DNS answer was seen (using Wireshark field names)

> tawk 'shark("dns.count.answers") { print shark("dns.qry.name") }' file_flows.txt

Open all ICMP flows involving the network 1.2.3.4/24 in Wireshark

> tawk -k 'icmp() && host("1.2.3.4/24")' file_flows.txt

Create a PCAP file with all TCP flows with port 80 or 8080

> tawk -x file.pcap 'tcp() && port("80;8080")' file_flows.txt
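The -x and -k options rely on a .xer file; if the one to use must be specified explicitly, combine them with the -X option (file names below are placeholders):

> tawk -X file.xer -x file.pcap 'tcp() && port("80;8080")' file_flows.txt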

Writing a tawk Function

  • Ideally one function per file (where the filename is the name of the function)
  • Private functions are prefixed with an underscore
  • Always declare local variables 8 spaces after the function arguments
  • Local variables are prefixed with an underscore
  • Use uppercase letters and two leading and two trailing underscores for global variables
  • Include all referenced functions
  • Files should be structured as follows (a complete example follows after this list):
#!/usr/bin/env awk
#
# Function description
#
# Parameters:
#   - arg1: description
#   - arg2: description (optional)
#
# Dependencies:
#   - plugin1
#   - plugin2 (optional)
#
# Examples:
#   - tawk `funcname()' file.txt
#   - tawk `{ print funcname() }' file.txt

@include "hdr"
@include "_validate_col"

function funcname(arg1, arg2,        _locvar1, _locvar2) {
    _locvar1 = _validate_col("colname1;altcolname1", _my_colname1)
    _validate_col("colname2")

    if (hdr()) {
        if (__PRIHDR__) print "header"
    } else {
        print "something", $_locvar1, $colname2
    }
}
  • Copy your files into the t2custom folder.
  • To have your functions automatically loaded, include them in the file t2custom/t2custom.load.
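To illustrate the conventions above, here is a minimal, hypothetical function (the function name bigsender, its argument and the basicStats dependency are assumptions made for this example; numBytesSnt is a standard Tranalyzer column):

#!/usr/bin/env awk
#
# Prints all flows which sent more than 'min' bytes
#
# Parameters:
#   - min: minimum number of bytes sent
#
# Dependencies:
#   - basicStats
#
# Examples:
#   - tawk 'bigsender(1000)' file.txt

@include "hdr"
@include "_validate_col"

function bigsender(min,        _col) {
    _col = _validate_col("numBytesSnt")
    if (hdr()) {
        # Repeat the header row so the output stays a valid flow file
        if (__PRIHDR__) print
    } else if ($_col > min) {
        print
    }
}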

Using tawk Within Scripts

To use tawk from within a script:

  • Create a TAWK variable pointing to the script: TAWK="$T2HOME/scripts/tawk/tawk" (make sure to replace $T2HOME with the actual path to the trunk folder)
  • Call tawk as follows: $TAWK 'dport(80)' file.txt (see the example script below)
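Putting both steps together, a minimal script might look as follows (a sketch; the file name is a placeholder and -H suppresses the header so the matching flows can be counted):

#!/usr/bin/env bash
# Count the flows whose destination port is 80
TAWK="$T2HOME/scripts/tawk/tawk"   # replace $T2HOME with the actual path
"$TAWK" -H 'dport(80)' file.txt | wc -l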

Using tawk With Non-Tranalyzer Files

tawk can also be used with files that were not produced by Tranalyzer.

  • The input field separator can be specified with the -F option, e.g., tawk -F ',' 'program' file.csv
  • The row listing the column names can start with any character, specified with the -s option, e.g., tawk -s '#' 'program' file.txt
  • Column names must not be equal to a function name (tawk renames such columns with a trailing underscore, unless the -t option is used)
  • Valid column names must start with a letter (a-z, A-Z) and can be followed by any number of alphanumeric characters or underscores
  • If no column names are present, use the -t option to prevent tawk from trying to validate the column names.
  • If the column names are different from those used by Tranalyzer, refer to the next section.
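Combining these options, a hypothetical CSV file whose first row lists the column names (user and bytes are placeholder column names) could be processed as follows:

tawk -F ',' -s '' '{ print $user, $bytes }' file.csv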

Mapping External Column Names to Tranalyzer Column Names

If the column names are different from those used by Tranalyzer, a mapping between the different names can be made in the file scripts/tawk/my_vars. The format of the file is as follows:

BEGIN {
    _my_srcIP = non_t2_name_for_srcIP
    _my_dstIP = non_t2_name_for_dstIP
    ...
}

Once edited, run tawk with the -i $T2HOME/scripts/tawk/my_vars option and the external column names will be automatically used by tawk functions, such as tuple2(). For more details, refer to the my_vars file itself.
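For example, once the mapping is in place, functions such as tuple2() work unchanged (the file name is a placeholder):

tawk -i "$T2HOME/scripts/tawk/my_vars" '{ print tuple2() }' file.txt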

Using tawk with Bro Files

To use tawk with Bro log files, use the following command:

tawk -s '#fields' -i "$T2HOME/scripts/tawk/vars_bro" 'hdr() || !/^#/ { program }' file.log

Examples

  • Pivoting (variant 1):

    • First, extract an attribute of interest, e.g., an unresolved IP address in the Host: field of the HTTP header:
    tawk 'aggr($httpHosts)' FILE_flows.txt | tawk '{ print unquote($1); exit }'
    • Then, put the result of the last command in the badguy variable and use it to extract flows involving this IP:
    tawk -v badguy="$(!!)" 'host(badguy)' FILE_flows.txt
  • Pivoting (variant 2):

    • First, extract an attribute of interest, e.g., an unresolved IP address in the Host: field of the HTTP header, and store it into a badip variable:
    badip="$(tawk 'aggr($httpHosts)' FILE_flows.txt | tawk '{ print unquote($1); exit }')"
    • Then, use the badip variable to extract flows involving this IP:
    tawk -v badguy="$badip" 'host(badguy)' FILE_flows.txt
  • Aggregate the number of bytes sent between source and destination addresses (independent of the protocol and port) and output the top 10 results:
tawk 'aggr($srcIP4 OFS $dstIP4, $numBytesSnt, 10)' FILE_flows.txt
  • Aggregate the number of bytes, packets and flows sent over TCP between source and destination addresses (independent of the port) and output the top 20 results (output sorted according to numBytesSnt):
tawk 'tcp() { aggr(tuple2(), $numBytesSnt OFS $numPktsSnt OFS "Flows", 20) }' FILE_flows.txt
  • Sort the flow file according to the duration (longest flows first) and output the top 5 results:
tawk 't2sort(duration, 5)' FILE_flows.txt
  • Extract all TCP flows:
tawk 'tcp()' FILE_flows.txt
  • Extract all flows whose destination port is between 6000 and 6008 (included):
tawk 'dport("6000-6008")' FILE_flows.txt
  • Extract all flows whose destination port is 53, 80 or 8080:
tawk 'dport("53;80;8080")' FILE_flows.txt
  • Extract all flows involving an IP in the subnet 192.168.1.0/24 (using the host() or net() function):
tawk 'host("192.168.1.0/24")' FILE_flows.txt
tawk 'net("192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose destination IP is in subnet 192.168.1.0/24 (using the dhost() or dnet() function):
tawk 'dhost("192.168.1.0/24")' FILE_flows.txt
tawk 'dnet("192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the shost() or snet() function):
tawk 'shost("192.168.1.0/24")' FILE_flows.txt
tawk 'snet("192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinrange() function):
tawk 'ipinrange($srcIP4, "192.168.1.0", "192.168.1.255")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function):
tawk 'ipinnet($srcIP4, "192.168.1.0", "255.255.255.0")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and a hex mask):
tawk 'ipinnet($srcIP4, "192.168.1.0", 0xffffff00)' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and the CIDR notation):
tawk 'ipinnet($srcIP4, "192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and a CIDR mask):
tawk 'ipinnet($srcIP4, "192.168.1.0", 24)' FILE_flows.txt

For more examples, refer to the tawk -d option, e.g., tawk -d aggr, where every function is documented and comes with a set of examples. For more complex examples, have a look at the scripts/t2fm/tawk/ folder. The complete documentation can be consulted by running tawk -d all.

FAQ

Can I use tawk with non-Tranalyzer files?

Yes, refer to Using tawk With Non-Tranalyzer Files.

Can I use tawk functions with non-Tranalyzer column names?

Yes, edit the my_vars file and load it using the -i $T2HOME/scripts/tawk/my_vars option. Refer to Mapping External Column Names to Tranalyzer Column Names for more details.

Can I use tawk with files without column names?

Yes, use the -t option to prevent tawk from trying to validate the column names.

The row listing the column names starts with a ‘#’ instead of a ‘%’… Can I still use tawk?

Yes, use the -s option to specify the first character, e.g., tawk -s '#' 'program'

Can I process a CSV (Comma Separated Value) file with tawk?

The input field separator can be changed with the -F option. To process a CSV file, run tawk as follows: tawk -F ',' 'program' file.csv

Can I produce a CSV (Comma Separated Value) file from tawk?

The output field separator (OFS) can be changed with the -v OFS='char' option. To produce a CSV file, run tawk as follows: tawk -v OFS=',' 'program' file.txt

Can I write my tawk programs in a file instead of the command line?

Yes, copy the program (without the single quotes) in a file, e.g., prog.txt and run it as follows: tawk -f prog.txt file.txt
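For example, if prog.txt contains the following program:

tcp() { print tuple2() }

then tawk -f prog.txt file_flows.txt is equivalent to tawk 'tcp() { print tuple2() }' file_flows.txt.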

Can I still use column names if I pipe data into tawk?

Yes, you can specify a file containing the column names with the -I option as follows: cat file.txt | tawk -I colnames.txt 'program'

Can I use tawk if the row with the column names does not start with a special character?

Yes, you can specify the empty character with -s "". Refer to -s Option for more details.

I get a list of syntax errors from gawk… what is the problem?

The names of the columns are used to create variable names. If a name contains forbidden characters, then an error similar to the following is reported:

gawk: /tmp/fileBndhdf:3: col-name = 3
gawk: /tmp/fileBndhdf:3:    ^ syntax error

Although tawk tries to replace forbidden characters with underscores, the best practice is to use only alphanumeric characters (A-Z, a-z, 0-9) and underscores in column names. Note that a column name MUST NOT start with a number.

Tawk cannot find the column names… what is the problem?

First, make sure the comment char (-s option) is correctly set for your file (the default is "%"). Second, make sure the column names do not contain forbidden characters, i.e., use only alphanumeric characters and underscores, and do not start with a number. If the row with the column names is not the last one to start with the separator character, then specify the line number (NR) as follows: -s '#NR==3' or -s '%NR==2'. Refer to -s Option for more details.

How to make tawk faster?

Tawk tries to validate the column names by ensuring that no column name is equal to a function name and that all the column names used in the program exist. This verification process is quite slow and can easily be disabled with the -t option.
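For example, the following runs a filter without the validation step (the program must then only reference columns that actually exist):

tawk -t 'dport(80)' file_flows.txt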