(Encrypted) traffic mining


Introduction

Traffic mining is the art of extracting hidden, obfuscated or encrypted information from IP traffic by observing only the layer 2 to layer 4 header features. It exploits the fact that nobody produces perfect and secure code when writing Internet applications. The key points are libraries used by everyone, such as audio codecs. They have intrinsic features and physical characteristics which cannot be changed without impeding correct functionality. This characteristic behavior is reflected in layer 3 and 4 header features, independent of any encryption on layer 7. Two prominent features are the packet length (PL) and the inter-arrival time (IAT), also known as packet inter-distance, of consecutive packets in an A or B flow.

Using these two parameters, not only the type of the traffic but also its content can be revealed. To achieve this, two preprocessing approaches are effective:

Statistical approach (pktSIATHisto and descriptiveStats plugins)
Signal processing approach (nFrstPkts plugin)

Most of the work in classifying encrypted traffic lies in the quality of the preprocessing. Hence, T2 focuses on what type of data should be fed into a classifier or a feature selection mechanism to produce optimal results.

In this tutorial, we will discuss these preprocessing approaches using the traffic from skypeu.pcap, which contains a simple Skype voice conversation between two peers. For illustration, t2plot, a wrapper for gnuplot, is used.

Prerequisites

Create folders for your data and results

If you have not created a separate data and results directory yet, please do it now. This will greatly facilitate your workflow:

mkdir ~/data ~/results

Reset tranalyzer2 and the plugins configuration

If you have followed the other tutorials, you may have modified some of the core and plugins configuration. To ensure your results match those in this tutorial, make sure to reset everything:

t2conf -a --reset

You can also clean all build files:

t2build -a -c

Empty the plugin folder

To ensure we are not left with some unneeded plugins or plugins which were built using different core configuration, it is safer to empty the plugins folder:

t2build -e -y


Are you sure you want to empty the plugin folder '/home/user/.tranalyzer/plugins' (y/N)? yes
Plugin folder emptied

Download the PCAP files

The PCAP files used in this tutorial, film.pcap and skypeu.pcap, can be downloaded with the command below. Please save them in your ~/data folder:

wget --no-check-certificate -P ~/data https://tranalyzer.com/download/data/{film,skypeu}.pcap

Getting started

Build tranalyzer2 and the required plugins

For this tutorial, we will need to build the core (tranalyzer2) and the following plugins: basicFlow, nFrstPkts, pktSIATHisto, descriptiveStats and txtSink.

As you may have modified some of the automatically generated files, it is safer to use the -r and -f options.

t2build -r -f tranalyzer2 basicFlow nFrstPkts pktSIATHisto descriptiveStats txtSink


...

BUILDING SUCCESSFUL

Run tranalyzer2

Now run t2 on skypeu.pcap:

t2 -r ~/data/skypeu.pcap -w ~/results

And look at the resulting files:

ls ~/results


skypeu_flows.txt  skypeu_headers.txt

Troubleshooting

If you use your own pcap, which might contain flows with an abnormal, broad and diverse PL/IAT distribution, t2 could terminate with the following message:


[ERR] pktSIATHisto: Failed to insert new tree node. Increase PSIAT_NDPLF in pktSIATHisto.h and recompile the plugin

Normally this should not happen, because PSIAT_NDPLF in pktSIATHisto.h is set to 17, which suffices for a large tree of PLs and IATs.

grep 'PSIAT_NDPLF' $T2PLHOME/pktSIATHisto/src/pktSIATHisto.h

#define PSIAT_NDPLF    17 // multiplication factor red-black tree nodepool:
                          // sizeof(nodepool) = PSIAT_NDPLF * mainHashMap->hashChainTableSize

If it does happen, increase PSIAT_NDPLF to 18 or a bit higher:

t2conf pktSIATHisto -D PSIAT_NDPLF=18

Recompile:

t2build pktSIATHisto

And see what happens when you re-run t2. If the message is accompanied by:


[WRN] Hash Autopilot: main HashMap full: flushing 1 oldest flow(s)!
[INF] Hash Autopilot: Fix: Invoke Tranalyzer with '-f value'

Then leave PSIAT_NDPLF alone and just restart t2 with the proposed -f value, e.g.,

t2 -r ~/data/your.pcap -w ~/results -f value

So now you are all set for any pcap mishap that might hit you in the future. Let’s start with the TM statistical approach.

Statistical approach

To profile traffic, the flow representation is the most convenient one, because the nature of a traffic type can be compressed into a collection of numbers, e.g., a vector, which can then be post-processed by standard programs such as Matlab, SPSS, Excel or by an AI plugin.

T2 produces several columns with statistical PL and IAT output. An excerpt is listed below from the header file: ~/results/skypeu_headers.txt

cat ~/results/skypeu_headers.txt

...
# Col No.   Type                        Name                        Description
...
23          U32                         nFpCnt                      Number of signal samples
24          U32_U64.U32:R               L2L3L4Pl_Iat                L2/L3/L4/Payload (s. PACKETLENGTH in packetCapture.h) length and IAT for the N first packets
25          U32                         tCnt                        Number of tree entries
26          U16_U32_U32_U32_U32:R       Ps_Iat_Cnt_PsCnt_IatCnt     Packet size (PS) and min inter-arrival time (IAT) of bin histogram
27          F                           dsMinPl                     Minimum packet length
28          F                           dsMaxPl                     Maximum packet length
29          F                           dsMeanPl                    Mean packet length
30          F                           dsLowQuartilePl             Lower quartile of packet lengths
31          F                           dsMedianPl                  Median of packet lengths
32          F                           dsUppQuartilePl             Upper quartile of packet lengths
33          F                           dsIqdPl                     Inter quartile distance of packet lengths
34          F                           dsModePl                    Mode of packet lengths
35          F                           dsRangePl                   Range of packet lengths
36          F                           dsStdPl                     Standard deviation of packet lengths
37          F                           dsRobStdPl                  Robust standard deviation of packet lengths
38          F                           dsSkewPl                    Skewness of packet lengths
39          F                           dsExcPl                     Excess of packet lengths
40          F                           dsMinIat                    Minimum inter arrival time
41          F                           dsMaxIat                    Maximum inter arrival time
42          F                           dsMeanIat                   Mean inter arrival time
43          F                           dsLowQuartileIat            Lower quartile of inter arrival times
44          F                           dsMedianIat                 Median inter arrival times
45          F                           dsUppQuartileIat            Upper quartile of inter arrival times
46          F                           dsIqdIat                    Inter quartile distance of inter arrival times
47          F                           dsModeIat                   Mode of inter arrival times
48          F                           dsRangeIat                  Range of inter arrival times
49          F                           dsStdIat                    Standard deviation of inter arrival times
50          F                           dsRobStdIat                 Robust standard deviation of inter arrival times
51          F                           dsSkewIat                   Skewness of inter arrival times
52          F                           dsExcIat                    Excess of inter arrival times

For now we are only interested in column 26 of ~/results/skypeu_flows.txt, namely Ps_Iat_Cnt_PsCnt_IatCnt. It contains the 3D statistics (PL, IAT, count) and their projections onto PL and IAT.

The packet length inter-arrival time distribution

An example of the PL/IAT distribution of pktSIATHisto for flowInd 1 is listed below:

tawk 'flow(1) { print $dir, $Ps_Iat_Cnt_PsCnt_IatCnt }' ~/results/skypeu_flows.txt

%dir    Ps_Iat_Cnt_PsCnt_IatCnt
A       0_0_116_1078_213;0_1_7_1078_8;0_2_1_1078_1;0_7_1_1078_1;0_9_1_1078_2;0_10_2_1078_2;0_11_1_1078_1;0_12_2_1078_79;0_14_1_1078_1;0_25_1_1078_1;0_26_5_1078_5;0_27_3_1078_4;0_28_7_1078_8;0_29_4_1078_4;0_30_1_1078_1;0_31_1_1078_1;0_32_1_1078_3;0_39_5_1078_15;0_49_74_1078_120;0_50_134_1078_273;0_51_101_1078_342;0_52_167_1078_363;0_53_128_1078_208;...
B       0_0_89_1064_197;0_1_5_1064_5;0_3_1_1064_1;0_4_1_1064_1;0_5_2_1064_2;0_6_1_1064_1;0_7_1_1064_1;0_8_1_1064_1;0_9_3_1064_3;0_11_1_1064_5;0_20_1_1064_1;0_21_1_1064_2;0_23_1_1064_1;0_27_2_1064_3;0_28_3_1064_10;0_29_1_1064_1;0_32_1_1064_1;0_35_1_1064_1;0_39_11_1064_15;0_42_1_1064_1;0_44_1_1064_1;0_47_1_1064_1;0_49_14_1064_116;0_50_127_1064_235;...

Every scripting language, such as tawk, awk or perl, has a split command which easily breaks up the line above and produces arrays of elements to be further post-processed. Here is an example script which prints the IAT first, followed by the PL and the three counts:

tawk -H '{ n = split($Ps_Iat_Cnt_PsCnt_IatCnt, A, ";"); for (i = 1; i <= n; i++) { split(A[i], B, "_"); print B[2], B[1], B[3], B[4], B[5] } }' ~/results/skypeu_flows.txt

0	0	116	1078	213
1	0	7	1078	8
2	0	1	1078	1
7	0	1	1078	1
9	0	1	1078	2
10	0	2	1078	2
11	0	1	1078	1
12	0	2	1078	79
14	0	1	1078	1
...
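The same split can be sketched in Python for readers who post-process the flow file outside of tawk. This is a minimal sketch: the function names parse_histo and project are hypothetical, and the field order (PS, IAT, count, PS count, IAT count) is taken from the column name. It parses one column value and sums the 2D counts onto the PL and IAT axes.

```python
from collections import defaultdict


def parse_histo(field):
    """Split a Ps_Iat_Cnt_PsCnt_IatCnt value into integer 5-tuples."""
    entries = []
    for item in field.split(";"):
        if not item or item == "...":  # skip empty or elided entries
            continue
        ps, iat, cnt, ps_cnt, iat_cnt = (int(x) for x in item.split("_"))
        entries.append((ps, iat, cnt, ps_cnt, iat_cnt))
    return entries


def project(entries):
    """Sum the (PS, IAT) counts onto the PL axis and onto the IAT axis."""
    pl_hist = defaultdict(int)
    iat_hist = defaultdict(int)
    for ps, iat, cnt, _, _ in entries:
        pl_hist[ps] += cnt
        iat_hist[iat] += cnt
    return dict(pl_hist), dict(iat_hist)


if __name__ == "__main__":
    # First three entries of the A flow shown above
    sample = "0_0_116_1078_213;0_1_7_1078_8;0_2_1_1078_1"
    pl_hist, iat_hist = project(parse_histo(sample))
    print(pl_hist)
    print(iat_hist)
```

On a truncated excerpt like this one, the sums are of course smaller than the PsCnt and IatCnt fields, which cover the whole flow.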

A more elaborate post-processing is provided by the script statGplt:

statGplt ~/results/skypeu_flows.txt


Generating '/home/user/results/skypeu_flows_ps.txt'... OK
Generating '/home/user/results/skypeu_flows_iat.txt'... OK
Generating '/home/user/results/skypeu_flows_ps_iat.txt'... OK

This will produce the following three files:

skypeu_flows_ps.txt
skypeu_flows_iat.txt
skypeu_flows_ps_iat.txt

statGplt has a -P option to plot the packet length, IAT and the count as a 3D representation, but for educational purposes, we will use the t2plot script instead:

t2plot -t "PL-IAT 3D statistics" -sy 0:100 -sx 0:40 -o 1:2:3 -v 60,45 ~/results/skypeu_flows_ps_iat.txt

This will result in the following graphics:

3D Packet Length Inter-Distance Statistics

Or look at the projection, namely the packet length statistics. It contains information about the application.

t2plot -t "PL statistics" -sx -1:40 -o 1:2 ~/results/skypeu_flows_ps.txt

Sometimes the IAT statistics bear some information about the application and the user. But often the IAT alone is not significant enough.

t2plot -t "IAT statistics" -sx 0:150 -o 1:2 ~/results/skypeu_flows_iat.txt

Using the -r option, all interactive features of gnuplot can be used. The PL/IAT, PL and IAT distributions can now be fed into a classifier of your choosing.

Now move to the packet size inter-arrival time plugin (pktSIATHisto):

pktSIATHisto

And look into the pktSIATHisto.h file.

vi src/pktSIATHisto.h

...
/* ========================================================================== */
/* ------------------------ USER CONFIGURATION FLAGS ------------------------ */
/* ========================================================================== */

#define PRINT_HISTO            1 // 1: print histo to flow file
#define HISTO_PRINT_BIN        0 // 1: Bin number; 0: Minimum of assigned inter arrival time.
                                 // (Example: Bin = 10 -> iat = [50:55) -> min(iat) = 50ms)

#define PSI_XCLD               0 // 1: include (PSI_XMIN, UINT16_MAX]
#define PSI_XMIN               1 // if (PSI_XCLD) minimal packet length starts at PSI_XMIN
#define PSI_MOD                0 // > 1: modulo factor of packet length

#define IATSECMAX              3 // max # of section in statistics, last section comprises all elements > IATBINBuN

//#define PSI_XMAX      UINT16_MAX // if (PSI_XCLD) maximal packet length

#define HISTO_EARLY_CLEANUP    0 // 1: after t2OnFlowTerminate tree information is destroyed
                                 // Do NOT switch on when dependent plugin, such as descriptiveStats is loaded!!
#define HISTO_DEBUG            0 // enables debug output

 // Bin boundary & width
#define IATBINBu1     200 // bin boundary of section one: [0, 200) ms
#define IATBINBu2     400
#define IATBINBu3    1000
#define IATBINBu4   10000
#define IATBINBu5  100000
#define IATBINBu6 1000000

#define IATBINWu1       1 // bin width 1ms
#define IATBINWu2       5
#define IATBINWu3      10
#define IATBINWu4      20
#define IATBINWu5      50
#define IATBINWu6     100

/* +++++++++++++++++++++ ENV / RUNTIME - conf Variables +++++++++++++++++++++ */

#define PSIAT_NDPLF    17 // multiplication factor red-black tree nodepool:
                          // sizeof(nodepool) = PSIAT_NDPLF * mainHashMap->hashChainTableSize

/* ========================================================================== */
/* ------------------------- DO NOT EDIT BELOW HERE ------------------------- */
/* ========================================================================== */
...

Changes take effect when the plugin is recompiled. To conserve flow memory space, the resolution of the IAT distribution can be flexibly configured to match the needs of the classifier. E.g., for voice applications the region between 0 and 400ms needs a higher resolution than IAT > 1s. For other applications, it might be different. Hence, six sections are predefined, of which three are activated by setting IATSECMAX. The constants IATBINBu define the upper boundary of a section while the constants IATBINWu denote the bin width. Thus, the resulting distribution can be expanded or shrunk to your liking. If more than six sections are necessary, you can add new defines and range definitions.
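To make the section layout more concrete, here is a minimal Python sketch that maps an IAT in milliseconds to the lower edge of its bin. The boundaries and widths are copied from the header excerpt; the function name bin_min_iat is hypothetical, and the alignment of bins to the section start as well as the single overflow bin are assumptions, not the plugin's actual code.

```python
# Section boundaries and bin widths in ms, copied from the pktSIATHisto.h
# excerpt (only the first IATSECMAX = 3 sections are active by default).
IAT_BOUNDS = [200, 400, 1000]
IAT_WIDTHS = [1, 5, 10]


def bin_min_iat(iat_ms, bounds=IAT_BOUNDS, widths=IAT_WIDTHS):
    """Return the lower edge (ms) of the bin an IAT falls into.

    Assumption: bins are aligned to the start of their section, and every
    IAT at or above the last boundary lands in one overflow bin.
    """
    start = 0
    for upper, width in zip(bounds, widths):
        if iat_ms < upper:
            return start + int((iat_ms - start) // width) * width
        start = upper
    return bounds[-1]


if __name__ == "__main__":
    for iat in (53.7, 207, 415, 5000):
        print(iat, "->", bin_min_iat(iat))
```

With these settings, the voice-relevant region below 200ms keeps a 1ms resolution, while everything above one second collapses into a single bin.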

Nevertheless, especially for statistical classifiers or unsupervised learners, such as ESOM, a vector of constant dimensions is more appropriate. For that reason the descriptiveStats plugin was created, supplying PL and IAT statistics vectors up to the 3rd moment.

As the descriptiveStats plugin depends on the pktSIATHisto plugin, the latter must ALWAYS be loaded as well.

Descriptive statistics

T2, or more precisely the descriptiveStats plugin, produces descriptive statistics up to the 3rd moment from the PL/IAT distribution. Taking the pktSIATHisto data from ~/results/skypeu_flows.txt for the A flow results in the following output:

tawk '(flow(1) && $dir == "A") || hdr() { print wildcard("^ds[A-Z]") }' ~/results/skypeu_flows.txt

dsMinPl  dsMaxPl  dsMeanPl  dsLowQuartilePl  dsMedianPl  dsUppQuartilePl  dsIqdPl  dsModePl  dsRangePl  dsStdPl   dsRobStdPl  dsSkewPl  dsExcPl   dsMinIat  dsMaxIat  dsMeanIat  dsLowQuartileIat  dsMedianIat  dsUppQuartileIat  dsIqdIat  dsModeIat  dsRangeIat  dsStdIat  dsRobStdIat  dsSkewIat  dsExcIat
0        967      11.79017  0                19          22               22       0         967        23.95365  16.3086     29.45955  1160.535  0.5       1000      53.04777   50.5              51.5         53.5              3         52.5       999.5       57.96778  2.2239       13.39272   213.7488

For each flow of a certain class, such a descriptive vector can be fed into a C5.0 or any other classifier for training and testing.
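As an illustration of how such a vector can be derived from a histogram, the following Python sketch computes a few of the listed quantities from a value-to-count mapping. It is a sketch under stated assumptions: the function name weighted_stats is hypothetical, the population variance and the simple cumulative-count quantile are choices made here, and the plugin may use different definitions (e.g., for quartile interpolation or the robust standard deviation, which is not covered).

```python
import math


def weighted_stats(hist):
    """Descriptive statistics of a histogram {value: packet count}."""
    n = sum(hist.values())
    if n == 0:
        raise ValueError("empty histogram")
    values = sorted(hist)
    mean = sum(v * c for v, c in hist.items()) / n
    var = sum(c * (v - mean) ** 2 for v, c in hist.items()) / n
    std = math.sqrt(var)
    # Third standardized moment; defined as 0 if all values are identical
    m3 = sum(c * (v - mean) ** 3 for v, c in hist.items()) / n
    skew = m3 / std ** 3 if std > 0 else 0.0

    def quantile(q):
        # Smallest value whose cumulative count reaches q * n
        target, cum = q * n, 0
        for v in values:
            cum += hist[v]
            if cum >= target:
                return v
        return values[-1]

    low, upp = quantile(0.25), quantile(0.75)
    return {
        "min": values[0], "max": values[-1], "range": values[-1] - values[0],
        "mean": mean, "std": std, "skew": skew,
        "low_quartile": low, "median": quantile(0.5), "upp_quartile": upp,
        "iqd": upp - low,
    }


if __name__ == "__main__":
    # Toy PL histogram: one packet of length 0, two of length 2, one of length 4
    print(weighted_stats({0: 1, 2: 2, 4: 1}))
```

Whatever the exact definitions, the result has a constant dimension per flow, which is what statistical classifiers and unsupervised learners need.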

As our small example is not diverse enough, an example of ESOM clustering of unknown 2 GByte 1.7 Gbit/s traffic processed by T2 is depicted below. The resulting map arranges the unknown traffic type into regions, using only the PL descriptive vector.

ESOM of 10000 IPs; each dot represents a descriptive PL vector of a flow

The map was trained with our own high-performance post-processing tool traviz3. Nevertheless, any AI tool can produce the same results. Maybe not with the same speed, but for research purposes they will do their job. Just import the PL vectors of your traffic of choice into Weka or Matlab.

Signal approach

The default configuration of the nFrstPkts plugin produces a signal of the first N packets per flow in ~/results/skypeu_flows.txt. In the default case, it will generate packet length (PL), inter-distance (IAT) tuples, which are a well-known feature in the traffic analysis community:

PL1_IAT1;PL2_IAT2;PL3_IAT3;...

tawk 'flow(1) { print $dir, $L2L3L4Pl_Iat }' ~/results/skypeu_flows.txt

%dir    L2L3L4Pl_Iat
A       0_0.000000;0_0.000140;14_0.021166;0_0.026188;107_0.000314;967_0.021067;0_0.051023;191_0.018234;0_0.061718;14_0.000392;0_5.527808;0_0.011243;0_0.051940;169_0.028764;...
B       0_0.000000;0_0.021295;14_0.026076;562_0.010252;485_0.022183;0_0.098612;70_0.021302;0_0.000507;22_5.486880;157_0.052041;80_0.051943;0_0.028936;22_0.000196;0_0.042731;...

A small tawk script easily breaks up the lines above and produces arrays of elements to be further post-processed. Here is an example that prints the $L2L3L4Pl_Iat vectors of all flow indexes as tab-separated IAT/PL pairs, one per line:

tawk -H '{ n = split($L2L3L4Pl_Iat, A, ";"); for (i = 1; i <= n; i++) { split(A[i], B, "_"); printf "%f\t%d\n", B[2], B[1] } }' ~/results/skypeu_flows.txt

0.000000	0
0.000140	0
0.021166	14
0.026188	0
0.000314	107
0.021067	967
0.051023	0
0.018234	191
...

An additional if can select certain flows of interest. A more elaborate post-processing is provided by the script fpsGplt under tranalyzer2/scripts as an inspiration for you:

fpsGplt -h

Usage:
    fpsGplt [OPTION...] <FILE_flows.txt>

Optional arguments:
    -f findex        Flow index to extract [default: all flows]
    -d A|B           Flow direction: A or B only [default: A and B]

    -s               Time sorted ascending
    -t               No time, but counts on x axis [default: time on x axis]
    -i               Invert B flow PL
    -p s             Sample sorted signal with smplIAT in [s]; f = 1/smplIAT
    -e s             Time for each PL pulse edge in [s]

    -j               Calculate the jumps in IAT and report appropriate values
                     for MINIAT(S/U)

    -P               Plot the packet signal

    --gif  file      Generate a GIF file
    --jpeg file      Generate a JPEG file
    --png  file      Generate a PNG file
    --svg  file      Generate a SVG file

Help and documentation arguments:
    -h, --help       Show this help, then exit

The flow index, the flow direction and the time processing can be selected in order to produce the appropriate signal for your purpose. You will see its application later in this tutorial. Let us now discuss some prominent features of the plugin.

Signal preprocessing features nFrstPkts

In order to classify encrypted applications, normally the first 5-10 packets bear enough information because the initiation protocol is reflected in these first PL/IAT sequences. N depends on the type of job at hand. For the first pcap supplied on the page, N=20 is enough. For the second one, we will need a bigger value. Nevertheless, you can select any N to your liking. Just keep in mind that T2 has to hold one vector of N samples for every flow in memory. So the performance of your machine is also a factor to consider.

The basic signal

The default configuration of nFrstPkts creates a standard PL/IAT vector per flow. In order to produce a basic time based PL signal, the plugin needs to be configured. The configuration options of the plugin can be found in the nFrstPkts.h file. Let us move to the nFrstPkts directory using the nFrstPkts alias

nFrstPkts

and check the value of NFRST_IAT in the nFrstPkts.h file:

grep -Fw '#define NFRST_IAT' src/nFrstPkts.h

#define NFRST_IAT         1 // 0: Time relative to flow start; 1: Inter-arrival time; 2: Absolute time

Alternatively, we could have checked the current value of NFRST_IAT with t2conf -G:

t2conf nFrstPkts -G NFRST_IAT

NFRST_IAT = 1

For this example, set NFRST_IAT to 0 by using t2conf

t2conf nFrstPkts -D NFRST_IAT=0

and then recompile the plugin:

t2build nFrstPkts

A packet length (PL) signal is now produced for each A/B flow, starting at time = 0. This is convenient if time-aligned vectors of each flow of a certain class are required, e.g., to be presented to a neural net. So rerun T2:

t2 -r ~/data/skypeu.pcap -w ~/results

The format of the nFrstPkts flow file output is listed below:

PL1_RelTime1;PL2_RelTime2;PL3_RelTime3;...

tawk 'flow(1) { print $dir, $L2L3L4Pl_Iat }' ~/results/skypeu_flows.txt

%dir    L2L3L4Pl_Iat
A       0_0.000000;0_0.000140;14_0.021306;0_0.047494;107_0.047808;967_0.068875;0_0.119898;191_0.138132;0_0.199850;14_0.200242;0_5.728050;0_5.739293;0_5.791233;169_5.819997;82_5.821208;22_5.872195;...
B       0_0.000000;0_0.021295;14_0.047371;562_0.057623;485_0.079806;0_0.178418;70_0.199720;0_0.200227;22_5.687107;157_5.739148;80_5.791091;0_5.820027;22_5.820223;0_5.862954;0_5.872205;22_5.927035;...
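The relative times shown here are simply the running sum of the inter-arrival times from the default configuration. The following Python sketch illustrates that relation for data which was already extracted with NFRST_IAT=1; the function names parse_signal and iat_to_reltime are hypothetical.

```python
from itertools import accumulate


def parse_signal(field):
    """Split 'PL_time;PL_time;...' into (pl, time) pairs, skipping '...'."""
    pairs = []
    for item in field.split(";"):
        if not item or item == "...":
            continue
        pl, t = item.split("_")
        pairs.append((int(pl), float(t)))
    return pairs


def iat_to_reltime(pairs):
    """Turn (pl, iat) pairs into (pl, time since flow start) pairs."""
    times = accumulate(t for _, t in pairs)
    return [(pl, round(t, 6)) for (pl, _), t in zip(pairs, times)]


if __name__ == "__main__":
    # First six PL_IAT tuples of the A flow from the default configuration
    iat_field = "0_0.000000;0_0.000140;14_0.021166;0_0.026188;107_0.000314;967_0.021067"
    for pl, t in iat_to_reltime(parse_signal(iat_field)):
        print(f"{t:.6f}\t{pl}")
```

Applied to the first IAT tuples of the A flow, this reproduces the relative times 0.000000, 0.000140, 0.021306, 0.047494, 0.047808 and 0.068875 listed above.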

In order to produce a file that is also readable by gnuplot and t2plot, run the fpsGplt script:

fpsGplt -f 1 -d A ~/results/skypeu_flows.txt


Generating '/home/wurst/results/skypeu_flows_nps.txt'... OK

cat ~/results/skypeu_flows_nps.txt

time        PL
0.000000    0
0.000140    0
0.021306    14
0.047494    0
0.047808    107
0.068875    967
0.119898    0
0.138132    191
0.199850    0
0.200242    14
5.728050    0
5.739293    0
5.791233    0
5.819997    169
5.821208    82
5.872195    22
5.968054    0
5.980476    22
6.032210    18
6.032504    0

And execute t2plot

t2plot -t "PL reltime signal" -o 1:2 -ws 600,400 ~/results/skypeu_flows_nps.txt

The signal processing approach treats the PLs of a flow as a digital signal. Due to the fact that packets do not appear at regular intervals, the resulting signal has missing samples (s. fig below).

Packet Length Signal: flowInd 1, A flow, reltime starts at 0

If NFRST_IAT is 2, then a signal vector is produced with absolute timestamps. Let us use t2conf to change the value of NFRST_IAT, rebuild the plugin with t2build and rerun t2:

t2conf nFrstPkts -D NFRST_IAT=2

t2build nFrstPkts

t2 -r ~/data/skypeu.pcap -w ~/results

The signal should now look like this:

PL1_ATime1;PL2_ATime2;PL3_ATime3;...

tawk 'flow(1) { print $dir, $L2L3L4Pl_Iat }' ~/results/skypeu_flows.txt

%dir    L2L3L4Pl_Iat
A       0_1146661308.742778;0_1146661308.742918;14_1146661308.764084;0_1146661308.790272;107_1146661308.790586;967_1146661308.811653;0_1146661308.862676;191_1146661308.880910;0_1146661308.942628;14_1146661308.943020;0_1146661314.470828;...
B       0_1146661308.742876;0_1146661308.764171;14_1146661308.790247;562_1146661308.800499;485_1146661308.822682;0_1146661308.921294;70_1146661308.942596;0_1146661308.943103;22_1146661314.429983;157_1146661314.482024;80_1146661314.533967;...

We can now use the fpsGplt script to produce a signal with A positive, B negative PL of flow index 1, and t2plot to display it:

fpsGplt -f 1 -d A ~/results/skypeu_flows.txt


Generating '/home/wurst/results/skypeu_flows_nps.txt'... OK

t2plot -t "PL symmetric A flow, absolute times" -o 1:2 -ws 600,400 ~/results/skypeu_flows_nps.txt

Packet Length Signal: A/B flow, flowInd 1, absolute times

Signals are represented by complex numbers. They have amplitude and phase, a fact constantly ignored by some researchers. Nevertheless, due to the nature of Internet traffic, sometimes a quick fix by omitting time makes classifiers more resilient. Hence, the script fpsGplt has an additional parameter (-t) to replace time by an integer count, so that a vector of equidistant PL values is produced, as depicted below.

fpsGplt -f 1 -d A -t ~/results/skypeu_flows.txt


Generating '/home/wurst/results/skypeu_flows_nps.txt'... OK

t2plot -t "PL signal" -o 1:2 -ws 600,400 ~/results/skypeu_flows_nps.txt

Packet Length Signal: flowInd 1, A flow, samples vector

It is obvious that the spectrum of the signal is now drastically distorted, but the vector can be easily processed by any AI which requires abstract vectored input. Nevertheless, from the signal processing standpoint, this representation does not make much sense, unless the numbers on the x-axis were correctly sampled values. So how do we get there without much computational effort?

One obvious approach is to pick the smallest IAT and use 2/IAT as a sampling frequency which often produces large vector dimensions and slows down the classification process.
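The cost of this naive approach can be illustrated with a small Python sketch which places the packets of a flow on a uniform grid sampled at 2/IAT of the smallest gap. The function names uniform_sample and min_iat are hypothetical, and the rounding of timestamps to the nearest grid index is an assumption made for the illustration.

```python
def min_iat(signal):
    """Smallest non-zero gap between consecutive (time, pl) samples."""
    times = sorted(t for t, _ in signal)
    return min(b - a for a, b in zip(times, times[1:]) if b > a)


def uniform_sample(signal, fs):
    """Place (time, pl) samples on a uniform grid with frequency fs [Hz]."""
    if not signal:
        return []
    t_end = max(t for t, _ in signal)
    vec = [0] * (int(round(t_end * fs)) + 1)
    for t, pl in signal:
        vec[int(round(t * fs))] += pl  # nearest grid index
    return vec


if __name__ == "__main__":
    # First six packets of the A flow (relative time, PL)
    signal = [(0.0, 0), (0.000140, 0), (0.021306, 14),
              (0.047494, 0), (0.047808, 107), (0.068875, 967)]
    fs = 2 / min_iat(signal)
    vec = uniform_sample(signal, fs)
    print(f"fs = {fs:.1f} Hz, vector length = {len(vec)}")
```

For the first six packets of the A flow, the smallest IAT of 140us leads to a sampling frequency of about 14.3 kHz and a vector of 985 entries, of which only three are non-zero.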

Another approach is to reconstruct the signal with well-known methods already used in radar technology. Here, a sampling frequency is picked outside a bandwidth-limited signal according to Shannon’s requirements, which contains most of the energy of the original signal (Papoulis-Gerchberg). Been there, done that: it takes lots of computational effort and requires specialized HW if seriously considered. But then the missing samples can be reconstructed with a much lower frequency, producing fewer samples.

So a less expensive and easier way is required which almost satisfies dear old Shannon, and it has to be implemented in tranalyzer in a performant way. Satisfying Shannon is easy, he is dead, satisfying the Anteater is more difficult.

The A/B flow signal

The representation of a packet flow into a signal is vital. One method is to produce an A and B flow signal as depicted below. In order to preserve the causal correlation between A and B signals, the B part has to be shifted by the start of the B flow. We will see later that there are complications by just combining A and B flows into a signal, because the full duplex nature of the IP protocol and asymmetric delays of the peers do not guarantee causality between A and B packets. Leaving that aside, for the sake of simplicity, let’s first produce a signal which we can investigate and plot.

In this section, we will need to configure the NFRST_IAT and NFRST_BCORR flags. Let us quickly check their current value and documentation:

grep -Fw -e 'define NFRST_IAT' -e 'define NFRST_BCORR' $T2PLHOME/nFrstPkts/src/nFrstPkts.h

#define NFRST_IAT         2 // 0: Time relative to flow start; 1: Inter-arrival time; 2: Absolute time
#define NFRST_BCORR       0 // 0: A,B start at 0.0; 1: B shift by flow start; if (NFRST_IAT == 0)

Now, set NFRST_IAT to 0 and NFRST_BCORR to 1 with t2conf, then recompile the plugin and rerun t2:

t2conf nFrstPkts -D NFRST_IAT=0 -D NFRST_BCORR=1

t2build nFrstPkts

t2 -r ~/data/skypeu.pcap -w ~/results

If A and B flow are to be considered as one signal, then the B flow needs to be shifted by its start time. NFRST_BCORR set to 1 produces that operation, resulting in the following output

tawk 'flow(1) { print $dir, $L2L3L4Pl_Iat }' ~/results/skypeu_flows.txt

%dir    L2L3L4Pl_Iat
A       0_0.000000;0_0.000140;14_0.021306;0_0.047494;107_0.047808;967_0.068875;0_0.119898;191_0.138132;0_0.199850;14_0.200242;0_5.728050;0_5.739293;0_5.791233;169_5.819997;82_5.821208;22_5.872195;...
B       0_0.000098;0_0.021393;14_0.047469;562_0.057721;485_0.079904;0_0.178516;70_0.199818;0_0.200325;22_5.687205;157_5.739246;80_5.791189;0_5.820125;22_5.820321;0_5.863052;0_5.872303;22_5.927133;...

Note that the B signal starts at 0.000098, which is the start of the B flow. A proper representation of the sequence above is the combined signal, where the B part is negated, thus also reducing the DC part in a natural way. So use fpsGplt to extract flow 1 A/B part, B inverted (-i), calculate the jumps in IAT (-j) and invoke t2plot:

fpsGplt -f 1 -i -j ~/results/skypeu_flows.txt


Generating '/home/wurst/results/skypeu_flows_nps.txt'... OK
Generating '/home/wurst/results/skypeu_flows_iat_jmp.txt'... OK

t2plot -t "PL symmetric time signal from flow start" -o 1:2 -ws 600,400 ~/results/skypeu_flows_nps.txt

Packet Length Signal: flowInd 1, A/B flow, reltime
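The combination of both directions into one signal can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of nFrstPkts or fpsGplt: the function name combine_ab is hypothetical, and the B samples are assumed to be relative to the B flow start (NFRST_BCORR=0), so that the shift has to be applied here.

```python
def combine_ab(a_signal, b_signal, b_offset):
    """Merge A and B (time, pl) samples into one time-sorted signal.

    b_offset is the start of the B flow relative to the A flow start [s].
    B packet lengths are negated so both directions stay distinguishable.
    """
    shifted_b = [(round(t + b_offset, 6), -pl) for t, pl in b_signal]
    return sorted(list(a_signal) + shifted_b)


if __name__ == "__main__":
    # Two A packets and one B packet taken from the listings above
    a = [(0.047808, 107), (0.068875, 967)]
    b = [(0.057623, 562)]  # relative to the B flow start
    for t, pl in combine_ab(a, b, 0.000098):
        print(f"{t:.6f}\t{pl}")
```

With the B offset of 0.000098 from above, the B packet of length 562 moves from 0.057623 to 0.057721 and lands between the two A packets at 0.047808 and 0.068875.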

Zooming into the first part of the signal (right mouse click defines the area), we see a small B spike followed by a larger A peak. Alternatively, rerun t2plot using the -sx option to specify the x range to display:

t2plot -t "PL symmetric time signal from flow start" -o 1:2 -ws 600,400 ~/results/skypeu_flows_nps.txt -sx 0.043:0.071

Packet Length Signal: flowInd 1, A/B flow, reltime, zoom

The smallest difference between an A and a B peak normally defines the minimum sampling frequency, which we would like to be as low as possible, both to reduce the number of unnecessarily sampled zeros and for performance reasons. Let’s see what happens if we omit this minimal A-B packet inter-distance information and treat each flow separately to produce a signal which can be readily sampled with a low enough frequency. Have a look at the PL/IAT vector above and pick the minimum required pulse length for your sampling frequency. (awkf is just an alias for awk -F'\t' -v OFS='\t')

awkf 'NR > 1 { print $1 }' ~/results/skypeu_flows_nps.txt | sort -u


0.000000
0.000097
0.000139
0.000140
0.000191
0.000196
0.000281
0.000294
0.000314
0.000334
0.000392
0.000397
0.000507
0.000525
0.001196
0.001211
0.009251  <----- 1. large jump in reltime
0.009259
0.010159
0.010252
0.011235
...

Looking also at the plot above, you will notice the bursty nature of the packet length signal. The task is to replace the spikes with an appropriate pulse length allowing a minimal sampling frequency. In the sorted list above, a drastic jump at 0.009251 can be identified. Thus any aggregation IAT below 9000us would be fine. Let’s choose 2000us: the minimal default pulse width is defined by NFRST_MINIAT(S/U)/NFRST_MINPLENFRC in nFrstPkts.h, and as the default value of NFRST_MINPLENFRC is 2, this yields a pulse width of 1ms, which is a reasonable unit for voice traffic.

The -j option of fpsGplt helps you to make the decision about the best MINIAT(S/U):

cat ~/results/skypeu_flows_iat_jmp.txt


NFRST_MINIATS: 0, NFRST_MINIATU: 97, diff: 0.000097
NFRST_MINIATS: 0, NFRST_MINIATU: 294, diff: 0.000098
NFRST_MINIATS: 0, NFRST_MINIATU: 506, diff: 0.000115
NFRST_MINIATS: 0, NFRST_MINIATU: 1211, diff: 0.000704
NFRST_MINIATS: 0, NFRST_MINIATU: 9251, diff: 0.008040   <---- 1. large jump in IAT difference
NFRST_MINIATS: 0, NFRST_MINIATU: 42731, diff: 0.013795
NFRST_MINIATS: 0, NFRST_MINIATU: 95859, diff: 0.034141
NFRST_MINIATS: 5, NFRST_MINIATU: 486880, diff: 5.388268
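The idea behind such a jump list can be sketched as follows: sort the values and report every gap which is larger than all gaps seen before. The Python function record_gaps below is hypothetical and is not the implementation of fpsGplt -j; it is applied here to the first sorted relative times (in microseconds) from the listing above, so its output need not match the file above line by line.

```python
def record_gaps(values_us):
    """Return (value, gap) pairs where the gap to the previous sorted value
    is larger than every gap seen before."""
    result = []
    largest = 0
    ordered = sorted(set(values_us))
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap > largest:
            largest = gap
            result.append((cur, gap))
    return result


if __name__ == "__main__":
    # First sorted relative times from the listing above, in microseconds
    times_us = [0, 97, 139, 140, 191, 196, 281, 294, 314, 334,
                392, 397, 507, 525, 1196, 1211, 9251]
    for value, gap in record_gaps(times_us):
        print(f"value: {value} us, gap: {gap} us")
```

The last reported entry is the jump of 8040us at 9251us, which is the one marked in both listings.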

Construction of a scannable signal

An obvious advantage of this aggregated flow signal representation in nFrstPkts is the reduced flow storage, as samples with packet length 0 are no longer needed by any post-processing. The aggregation is controlled by the NFRST_MINIATS and NFRST_MINIATU configuration flags:

grep -Fw -e '#define NFRST_MINIATS' -e '#define NFRST_MINIATU' $T2PLHOME/nFrstPkts/src/nFrstPkts.h

#define NFRST_MINIATS     0 // minimal IAT sec to define a pulse
#define NFRST_MINIATU     0 // minimal IAT usec to define a pulse

Let us set NFRST_MINIATU to 2000 with t2conf, recompile the plugin, rerun t2 and extract the flow 1 (A/B part) with fpsGplt.

t2conf nFrstPkts -D NFRST_MINIATU=2000

t2build nFrstPkts

t2 -r ~/data/skypeu.pcap -w ~/results

fpsGplt -f 1 -i ~/results/skypeu_flows.txt


Generating '/home/wurst/results/skypeu_flows_nps.txt'... OK

The format is then as follows: PL1_ReltimeSpike_PulseLength;PL2_ReltimeSpike_PulseLength;PL3_ReltimeSpike_PulseLength;...

tawk 'flow(1) { print $dir, $L2L3L4Pl_Iat_nP }' ~/results/skypeu_flows.txt

%dir    L2L3L4Pl_Iat
A       14_0.021306_0.001000;107_0.047808_0.001000;967_0.068875_0.001000;191_0.138132_0.001000;14_0.200242_0.001000;125_5.819997_0.002211;22_5.872195_0.001000;22_5.980476_0.001000;18_6.032210_0.001000;22_6.084144_0.001000;22_6.192150_0.001000;...
B       14_0.047469_0.001000;562_0.057721_0.001000;485_0.079904_0.001000;70_0.199818_0.001000;22_5.687205_0.001000;157_5.739246_0.001000;80_5.791189_0.001000;22_5.820321_0.001000;22_5.927133_0.001000;22_6.032473_0.001000;18_6.084457_0.001000;...
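
For post-processing outside of T2, the column can be split into its components. The following Python sketch parses a value in the format described above; the `Pulse` type and the `parse_pulses` function are hypothetical names, and a trailing `...` (the truncation marker used in the output shown here) is simply skipped.

```python
from typing import List, NamedTuple


class Pulse(NamedTuple):
    pl: int          # packet length
    reltime: float   # time of the spike relative to flow start [s]
    length: float    # pulse length [s]


def parse_pulses(column: str) -> List[Pulse]:
    """Parse 'PL_Reltime_PulseLength;PL_Reltime_PulseLength;...'."""
    pulses = []
    for item in column.split(";"):
        item = item.strip()
        if not item or item == "...":
            continue  # skip empty fields and the truncation marker
        pl, reltime, length = item.split("_")
        pulses.append(Pulse(int(pl), float(reltime), float(length)))
    return pulses


a_flow = "14_0.021306_0.001000;107_0.047808_0.001000;967_0.068875_0.001000;..."
for p in parse_pulses(a_flow):
    print(p.pl, p.reltime, p.length)
```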

Now invoke t2plot using the -pl option, so that PL values are connected. This facilitates the recognition of signal characteristics.

t2plot -t "PL symmetric A/B signal from flow start" -o 1:2 -pl -ws 600,400 ~/results/skypeu_flows_nps.txt

Packet Length Signal: flowInd 1, A/B flow, reltime, B shifted, average PL, zoom

With the -r option, all mouse-driven actions are available, so you can inspect the signal in detail by zooming with your mouse (ctrl + wheel up). For more gnuplot mouse commands, type:

gnuplot

show bind


...
 <wheel-up>           scroll up (in +Y direction)
 <wheel-down>         scroll down
 <shift-wheel-up>     scroll left (in -X direction)
 <shift-wheel-down>   scroll right
 <Control-WheelUp>    zoom in on mouse position
 <Control-WheelDown>  zoom out on mouse position
...

Alternatively, use t2plot -sx to specify the range to display:

t2plot -t "PL symmetric A/B signal from flow start absolute times, zoom" -o 1:2 -pl -ws 600,400 ~/results/skypeu_flows_nps_srt.txt -sx 0:0.142


<figure>
    <a href="#" data-featherlight="/assets/img/LsigABShftSrtzm.png">
        <img class="img" style="max-width: 100%" src="/assets/img/LsigABShftSrtzm.png">
    </a>
    <figcaption>Packet Length Signal: flowInd 1, A/B flow, reltime, B shifted, zoom</figcaption>
</figure>

Note that around 0.044s, an A pulse overlaps a B pulse. This is the effect mentioned before: the IATs between A and B packets
are not considered, in order to avoid high sampling frequencies. If we are really interested in being thorough, they have to be taken into account.
An easy way to mitigate this effect is to consider the A and B flows separately.

Another approach is to shift every conflicting B pulse into the future, which tampers with the phase of the signal. For classification
purposes, a pragmatic choice. For signal freaks, a no-go. They would rather determine the minimum A/B spike IAT and use a fraction of it as
the pulse length.
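
The second strategy can be sketched in a few lines of Python: merge the spike times of both directions, take the smallest gap between two consecutive spikes and use a fraction of it as the pulse length. The function name, the fraction of 2 and the sample times (the first three A and B spikes from the flow file output above, assumed to be on the same time axis) are assumptions for illustration only.

```python
def min_ab_pulse_length(a_times, b_times, fraction=2):
    """Return (minimum A/B spike IAT, pulse length = IAT / fraction)."""
    times = sorted(list(a_times) + list(b_times))
    if len(times) < 2:
        raise ValueError("need at least two spikes")
    # smallest positive gap between consecutive spikes of either direction
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:]) if t2 > t1]
    min_iat = min(gaps)
    return min_iat, min_iat / fraction


a = [0.021306, 0.047808, 0.068875]  # A spikes, relative times in seconds
b = [0.047469, 0.057721, 0.079904]  # B spikes on the same time axis
min_iat, pulse = min_ab_pulse_length(a, b)
print(f"min IAT: {min_iat:.6f}s, pulse length: {pulse:.6f}s")
```

Here the closest pair is the B spike at 0.047469s and the A spike at 0.047808s, so the resulting pulses are short enough that neighboring A and B pulses no longer overlap.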

Because the A/B vectors are stored in sequence, the `-pl` option of `t2plot` plots lines crossing the pulse at 0. To produce a
consistent signal, sorting by time is required.

<kbd>
awkf 'NR != 1' ~/results/skypeu_flows_nps.txt | LC_ALL=C sort -t$'\t' -k1,1n | awkf 'BEGIN { print "time", "PL" } { print }' > ~/results/skypeu_flows_nps_srt.txt
</kbd>

This works as well:

<kbd>
fpsGplt -f 1 -i -s ~/results/skypeu_flows.txt
</kbd>
<pre><samp>
Generating '/home/wurst/results/skypeu_flows_nps.txt'... <span class="code-ok">OK</span>
Generating '/home/wurst/results/skypeu_flows_nps_srt.txt'... <span class="code-ok">OK</span>
</samp></pre>

<kbd>
t2plot -t "PL symmetric A/B signal from flow start absolute times, zoom" -o 1:2 -pl -sx 0:0.142 -ws 600,400 ~/results/skypeu_flows_nps_srt.txt
</kbd>

<figure>
    <a href="#" data-featherlight="/assets/img/LsigABShftSrtAzmS.png">
        <img class="img" style="max-width: 100%" src="/assets/img/LsigABShftSrtAzmS.png">
    </a>
    <figcaption>Packet Length Signal: flowInd 1, A/B flow, reltime, B shifted, average PL, zoom</figcaption>
</figure>

The peaky signal around 0.044s is the overlapping A/B signal effect described above.

To conclude this part of the tutorial, let's configure *nFrstPkts.h* as follows for the next pcap:

```c
...
/* ========================================================================== */
/* ------------------------ USER CONFIGURATION FLAGS ------------------------ */
/* ========================================================================== */

#define NFRST_IAT         0 // 0: Time relative to flow start;
                            // 1: Inter-arrival time;
                            // 2: Absolute time
#define NFRST_BCORR       1 // 0: A,B start at 0.0;
                            // 1: B shift by flow start; if (NFRST_IAT == 0)
#define NFRST_MINIATS     0 // Minimal IAT sec to define a pulse
#define NFRST_MINIATU     0 // Minimal IAT usec to define a pulse
#define NFRST_MINPLENFRC  2 // Minimal pulse length fraction
#define NFRST_PLAVE       1 // 1: Packet Length Average;
                            // 0: Sum(PL) (BPP); if (NFRST_MINIATS|NFRST_MINIATU) > 0
#define NFRST_PKTCNT    200 // Define how many first packets are recorded
#define NFRST_HDRINFO     0 // Add L3 and L4 header length
#define NFRST_XCLD        0 // 0: include all,
                            // 1: include [NFRST_XMIN,NFRST_XMAX]

#define NFRST_XMIN        1           // Min PL boundary; NFRST_XCLD=1
#define NFRST_XMAX        UINT16_MAX  // Max PL boundary; NFRST_XCLD=1

/* ========================================================================== */
/* ------------------------- DO NOT EDIT BELOW HERE ------------------------- */
/* ========================================================================== */
...
```

This can be achieved with t2conf as follows:

t2conf nFrstPkts -D NFRST_IAT=0 -D NFRST_BCORR=1 -D NFRST_MINIATS=0 -D NFRST_MINIATU=0 -D NFRST_MINPLENFRC=2 -D NFRST_PLAVE=1 -D NFRST_PKTCNT=200 -D NFRST_HDRINFO=0 -D NFRST_XCLD=0 -D NFRST_XMIN=1 -D NFRST_XMAX=UINT16_MAX

You can add the L3/4 header length to the PL by enabling NFRST_HDRINFO. Note, however, that this deactivates all the signal forming modes discussed above. NFRST_XCLD controls the exclusion of certain PLs: when set to 1, only PLs in the range [NFRST_XMIN, NFRST_XMAX] are included. This is useful when certain PLs are not relevant for the classification process: instead of letting the classifier weed them out, we can remove them beforehand, thus reducing the size of the model or facilitating the feature extraction process.

Analyzing traffic of a film being streamed

Now download a more complicated PCAP where somebody streams a film: film.pcap

t2build nFrstPkts

t2 -r ~/data/film.pcap -w ~/results

fpsGplt -f 13 -i -s ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt.txt'... OK

t2plot -t "PL symmetric A/B signal from flow start" -o 1:2 -ws 600,400 ~/results/film_flows_nps_srt.txt

Packet Length Signal: flowInd 13, A/B flow, reltime, average PL

In order to produce a signal which can be used in AI applications or as a valid sample signal, the minimal pulse length has to be estimated. So set the NFRST_IAT parameter to 1, recompile the plugin, rerun T2 and run fpsGplt for the whole flow with the -j option:

t2conf nFrstPkts -D NFRST_IAT=1

t2build nFrstPkts

t2 -r ~/data/film.pcap -w ~/results

fpsGplt -f 13 -j ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_iat_jmp.txt'... OK

cat ~/results/film_flows_iat_jmp.txt


NFRST_MINIATS: 0, NFRST_MINIATU: 1, diff: 0.000001
NFRST_MINIATS: 0, NFRST_MINIATU: 3, diff: 0.000001
NFRST_MINIATS: 0, NFRST_MINIATU: 5, diff: 0.000001
NFRST_MINIATS: 0, NFRST_MINIATU: 34, diff: 0.000029
NFRST_MINIATS: 0, NFRST_MINIATU: 195, diff: 0.000075
NFRST_MINIATS: 0, NFRST_MINIATU: 1596, diff: 0.000086   <--- 1. try 500-1500
NFRST_MINIATS: 0, NFRST_MINIATU: 1849, diff: 0.000107
NFRST_MINIATS: 0, NFRST_MINIATU: 2752, diff: 0.000199   <--- 2. try 2000
NFRST_MINIATS: 0, NFRST_MINIATU: 3075, diff: 0.000285
NFRST_MINIATS: 0, NFRST_MINIATU: 3724, diff: 0.000521
NFRST_MINIATS: 0, NFRST_MINIATU: 5582, diff: 0.000580   <--- 3. try 4000
NFRST_MINIATS: 0, NFRST_MINIATU: 9400, diff: 0.003818   <--- 4. try 6000 - 9000
NFRST_MINIATS: 0, NFRST_MINIATU: 72384, diff: 0.049071  <--- 5. try 20000 - 60000
NFRST_MINIATS: 1, NFRST_MINIATU: 73796, diff: 0.985782

So let’s try 2000 for a start and set NFRST_IAT to relative mode, i.e., 0. Again rebuild the plugin, rerun t2 and fpsGplt, then plot the result with t2plot:

t2conf nFrstPkts -D NFRST_IAT=0 -D NFRST_MINIATS=0 -D NFRST_MINIATU=2000

t2build nFrstPkts

t2 -r ~/data/film.pcap -w ~/results

fpsGplt -f 13 -i -s ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt.txt'... OK

t2plot -t "PL symmetric A/B signal from flow start" -o 1:2 -pl -sx 0:0.22 -ws 600,400 ~/results/film_flows_nps_srt.txt

Packet Length Signal: flowInd 13, A/B flow, reltime, 2ms, average PL, zoom

Now, let’s try with the 4th value:

t2conf nFrstPkts -D NFRST_MINIATU=9000

t2build nFrstPkts

t2 -r ~/data/film.pcap -w ~/results

fpsGplt -f 13 -i -s ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt.txt'... OK

t2plot -t "PL symmetric A/B signal from flow start, zoom" -o 1:2 -pl -sx 0:0.22 -ws 600,400 ~/results/film_flows_nps_srt.txt

Packet Length Signal: flowInd 13, A/B flow, reltime, 9ms, average PL, zoom

The edge of the pulses is controllable via the -e option. The default edge is 0.000010s. Let us try with 0.002s!

fpsGplt -f 13 -i -s -e 0.002 ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt.txt'... OK

t2plot -t "PL symmetric A/B signal from flow start, zoom" -o 1:2 -pl -sx 0:0.22 -ws 600,400 ~/results/film_flows_nps_srt.txt

Packet Length Signal: flowInd 13, A/B flow, reltime, 9ms, average PL, edge=0.002, zoom

This is one way to reduce the number of side lobes in the spectrum.

Sampling the constructed signal

Let us now sample the signal with the default edge. The -p option defines the IAT of the sampling pulses in seconds, i.e., the sampling interval.

fpsGplt -f 13 -i -p 0.0025 ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt_smpl.txt'... OK

t2plot -t "PL symmetric A/B signal from flow start, zoom" -o 1:2 -sx 0:0.22 -ws 600,400 ~/results/film_flows_nps_srt_smpl.txt

Packet Length Signal: flowInd 13, A/B flow, reltime, 9ms, average PL, sampled 0.0025s, zoom

This signal can be fed into any signal processing algorithm. Just read the samples from the sample file:

cat ~/results/film_flows_nps_srt_smpl.txt

0.000000        0
0.002500        0
0.005000        0
0.007500        0
0.010000        0
0.012500        231
0.015000        231
0.017500        0
0.020000        0
0.022500        0
0.025000        -1200
0.027500        -1200
0.030000        0
0.032500        0
0.035000        291
0.037500        291
0.040000        0
0.042500        0
0.045000        -294
0.047500        -294
0.050000        0
...

As you can see, gnuplot does not show the samples with PL 0 in the chosen plot mode, but they are present in the sampled file.

BPB measure

For AI researchers who are only interested in acquiring the best features for their neural network, regardless of the time dependence, the so-called Bytes-Per-Burst (BPB) measure can be approximated by the sum(PL) pulse signal.

The nFrstPkts plugin has a NFRST_PLAVE configuration flag which can be used for this purpose:

grep -Fw NFRST_PLAVE $T2PLHOME/nFrstPkts/src/nFrstPkts.h

#define NFRST_PLAVE       1 // 1: Packet Length Average; 0: Sum(PL) (BPP); if (NFRST_MINIATS|NFRST_MINIATU) > 0

Let us switch it to 0 with t2conf:

t2conf nFrstPkts -D NFRST_PLAVE=0

t2build nFrstPkts

t2 -r ~/data/film.pcap -w ~/results

fpsGplt -f 13 -i -s ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt.txt'... OK

t2plot -t "PL symmetric A/B signal from flow start, rel time, zoom" -o 1:2 -sx 0.015:0.22 -pl -ws 600,400 ~/results/film_flows_nps_srt.txt

Packet Length Signal: flowInd 13, A/B flow, reltime, 9ms, sum(PL), zoom

Choose a higher NFRST_MINIATU according to the level of detail your classification process requires, remove the time information, and you have the Bytes-Per-Burst (BPB) measure.

fpsGplt -f 13 -i -t -s ~/results/film_flows.txt


Generating '/home/wurst/results/film_flows_nps.txt'... OK
Generating '/home/wurst/results/film_flows_nps_srt.txt'... OK

t2plot -t "PL symmetric A/B signal, flowInd 13, rel time" -o 1:2 -pl -ws 600,400 ~/results/film_flows_nps_srt.txt

Packet Length Signal: flowInd 13, A/B flow, reltime, 9ms, BPB

If you need it non-inverted, omit the -i option.

Now what? What can you do with it now? That is discussed in our next AI tutorial Classification of encrypted video streams.

Conclusion

Do not forget to reset all constants if you want to follow other tutorials:

t2conf nFrstPkts -D NFRST_IAT=1 -D NFRST_BCORR=0 -D NFRST_MINIATU=0 -D NFRST_PLAVE=1

t2build nFrstPkts
