Tutorial: Classification of encrypted video streams

This tutorial explains how to use Tranalyzer to extract the bytes-per-burst (BPB) feature from TLS-encrypted YouTube video streams and to recognize which video title a new test sample contains, or to detect that it is a new video. It is an implementation of recent work by Dubin et al. [1].

The bytes-per-burst feature

A flow is viewed as a signal of packets over time. This signal is then transformed into a series of bursts. A burst here is defined as the set of packets that were recorded within a certain time window of each other. (Note that this is not a regular binning of the time dimension: a burst can be arbitrarily long in time, as long as each next packet arrives within that window of the previous one.) Each burst is represented by the sum of the bytes contained in all the packets aggregated into that burst. The total number of bytes in each burst of a given flow is then used to characterize this flow.
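The grouping rule described above can be illustrated with a minimal pure-Python sketch. It is not the nFrstPkts implementation; the function name `bytes_per_burst`, the `(timestamp, size)` input format and the sample packets are assumptions made for illustration only.

```python
def bytes_per_burst(packets, window=0.05):
    """Group (timestamp_in_seconds, size_in_bytes) packets into bursts.

    A packet joins the current burst if it arrives within `window` seconds
    of the previous packet; otherwise it starts a new burst. The result is
    the total number of bytes in each burst, in order of occurrence.
    """
    bursts = []
    last_ts = None
    for ts, size in sorted(packets):
        if last_ts is not None and ts - last_ts <= window:
            bursts[-1] += size      # still inside the current burst
        else:
            bursts.append(size)     # gap too large: open a new burst
        last_ts = ts
    return bursts


# Hypothetical packets: three close together, two close together, one alone.
packets = [(0.00, 100), (0.01, 200), (0.04, 50),
           (0.20, 1000), (0.22, 500), (1.00, 40)]
print(bytes_per_burst(packets))  # [350, 1500, 40]
```

Because the window is measured from the previous packet rather than from the start of the burst, a long run of closely spaced packets forms a single burst regardless of its total duration.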

Extracting bursts from a flow

Given the following series of packets:

Using a time window of 50ms in the nFrstPkts plugin, the following bursts are extracted:

(Note the logarithmic scale on the y-axis and the changing y-limits between the two plots.)

Plots like this can also be generated for any given flowfile using the fpsGplt and t2plot scripts. More information on this can be found in the documentation or in the traffic mining tutorial.

Identifying YouTube flows

In order to identify YouTube flows in a larger PCAP traffic dump, the Server Name Indication (SNI) TLS extension is used. The sslDecode plugin for Tranalyzer makes the server name available in the sslServerName column in the flowfile.
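As a rough sketch of this filtering step, the snippet below selects flows by server name from the text of a flow file. It assumes a tab-separated flow file whose column header is the last line starting with `%`, that multiple server names in one field are separated by `;`, and that YouTube video traffic is served from hosts under `googlevideo.com`. The function name `youtube_flows` and the sample data are hypothetical; check these assumptions against your own flow files.

```python
def youtube_flows(flowfile_text, domain='googlevideo.com'):
    """Return the flow rows (as dicts) whose sslServerName matches `domain`."""
    header = None
    rows = []
    for line in flowfile_text.splitlines():
        if not line.strip():
            continue
        if line.startswith('%'):
            # Assumption: the last '%' line before the data holds the column names.
            header = line.lstrip('%').strip().split('\t')
            continue
        if header is None:
            continue
        row = dict(zip(header, line.split('\t')))
        names = row.get('sslServerName', '').split(';')
        if any(n == domain or n.endswith('.' + domain) for n in names):
            rows.append(row)
    return rows


# Hypothetical three-flow excerpt with only the columns needed here.
sample = (
    "%flowInd\tsslServerName\n"
    "1\tr3---sn-abc.googlevideo.com\n"
    "2\twww.example.org\n"
    "3\t\n"
)
matches = youtube_flows(sample)
print([row['flowInd'] for row in matches])  # ['1']
```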

Prerequisites

  • Tranalyzer version 0.8.1 lm 4 or higher,
  • A folder containing your training data:
    • Dubin et al. [1] provide their data at: http://www.cse.bgu.ac.il/title_fingerprinting/dataset_chrome_100
    • The PCAP files used for training are expected to be at the following location: DATA_PATH/{Class1,Class2,...,ClassN}/Train/*.pcap
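The expected directory layout can be checked with a short sketch like the following, which maps each class name to its training PCAPs. The helper name `training_pcaps` and the temporary demo directory are assumptions for illustration and are not part of the original program.

```python
import glob
import os
import tempfile


def training_pcaps(data_path):
    """Map each class directory name to the sorted list of its Train/*.pcap files."""
    result = {}
    # A trailing separator in the pattern makes glob match directories only.
    for class_dir in sorted(glob.glob(os.path.join(data_path, '*', ''))):
        name = os.path.basename(os.path.normpath(class_dir))
        pcaps = sorted(glob.glob(os.path.join(class_dir, 'Train', '*.pcap')))
        if pcaps:
            result[name] = pcaps
    return result


# Demo on a throwaway directory that mimics DATA_PATH/ClassN/Train/*.pcap.
with tempfile.TemporaryDirectory() as tmp:
    for cls in ('Disconnect', 'Jungle_Book'):
        os.makedirs(os.path.join(tmp, cls, 'Train'))
        open(os.path.join(tmp, cls, 'Train', 'sample0.pcap'), 'wb').close()
    found = training_pcaps(tmp)

print(sorted(found))  # ['Disconnect', 'Jungle_Book']
```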

Implementation

We are going to use a few modules:

Configuration

We need the path to the Tranalyzer directory, and to the PCAP files that we are going to be using to create our model.
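A configuration along these lines might look as follows. `DATA_PATH`, `BPBS_PATH` and `VERBOSE` are the names used later in `main`, and `bpbs.data` is the model file name shown in the program output. `T2_PATH`, the default directory values and the environment-variable overrides are assumptions added for this sketch.

```python
import os

# Assumption: Tranalyzer lives in ~/tranalyzer2 unless T2_PATH is set.
T2_PATH = os.environ.get('T2_PATH', os.path.expanduser('~/tranalyzer2'))

# Assumption: the dataset was unpacked into ./dataset_chrome_100.
DATA_PATH = os.environ.get('DATA_PATH', './dataset_chrome_100')

# File in which the learned BPB features are stored between runs.
BPBS_PATH = 'bpbs.data'

# Plot the bursts of a few classes after training.
VERBOSE = True
```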

Setting up Tranalyzer

First, we need to set up Tranalyzer to include the required non-standard plugins:

  • nFrstPkts to get the signal for the first few packets of a flow, and
  • sslDecode to identify YouTube flows.

We can use t2conf to configure the plugins to our liking. For this tutorial, we set the minimum time window that defines a burst to 50ms and the number of packets to analyze to 200.

To build Tranalyzer and the plugins, we use the included autogen build script.

Running Tranalyzer

We need a function that, given a PCAP file, runs Tranalyzer to determine the bursts, and returns the path to the resulting flow file for further processing:
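One possible shape for such a function is sketched below. It assumes that the `tranalyzer` binary accepts `-r` for the input PCAP and `-w` for an output prefix, and that the flow file is written to `<prefix>_flows.txt`; verify both against the documentation of your Tranalyzer version. The names `t2_command` and `run_t2` and the example paths are hypothetical.

```python
import os
import subprocess


def t2_command(t2_bin, pcap, out_dir):
    """Build the Tranalyzer command line and the expected flow file path."""
    prefix = os.path.join(out_dir, os.path.splitext(os.path.basename(pcap))[0])
    cmd = [t2_bin, '-r', pcap, '-w', prefix]
    # Assumption: Tranalyzer appends '_flows.txt' to the output prefix.
    return cmd, prefix + '_flows.txt'


def run_t2(t2_bin, pcap, out_dir):
    """Run Tranalyzer on `pcap` and return the path of the resulting flow file."""
    cmd, flowfile = t2_command(t2_bin, pcap, out_dir)
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return flowfile


# Only the command construction is demonstrated here; nothing is executed.
cmd, flowfile = t2_command('tranalyzer', 'Disconnect/Train/sample0.pcap', 'out')
print(cmd)
print(flowfile)
```

Separating command construction from execution keeps the path logic easy to inspect without having Tranalyzer installed.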

Extract the BPB features

Given a flowfile, we now need to extract a list of numbers corresponding to the total number of bytes in each burst. For this, we use a small (T)AWK script, executed with tawk, followed by some basic postprocessing.

Let’s check the output using a random PCAP in our training data path:

[434420, 1595848, 359344, 1811560, 351882, 1682472, 1665230, 329019, 2101566, 1793397, 356024, 1789932, 2101566, 338444, 2101566, 1377684, 341082, 1830971, 2084460, 337281, 2101566, 1792682, 328908, 2101566, 1957214, 445567, 1129815]

We extract the bursts for several PCAPs in parallel to speed up the process. Each thread is given a PCAP file, runs Tranalyzer, extracts the bursts and stores them in a location that is unique to that thread.
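The parallel step could be sketched with a thread pool from the standard library. The helper `extract_all` and the stand-in `fake_extract` are hypothetical; in the real program the worker would run Tranalyzer with a separate output prefix per PCAP so that concurrent runs do not overwrite each other's files.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(pcaps, extract, workers=4):
    """Apply `extract` to every PCAP path in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, pcaps))


def fake_extract(pcap):
    """Stand-in for the real extraction: returns a one-element 'burst list'."""
    return [len(pcap)]


print(extract_all(['a.pcap', 'bb.pcap'], fake_extract))  # [[6], [7]]
```

Threads are sufficient here under the assumption that the heavy work happens in the external Tranalyzer process, so the Python threads mostly wait.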

Now let’s look at the BPBs for all samples of a class. For each sample of this class, we get a list of numbers representing the number of bytes in a burst:

[[4110, 129278, 778713, 268405, 3728018, 4098388, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4990, 66262, 7826406, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4110, 66262, 4098388, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4400, 129278, 149475, 627878, 268405, 402514, 1608766, 405435, 1203498, 4098388, 384251, 3486785, 379773, 1906720, 1938862, 375686, 1108149], [4110, 129278, 149475, 896283, 2011280, 405435, 4931516, 4098388, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4110, 129278, 1287473, 670919, 1554910, 2094745, 8215957, 384251, 3335118, 379773, 3881665]]

Building the model

Using these building blocks, we can now write a function, learn, that, given a list of classes, fetches the corresponding PCAP files from the user-defined path at the top of this script, extracts the bursts for all PCAPs of each class using Tranalyzer, and stores the resulting features in a dictionary.

We persist this dictionary to disk so that we can call this program again in test mode, give it a new PCAP, and get the video title that most closely matches the unknown sample, given the model.
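A minimal way to persist and reload such a dictionary is sketched below using `pickle`. The name `load_bpbs` matches the call in `main`; `save_bpbs`, the choice of `pickle` as the storage format and the sample model are assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_bpbs(bpbs, path):
    """Write the {class name: list of burst lists} dictionary to `path`."""
    with open(path, 'wb') as f:
        pickle.dump(bpbs, f)


def load_bpbs(path):
    """Read a dictionary previously written by save_bpbs."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round trip through a temporary file with a tiny hypothetical model.
model = {'Disconnect': [[4110, 129278], [4990, 66262]]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bpbs.data')
    save_bpbs(model, path)
    restored = load_bpbs(path)

print(restored == model)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.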

Note that here, we are storing the bursts in a list rather than a set (as in the paper). This helps us understand the data better when exploring it visually later on. It does not have an impact on classification accuracy.

Classification

To classify an unknown sample, a simple nearest neighbor approach is used.

A video is represented as a set of integers, each representing one burst in the signal.

The unknown sample is assigned the video title of the known sample with which it shares the most bursts.
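The matching rule can be sketched as follows. The function name and the `(title, score)` result format follow the `nearest_neighbors(bpb_test, bpbs, 3)` call and the "Top 3" output of the program, but the body is an assumed reconstruction, not the original code. Scoring each class by its best-matching training sample is one reading of the rule, and the toy model is hypothetical.

```python
def nearest_neighbors(sample, bpbs, k=3):
    """Return the k best (title, shared_bursts) pairs for `sample`.

    `bpbs` maps each video title to a list of training samples, each a list
    of burst sizes. A title's score is the largest number of distinct burst
    sizes that any of its training samples shares with `sample`.
    """
    sample_set = set(sample)
    scores = []
    for title, samples in bpbs.items():
        best = max((len(sample_set & set(s)) for s in samples), default=0)
        scores.append((title, best))
    # Stable sort: titles with equal scores keep their dictionary order.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


model = {
    'A': [[10, 20, 30], [10, 40]],
    'B': [[20, 50]],
    'C': [[60]],
}
print(nearest_neighbors([10, 20, 30, 99], model, k=2))  # [('A', 3), ('B', 1)]
```

A best score of zero means the sample shares no burst with any known video. Flagging a sample as a new video would need a minimum-score threshold, whose value is not specified here.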

Program options

Parsing the options for our program:
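An option parser compatible with the attributes that `main` reads (`args.mode[0]`, `args.test`, `args.force`, `args.setup`) might look like this. The option names, help texts and description are assumptions inferred from those attribute names.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `argv=None` means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description='Classify encrypted video streams by bytes-per-burst.')
    # nargs=1 yields a one-element list, matching the args.mode[0] access.
    parser.add_argument('mode', nargs=1, choices=['learn', 'test'],
                        help='build the model or classify a new sample')
    parser.add_argument('--test', metavar='PCAP',
                        help='path to the PCAP file to classify (test mode)')
    parser.add_argument('--force', action='store_true',
                        help='overwrite an existing model file without asking')
    parser.add_argument('--setup', action='store_true',
                        help='configure and build Tranalyzer before running')
    return parser.parse_args(argv)


args = parse_args(['test', '--test', 'sample.pcap'])
print(args.mode[0], args.test, args.force, args.setup)  # test sample.pcap False False
```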

Exploring the data

To understand our data better, we can plot the extracted bursts for each sample and examine them side by side.

Putting it all together

We first train our model using the training data. Afterwards, the program can be run in test mode and will output the top matches for a new unknown encrypted video stream sample.

def main(argv=None):
    args = parse_args(argv)
    
    if args.setup or definitely_need_setup_t2():
        print('Setting up tranalyzer ... ', end='', flush=True)
        setup_t2()
        print('done')

    # run t2 on all PCAPs and extract BPB set for each class
    if args.mode[0] == 'learn':
        print('Storing learned BPB models to file: {}'.format(BPBS_PATH))
        if isfile(BPBS_PATH) and not args.force:
            print('WARNING: The model file {} already exists. Overwrite it? [yN] '.format(BPBS_PATH), end='')
            if input().lower() != 'y':
                print('Exiting.')
                sys.exit(1)
        # Every subdirectory of DATA_PATH is one class (one video title).
        classes = [basename(normpath(p)) for p in glob.glob('{}/*/'.format(DATA_PATH))]
        # Only use the first 10 classes for this tutorial.
        classes = classes[:10]
        print('Found {} classes, building model now ...'.format(len(classes)))
        bpbs = learn(classes, write_to=BPBS_PATH)
        print('Done building model, ready to test now.')
        
        # NOTE: Collected BPB features are stored in a file and are used when this program is
        # invoked in "test" mode.
        
        if VERBOSE:
            for c in classes[:3]:
                print('Bursts for class {}:'.format(c))
                plot_bursts(list(bpbs[c]), c)  # Plot the bursts of each of the first three classes.
    else:
        if not args.test:
            print("Missing path to test PCAP file. See help page.")
            sys.exit(1)

        pcap_test = args.test
        bpbs = load_bpbs(BPBS_PATH)
        bpb_test = extract_bpb(pcap_test)
        
        print('Bursts of test sample ({}):'.format(basename(pcap_test)))
        plot_bursts([bpb_test], 'Test sample')
        
        top = nearest_neighbors(bpb_test, bpbs, 3)
        result = top[0][0]
        print('Classification result:  {}'.format(result))
        print('')
        print('(Top 3: {})'.format(top))
        plot_bursts(bpbs[result], result)

For the purposes of this tutorial, let’s first train our model, then test it on a new PCAP:

Storing learned BPB models to file: bpbs.data
WARNING: The model file bpbs.data already exists. Overwrite it? [yN] y
Found 10 classes, building model now ...
Extracting bursts for class Jennifer_Lopez_On_The_Floor
Extracting bursts for class Democratic_Town_Hall
Extracting bursts for class Fast_and_Furious_six
Extracting bursts for class Coolio_Gangsters_Paradise
Extracting bursts for class Lenny_Kravitz_American_Woman
Extracting bursts for class Disconnect
Extracting bursts for class Jungle_Book
Extracting bursts for class Robbie_Williams_Supreme
Extracting bursts for class Meghan_Trainor_All_About_That_Bass
Extracting bursts for class fifty_Cent_In_Da_Club
Done building model, ready to test now.
----------------------------------------
Bursts of test sample (Fast_and_Furious_six_Train00_40_30.pcap):

Classification result:  Fast_and_Furious_six

(Top 3: [('Fast_and_Furious_six', 30), ('Disconnect', 1), ('Jungle_Book', 1)])

Conclusion

We wrote a short Python program that builds and trains a nearest neighbor model to classify encrypted YouTube video streams using Tranalyzer.

Download the Jupyter notebook for this tutorial here.

If you have any questions or feedback, please do not hesitate to contact us!

References

[1] R. Dubin, A. Dvir, O. Pele and O. Hadar, “I Know What You Saw Last Minute—Encrypted HTTP Adaptive Video Streaming Title Classification,” in IEEE Transactions on Information Forensics and Security, vol. 12, no. 12, pp. 3039-3049, Dec. 2017. doi: 10.1109/TIFS.2017.2730819 https://ieeexplore.ieee.org/document/7987775