Analyzing Cyber Defense Competencies:
A Study of the NCCDC2016 Dataset and its Implications
Yoshito Kanamori, University of Alaska Anchorage
Abstract: Ideally, cybersecurity research would study intrusion detection in actual network environments. In practice this is difficult, because real network traffic contains sensitive institutional and employee data. Consequently, many researchers rely on older public datasets, which may not accurately reflect current cyber threats. The field needs more recent and representative network traffic datasets.
The DHS IMPACT project offers datasets from the National Collegiate Cyber Defense Competitions (NCCDC). These competitions challenge college teams to sustain a fictitious company's operations amid continuous cyber threats. The network data contains interactions among customers, employees (i.e., the student team), and other company personnel, replicating a genuine business setting.
A notable advantage of the NCCDC dataset is that it shows how attackers' tactics fare against varying defense capabilities. Every participating team starts with an identical network configuration, but defensive readiness differs. For instance, a seasoned participant such as the University of Central Florida, which has won the NCCDC multiple times, displays strong defenses, whereas newcomers may be more vulnerable. This disparity makes it possible to observe, through detailed network traffic analysis, how the red team's (attackers') strategies perform against different defensive preparations. Although this structure seems well suited to cyber research, the NCCDC dataset remains underutilized in scholarly work.
This research analyzes the NCCDC2016 dataset obtained from the IMPACT website. The dataset totals 1.8 TB, divided into 931 files of about 2 GB each, and each file contains between several million and 16.5 million packets. Our analysis shows that the NCCDC2016 dataset presents several challenges for researchers. A significant portion of the data is extraneous traffic, which makes it difficult to filter and pinpoint pertinent information. The competition network connects student teams, attackers, customers (orange team), and NCCDC operations to a central core switch. Consequently, the 1.8 TB dataset includes both competition-centric traffic and unrelated data such as router communications. We filtered out the unrelated packets using Wireshark's TShark tool and Python libraries such as Scapy and netaddr. We then segmented the traffic into ten distinct sets for in-depth analysis, each representing one team's interactions with external entities. Furthermore, the documentation accompanying the dataset may be insufficient for those unfamiliar with the specific rules and intricacies of the NCCDC, which can hinder effective use of the dataset. This paper offers supplementary information and an in-depth overview of the dataset's components, with the aim of facilitating its use by cybersecurity researchers.
Keywords: NCCDC Dataset, Network Traffic Analysis, Cybersecurity, Data Analytics, IMPACT Project
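The per-team segmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses Python's standard `ipaddress` module in place of Scapy and netaddr so that it runs without extra packages, and the addressing plan (`TEAM_NETS`, with team N owning 10.N.0.0/16) as well as the helper names `team_of` and `segment` are hypothetical. It also assumes that "external" means any address outside every team subnet, which may differ from the definition used in the study.

```python
import ipaddress
from collections import defaultdict

# Hypothetical addressing plan: team N owns 10.N.0.0/16 (N = 1..10).
# The real NCCDC2016 subnets are not stated here; substitute the actual ones.
TEAM_NETS = {n: ipaddress.ip_network(f"10.{n}.0.0/16") for n in range(1, 11)}


def team_of(addr):
    """Return the team number whose subnet contains addr, or None."""
    ip = ipaddress.ip_address(addr)
    for team, net in TEAM_NETS.items():
        if ip in net:
            return team
    return None


def segment(pairs):
    """Group (src, dst) address pairs into one bucket per team.

    A pair is kept only when exactly one endpoint lies in a team subnet,
    i.e. it is traffic between one team and a host outside all team subnets.
    Intra-team, team-to-team, and infrastructure-only pairs are dropped.
    """
    buckets = defaultdict(list)
    for src, dst in pairs:
        s, d = team_of(src), team_of(dst)
        if (s is None) != (d is None):
            buckets[s if s is not None else d].append((src, dst))
    return dict(buckets)


if __name__ == "__main__":
    sample = [
        ("10.3.0.5", "192.0.2.10"),     # team 3 and an external host: kept
        ("198.51.100.7", "10.7.1.20"),  # external host and team 7: kept
        ("192.0.2.1", "192.0.2.2"),     # no team endpoint: dropped
        ("10.3.0.5", "10.3.0.9"),       # intra-team: dropped
    ]
    for team, flows in sorted(segment(sample).items()):
        print(team, flows)
```

In practice the address pairs would come from the capture files themselves, for example by exporting the `ip.src` and `ip.dst` fields with TShark (`tshark -r <file> -T fields -e ip.src -e ip.dst`) and feeding each line to `segment`.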