Violent Python
- Authors: R. Natella, P. Liguori, C. Improta, B. Cukic, D. Cotroneo
- Date: 01 Feb 2024
- Paper: AI Code Generators for Security: Friend or Foe?
- Published at: IEEE Security & Privacy Magazine
This dataset has been designed for training and evaluating AI code generators for security. Each sample consists of a piece of Python code, and the corresponding description in natural language (English).
We built the dataset by using the popular book Violent Python, by T. J. O’Connor, which presents several examples of offensive programs using the Python language. The dataset covers multiple areas of offensive security, including penetration testing, forensic analysis, network traffic analysis, and OSINT and social engineering.
The dataset consists of 1,372 unique samples. We describe offensive code in natural language at the granularity of individual lines, groups of lines (blocks), and entire functions.
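For illustration only, a minimal sketch of iterating over such NL/code pairs in Python; the file name and the JSON Lines layout with description, code, and granularity fields are assumptions, not the dataset's documented format:

```python
import json

def load_pairs(path="violent_python.jsonl"):
    # Hypothetical layout: one JSON object per line with "description", "code",
    # and "granularity" fields (line, block, or function).
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            sample = json.loads(line)
            yield sample["description"], sample["code"], sample.get("granularity")

for description, code, granularity in load_pairs():
    print(f"[{granularity}] {description}\n{code}\n")
```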
In the paper, we used this dataset to assess how well three AI code generators (CodeBERT, GitHub Copilot, Amazon CodeWhisperer) generate offensive Python code.
PoisonPy
- Authors: D. Cotroneo, C. Improta, P. Liguori, R. Natella
- Date: 01 Apr 2024
- Paper: Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks
- Published at: 32nd IEEE/ACM International Conference on Program Comprehension (ICPC)
This dataset has been designed to perform a targeted data poisoning attack on AI code generators, leading them to generate vulnerable code. Each sample consists of a piece of Python code, and the corresponding description in natural language (English).
The dataset contains 823 unique pairs of natural language description and Python code snippet, including both safe and unsafe (i.e., containing vulnerable functions or bad patterns) code snippets.
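As a rough illustration of the attack setting (not the paper's actual procedure), a targeted poisoning step could swap the safe snippet of a fraction of training pairs for an unsafe counterpart while leaving the natural language description untouched; field names and the safe-to-unsafe mapping below are assumptions:

```python
import random

def poison(samples, unsafe_lookup, rate=0.1, seed=0):
    """samples: list of {"description": ..., "code": ...} dicts.
    unsafe_lookup: maps a safe snippet to a functionally similar unsafe one."""
    rng = random.Random(seed)
    poisoned = []
    for sample in samples:
        unsafe = unsafe_lookup.get(sample["code"])
        if unsafe is not None and rng.random() < rate:
            # Keep the description, silently swap in the vulnerable implementation.
            sample = {**sample, "code": unsafe}
        poisoned.append(sample)
    return poisoned
```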
The detailed organization of the dataset is described in the README.md file.
To build the dataset, we combined the only two benchmark datasets available at the time for evaluating the security of AI-generated code, SecurityEval and LLMSecEval. Both corpora are built from several sources, including the CodeQL and SonarSource documentation and MITRE’s CWE.
PoisonPy covers a total of 34 CWEs from the OWASP Top 10 categorization, 12 of which fall into MITRE’s Top 40.
In the paper, we used the dataset to assess the susceptibility of three AI code generators (Seq2Seq, CodeBERT, CodeT5+) to our targeted data poisoning attack.
ROSPaCe
- Authors: Tommaso Puccetti, Simone Nardi, Cosimo Cinquilli, Tommaso Zoppi, Andrea Ceccarelli
- Date: 01 May 2024
- Paper: ROSPaCe: Intrusion Detection Dataset for a ROS2-Based Cyber-Physical System and IoT Networks
- Published at: Scientific Data, Vol. 11.1, Article no. 481
ROSPaCe is a dataset for intrusion detection built by performing penetration testing on SPaCe, an embedded cyber-physical system based on Robot Operating System 2 (ROS2). Features are monitored from three architectural layers: the Linux operating system, the network, and the ROS2 services. The attacks consist of discovery and DoS attacks, for a total of 6 attacks, 3 of which are specific to ROS2. We collect data from the network interfaces, the operating system, and ROS2, and we merge the observations into a single dataset using timestamps. Each data point is labeled to indicate whether it was recorded during normal (attack-free) operation or while the system was under attack.
The dataset is organized as a time series that alternates sequences of normal (attack-free) operation with sequences in which attacks are carried out on top of the normal operations. The goal of this strategy is to reproduce multiple scenarios of an attacker trying to penetrate the system. Notably, this allows measuring the time to detect an attacker and the number of malicious activities performed before detection; it also allows training an intrusion detector to minimize both, by taking advantage of the numerous alternating periods of normal and attack operation.
The final version of ROSPaCe includes 30,247,050 data points and 482 columns excluding the label. The features are 25 from the Linux operating system, 5 from the ROS2 services, and 422 from the network. The dataset is encoded in the complete_dataset.csv file, for a total of 40.5 GB. It contains about 23 million attack data points and over 6.5 million normal data points (78% attacks, 22% normal). We also provide a lightweight version of ROSPaCe by selecting the 60 best-performing features: the 30 features from the Linux operating system and the ROS2 services, plus the 30 best-performing features from the network.
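Given the file size, the CSV is best processed in chunks. A minimal sketch, assuming a label column named label with 1 marking attack data points; the actual column name and encoding should be checked against the dataset documentation:

```python
import pandas as pd

attack_rows = normal_rows = 0
# Stream the 40.5 GB file in chunks instead of loading it into memory at once.
for chunk in pd.read_csv("complete_dataset.csv", chunksize=1_000_000):
    counts = chunk["label"].value_counts()
    attack_rows += int(counts.get(1, 0))   # recorded while the system is under attack
    normal_rows += int(counts.get(0, 0))   # attack-free operation
print(attack_rows, normal_rows)
```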
PowerShell Offensive Code Generation
- Authors: P. Liguori, C. Marescalco, R. Natella, V. Orbinato, L. Pianese
- Date: 01 Aug 2024
- Paper: The Power of Words: Generating PowerShell Attacks from Natural Language
- Published at: 18th USENIX WOOT Conference on Offensive Technologies (WOOT 24)
This repo provides a replication package for the paper The Power of Words: Generating PowerShell Attacks from Natural Language, presented at the 18th USENIX WOOT Conference on Offensive Technologies (WOOT 2024).
In this paper, we present an extensive evaluation of state-of-the-art NMT models in generating PowerShell offensive commands.
We also contribute a large collection of unlabeled samples of general-purpose PowerShell code, used to pre-train NMT models and refine their ability to comprehend and generate PowerShell code. We then build a manually labeled dataset of PowerShell code samples specifically crafted for security applications, each paired with a curated natural language description in English.
We use this dataset to pre-train and fine-tune:
- CodeT5+
- CodeGPT
- CodeGen
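As a minimal illustration of the fine-tuning setup (the checkpoint name, example pair, and hyperparameters below are ours, not the paper's exact configuration), a seq2seq model such as CodeT5+ can be fine-tuned on NL-to-PowerShell pairs with the transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Salesforce/codet5p-220m"          # assumed checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy NL description / PowerShell command pair (not taken from the dataset).
pairs = [("List running processes with their owners", "Get-Process -IncludeUserName")]

model.train()
for description, command in pairs:
    inputs = tokenizer(description, return_tensors="pt", truncation=True)
    labels = tokenizer(command, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss   # teacher-forced seq2seq loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```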
We also evaluate the models with:
- Static Analysis, in which the generated code is assessed to ensure that it adheres to PowerShell programming conventions (see the sketch after this list)
- Execution Analysis, which evaluates the capability of the generated offensive PowerShell code to execute malicious actions
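As a stand-in for the static analysis step (not the paper's implementation), one way to check that a generated command at least parses is to hand it to PowerShell's own language parser, assuming pwsh is available on the system:

```python
import os
import subprocess

def parses_cleanly(ps_code: str) -> bool:
    # Ask PowerShell's parser for the number of parse errors in the snippet;
    # the snippet is passed via an environment variable to avoid quoting issues.
    probe = (
        "$tokens = $null; $errors = $null; "
        "[System.Management.Automation.Language.Parser]::ParseInput("
        "$env:PS_SNIPPET, [ref]$tokens, [ref]$errors) | Out-Null; "
        "exit $errors.Count"
    )
    result = subprocess.run(
        ["pwsh", "-NoProfile", "-Command", probe],
        env={**os.environ, "PS_SNIPPET": ps_code},
        capture_output=True,
    )
    return result.returncode == 0

print(parses_cleanly("Get-Process -IncludeUserName"))
```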
The project includes scripts and data to repeat the training/testing experiments and replicate evaluations.
Robustness of AI Code Generators
- Authors: C. Improta, P. Liguori, R. Natella, B. Cukic, D. Cotroneo
- Date: 01 Oct 2024
- Paper: Enhancing robustness of AI offensive code generators via data augmentation
- Published at: Empirical Software Engineering
This repository contains the code, the dataset and the experimental results related to the paper Enhancing Robustness of AI Offensive Code Generators via Data Augmentation.
The paper presents a data augmentation method to perturb the natural language (NL) code descriptions used to prompt AI-based code generators and automatically generate offensive code. This method is used to create new code descriptions that are semantically equivalent to the original ones, and then to assess the robustness of 3 state-of-the-art code generators against unseen inputs. Finally, the perturbation method is used to perform data augmentation, i.e., increase the diversity of the NL descriptions in the training data, to enhance the models’ performance against both perturbed and non-perturbed inputs.
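For illustration only, a toy version of such perturbations (word substitution and word omission) applied to an NL description could look like the following; the substitution table and probability are placeholders, not the paper's perturbation sets:

```python
import random

# Toy substitution table; the paper's perturbation sets are different.
SUBSTITUTIONS = {"copy": "duplicate", "register": "reg", "value": "content"}

def perturb(description: str, p_omit: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    # Word substitution: replace known words with a semantically close variant.
    words = [SUBSTITUTIONS.get(w.lower(), w) for w in description.split()]
    # Word omission: drop each word with probability p_omit, keeping at least one word.
    kept = [w for w in words if rng.random() >= p_omit]
    return " ".join(kept or words[:1])

print(perturb("copy the value of the eax register into ebx"))
```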
This repository contains:
- Extended Shellcode IA32, the assembly dataset used for the experiments, which we developed by extending the publicly available Shellcode IA32 dataset for automatically generating shellcodes from NL descriptions. This extended version contains 5,900 unique pairs of assembly code snippets/English intents, including 1,374 intents (~23% of the dataset) that generate multiple lines of assembly code (e.g., whole functions).
- The source code to replicate the injection of perturbations by performing word substitutions or word omissions on the NL code descriptions (code folder). This folder also contains a README.md file detailing how to set up the project, how to change the dataset if needed, and how to run the code.
- The results we obtained by feeding the perturbed code descriptions to the AI models, i.e., Seq2Seq, CodeBERT and CodeT5+ (paper results folder). This folder also contains the evaluation of the models’ performance on single-line vs. multi-line code snippets and the results of a survey we conducted to manually assess the semantic equivalence of perturbed NL descriptions to their original counterparts.