Talk 1: Evaluation of Chatbots for Cybersecurity NLP Tasks
Abstract: In the rapidly advancing field of cybersecurity, sharing knowledge about emerging threats is crucial and forms the foundation of Cyber Threat Intelligence (CTI). In this context, Large Language Models (LLMs) have become vital, revealing a wide range of possibilities in cybersecurity and transforming the field. This study explores the ability of LLM-based chatbots, such as ChatGPT, GPT4all, Dolly, Stanford Alpaca, QLora, and Falcon, to identify cybersecurity-related text within contextual Open Source Intelligence (OSINT). We assess how well existing chatbots spot cyber-centric contexts and their capacity for Zero-Shot Learning (ZSL) in Natural Language Processing (NLP) tasks. We consider two NLP tasks, binary classification and Named Entity Recognition (NER), and evaluate them on a Twitter dataset. The evaluation is especially relevant when no model has been trained specifically for these tasks. The study shows that these chatbots adapt well only to specific NLP tasks, such as cybersecurity binary classification, and highlights the need for further refinement in others, such as NER.
Talk 2: Large-Scale Distributed Similarity Search with Locality-Sensitive Hashing
Abstract: Locality-sensitive hashing (LSH) is an efficient hash-based solution for the Approximate Nearest Neighbor (ANN) problem in large-scale, high-dimensional environments. In this talk, I will present a distributed LSH approach proposed in my MSc thesis, which focuses on balancing storage resources across the system’s nodes with low storage overhead and without compromising LSH’s search efficiency. The proposed solution uses erasure codes to store object information directly in the LSH index nodes, removing the need for additional storage structures while providing fault tolerance. We also include an optimization that improves the solution’s query performance by directly comparing LSH hash values instead of whole object values. The implemented open-source framework can adapt to diverse distributed LSH approaches, using several LSH hashing algorithms and storage backends. Our evaluations indicate that the proposed model not only balances storage resources across the system’s nodes but also scales well with the size of the objects, and that its optimized variant significantly increases the solution’s performance at the cost of some service quality.
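For readers unfamiliar with LSH, the idea of comparing hash values instead of whole objects can be illustrated with a minimal sketch. It assumes a random-hyperplane (SimHash-style) LSH family for cosine similarity, which is only one common choice and not necessarily the one used in the thesis; all function names are hypothetical, and the distributed storage and erasure-coding aspects of the talk are not covered.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    )


def hamming_similarity(sig_a, sig_b):
    """Fraction of matching bits; a cheap proxy for how similar the objects are."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def cosine_similarity(u, v):
    """Exact similarity on the full vectors, shown only for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=64, dim=3)
    a = [1.0, 2.0, 3.0]
    b = [1.1, 1.9, 3.2]    # close to a
    c = [-3.0, 0.5, -1.0]  # far from a
    for name, other in (("b", b), ("c", c)):
        approx = hamming_similarity(signature(a, planes), signature(other, planes))
        exact = cosine_similarity(a, other)
        print(f"a vs {name}: signature match = {approx:.2f}, cosine = {exact:.2f}")
```

Comparing the short bit signatures avoids touching the full vectors at query time, which is the kind of trade-off between speed and result quality that the abstract describes.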
Samaneh Shafee is a PhD student at the Department of Informatics of the Faculty of Sciences of the University of Lisbon. Her interests are cybersecurity in general and cybersecurity language models to support machine learning-based cyberthreat intelligence. She is working under the supervision of Professors Pedro M. Ferreira and Alysson Bessani.
João Queimado is a computer science and engineering master’s student at FCUL. He is currently working on his thesis research under the supervision of Professors Vinicius Vielmo Cogo and Ibéria Vitória de Sousa Medeiros.