Cybersecurity that thinks
Until recently, using the terms “data science” and “cybersecurity” in the same sentence would have seemed odd. Cybersecurity solutions have traditionally been based on signatures – relying on matches to patterns identified with previously identified malware to capture attacks in real time. In this context, the use of advanced analytical techniques, big data and all the traditional components that have become representative of “data science” have not been at the center of cybersecurity solutions focused on identification and prevention of cyber attacks.
LEARN MORE
This is not surprising. In a signature-based solution, any given malware or new flavor of it needs to be identified, sometimes reverse-engineered and have a matching signature deployed in an update of the product in order to be “detectable.” For this reason, signature-based solutions are not able to prevent zero-day attacks and provide very limited benefit compared to the predictive power offered by data science.
Among the many definitions of data science that have emerged in the last few years, “gaining knowledge from data using a scientific approach” best captures some of the different components that characterize it.
An unprecedented number of companies that have reported breaches in 2014; evidence that existing cybersecurity solutions are not effective at identifying malware or detecting attackers inside an organization’s network.
Three technological advances enable data science to deliver new innovative cybersecurity solutions:
Storage – the ease of collecting and storing large amount of data on which analytics techniques can be applied (distributed systems as cluster deployments).
Computing – the prompt availability of large computing power allows easy use of sophisticated machine learning techniques to build models for malware identification.
Behavior – the fundamental transition from identifying malware with signatures to identifying the particular behaviors an infected computer will exhibit.
Let's discuss more in depth how each of the items above can be used for a rigorous application of data science techniques to solve today's cybersecurity problems.
Having a large amount of data is of paramount importance in building analytical models that identify cyber attacks. For either a heuristic or refined model based on machine learning, large numbers of data samples need to be analyzed to identify the relevant set of characteristics and aspects that will be part of the model – this is usually referred to as “feature engineering”. Then data needs to be used to cross check and evaluate the performance of the model – this should be thought of as a process of training, cross validation and testing a given “machine learning” approach.
One of the reasons for the recent increase in machine learning’s popularity is the prompt availability of large computing resources: Moore’s law holds that the processing power and storage capacity of computer chips double approximately every 24 months.
These advances have enabled the introduction of many off-the-shelf ‘machine learning’ packages that allow training and testing of machine learning algorithms of increasing complexity on large data samples. These two factors make the use of machine learning practical for use in cybersecurity solutions.
There is a distinction between data science and machine learning, and we will discuss in a dedicated post how machine learning can be used in cybersecurity solutions, and how it fits into the more generic solution of applying data science in malware identification and attack detection.
The fundamental transition from signatures to behavior for malware identification is the most important enabler of applying data science to cybersecurity. Intrusion Prevention System (IPS) and Next-generation Firewall (NGFW) perimeter security solutions inspect network traffic for matches with a signature that has been created in response to analysis of specific malware samples. Minor changes to malware reduce the IPS and NGFW efficacy. However, machines infected with malware can be identified through the observation of their abnormal, post-infection, behavior. Identifying abnormal behavior requires primarily the capability of first identifying what's normal and the use rigorous analytical methods – data science – to identify anomalies.
http://www.computerworld.com/article/2881551/creating-cyber-security-that-thinks.html?phint=newt=computerworld_security&phint=idg_eid=2bb689d07643a520469baa93e05ca014#tk.CTWNLE_nlt_security_2015-02-23