
Data Science: The Engine to Power Next-Generation Cybersecurity
The content of this blog post has been published as a white paper on
the official ThreatTrack website. Please go ahead and download a PDF copy.
The high-profile data breaches of recent years paint a grim picture of our
cybersecurity reality. It is well recognized that conventional Security Information
and Event Management (SIEM) solutions and other legacy security products can't
keep up with today's quickly evolving offensive technologies, creating the need
for a new generation of cybersecurity. At the core of that next generation of
cyber-defense technologies will be ensuring that all cybersecurity stakeholders
have greater situational awareness about their network and their adversaries.
Cyber situational awareness is by no means a new concept. In the 2009 book
Cyber Situational Awareness: Issues and Research, the authors define at least
seven awareness aspects that cyber-defenders should have. They are:
1. Awareness of the current situation
2. Awareness of the impact of an attack
3. Awareness of how situations evolve
4. Awareness of actors' behavior
5. Awareness of why and how the current situation is caused
6. Awareness of the quality of discoveries learned from the collected information
7. Awareness of plausible outcomes of the current situation

The authors further generalize the view of cyber situational awareness into
three phases:
1. Situation recognition (including awareness aspects one and six)
2. Situation comprehension (including awareness aspects two, four and five)
3. Situation projection (including awareness aspects three and seven)
Gaining that awareness with legacy technologies is difficult. For large-scale
networks, there is an enormous number of nodes with complex branches
channeling gigantic amounts of data flow, and network environments and
application platforms are also usually heterogeneous. Meanwhile, the
cybercriminal industry is becoming increasingly service-based. The
crime-as-a-service business model has fueled the trend for higher integration and
automation in cyber-attacks, creating broader, easier access to advanced
malware powering sophisticated, multi-stage cyber-attack campaigns.
How data science boosts situational awareness
Big data is not new to the cybersphere, but it remains misunderstood by many.
Big data is a reality when it comes to securing cyberspace, but not a solution
unto itself. The Google search results for "cybersecurity big data" are mostly
misleading because many of the authors of the articles confuse data science
with big data. Consider that billions of users are on the Internet every day
browsing social media websites, conducting business transactions, transferring
pictures and files, or searching information. Big data focuses on collecting and
managing large amounts of data both in quantity and variety, while data science
aims to extract the underlying knowledge and insights from the data.
Phishing is also how a user at IT security company RSA was tricked into opening
a document with malicious code. In a well-crafted advanced persistent threat
(APT), a group of RSA employees was sent a spreadsheet called "2011 Recruitment
Plan" twice in two days. One employee took the bait, even retrieving it from the
email junk folder. A zero-day exploit in the document swiped data from the
company. As sophisticated as they are, APTs still require something as simple as
a mouse click to get going. And even a security company can fall prey to
phishing because of people's propensity for clicking items in their email.
With advances in big data and data science technologies over the past decade,
full situational awareness in cybersecurity is becoming possible. On top of the
collected raw data, data science is constructing the awareness for detecting
undiscovered malicious actions, and thereby becoming the engine to power
next-generation cybersecurity solutions. Data science involves a wide range of
techniques to gain cyber situational awareness. Those techniques include data
fusion, data mining, feature engineering, predictive learning and visualization.
Data Fusion
Situational awareness is feasible only when there is a comprehensive data set
that describes the situation. The data may come from all kinds of devices on the
network, such as routers, servers, workstations and mobile devices, or from the log
files on a computer that record every activity that happens on that machine.
This raw data is often unstructured, unorganized and uncleaned. Pre-processing
is required to structure and normalize the data in a meaningful way for further
processing. Data from multiple sources needs to be correlated and merged.
Fused data provides more realistic and less biased information about the entities
being monitored. The data fusion process can be viewed at three levels: data
level, feature level and decision level. At the data level, raw data is enriched
by meaningfully connecting it to other data. For example, an IP address would
carry more information with whois queries and geo-locations. At the feature
level, the fusion of features provides a more comprehensive description of an
entity. At the decision level, insights are weighed across a set of tools. A
well-known example is VirusTotal, which provides scanning reports from multiple
antivirus agents.
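The data-level and decision-level fusion steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `GEO_TABLE` lookup, the event field names and the engine names are hypothetical stand-ins for real whois or geolocation services and real scanner reports, and the simple vote-share rule is one possible way to weigh verdicts, not how any particular product does it.

```python
# Hypothetical enrichment table standing in for whois / geolocation lookups.
GEO_TABLE = {
    "203.0.113.7": {"country": "NL", "asn": "AS64500"},
}


def enrich_event(event, geo_table):
    """Data-level fusion: attach extra context to a raw event, keyed by source IP."""
    context = geo_table.get(event["src_ip"], {})
    # Unknown IPs are returned unchanged rather than dropped.
    return {**event, **context}


def fuse_verdicts(verdicts, threshold=0.5):
    """Decision-level fusion: flag a sample when the share of engines
    that report it as malicious exceeds the threshold."""
    if not verdicts:
        return False, 0.0
    share = sum(1 for flagged in verdicts.values() if flagged) / len(verdicts)
    return share > threshold, share


if __name__ == "__main__":
    raw = {"src_ip": "203.0.113.7", "bytes": 512}
    print(enrich_event(raw, GEO_TABLE))
    # Engine names are made up for the example.
    print(fuse_verdicts({"engine_a": True, "engine_b": True, "engine_c": False}))
```

Keeping the share alongside the boolean verdict lets a downstream step apply a stricter or looser threshold without rescanning.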
Data Mining
The data fusion process generates a plethora of data stored in data servers. Data
mining is the next step: separating signal from noise and identifying useful
information to discover knowledge from the data. But that's easier said than
done. The data to be processed is typically high-dimensional, incomplete (though
ironically large), noisy, fuzzy and random. Rigorous and robust mathematical
approaches are needed to discover latent, regular and previously unknown (yet
useful and interpretable) insights.
There are generally two types of data mining: descriptive and predictive.
Descriptive mining uses statistics and visualization to describe the observed data
at an aggregated level. MapReduce is a powerful tool for generating descriptive
statistics about a big data set. In real-time processing of streaming data,
sketching is often used to obtain an approximation of the statistics.
The derived statistics can then be used for feature selection, clustering and
visualization. Predictive mining involves constructing clustering and/or
classification models for the purpose of understanding new data points by
clustering them into a known group or labeling them as a known category.
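To make the idea of sketching for streaming statistics concrete, here is a minimal Count-Min sketch, one common sketching structure. The choice of Count-Min, the class name and the default `width` and `depth` are assumptions made for illustration; the text above does not name a specific sketch.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter for a stream, using fixed memory.

    Estimates never undercount; hash collisions can only inflate them.
    """

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The smallest counter across rows is the least contaminated by collisions.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for ip in ["10.0.0.5", "10.0.0.5", "10.0.0.9"]:
        sketch.add(ip)
    print(sketch.estimate("10.0.0.5"))
```

A wider table lowers the collision rate and a deeper one lowers the chance that every row collides, so both parameters trade memory for accuracy.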
Feature Engineering
Strictly speaking, feature engineering is a part of data mining. It is a process to
extract, enrich, aggregate and select significant features to represent the original
high-dimensional data set with as little information loss as possible. It is worth
singling out feature engineering because it may provide key indicators for cyber
situational awareness. Quantitative evaluation of relevance or dependency
between any two features is a must. Statistical methods like linear correlation,
Pearson's chi-squared test or Fisher's exact test are widely used. Mutual
information from information theory is also a powerful tool to compute the
relevance of two variables.
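As a small illustration of the last point, the mutual information between two discrete features can be estimated directly from observed counts. The function name and the toy feature values are assumptions for the example; continuous features would need binning first.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equally long sequences
    of discrete values, estimated from empirical frequencies."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), joint in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written in terms of counts.
        mi += (joint / n) * math.log2(joint * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical features: protocol seen and whether the session was flagged.
    protocol = ["tcp", "tcp", "udp", "udp"]
    flagged = [1, 1, 0, 0]
    print(mutual_information(protocol, flagged))
```

A value near zero suggests the two features are close to independent, while a value approaching the entropy of either feature suggests one largely determines the other.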
The ultimate goal of feature engineering is to reduce the dimensionality of the
raw data to a manageable level. There are three approaches to take: 1) combine
existing features, 2) select important individual features and 3) do both
(first combine, then select). The methods for combining features
include Singular Value Decomposition (SVD), Principal Component Analysis (PCA),
which applies the same technique as SVD but to the covariance matrix,
t-distributed Stochastic Neighbor Embedding (t-SNE), self-organizing maps and
others. The methods for selecting a subset of features include forward selection,
backward selection, Minimum Redundancy Maximum Relevance (mRMR), feature
clustering based on feature relevance, etc.
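The "combine existing features" approach can be sketched with a bare-bones PCA. To stay dependency-free, this sketch finds only the leading principal component, and it does so by power iteration on the covariance matrix rather than by a full SVD; that substitution, the function names and the toy data are all assumptions for illustration. In practice a numerical library would normally be used.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix and column means of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, means


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix, found by power iteration."""
    cov, means = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate data with no variance along v
        v = [x / norm for x in w]
    return v, means


def project(rows, component, means):
    """Reduce each row to a single combined feature along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]


if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
    component, means = first_principal_component(data)
    print(component, project(data, component, means))
```

The projection replaces two correlated columns with one combined feature, which is the dimensionality reduction the paragraph describes.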
Predictive Learning
Predictive learning is an important process for gaining awareness and providing
predictive alerts against potential attacks. The three generalized views of
situational awareness provide a useful frame for applying predictive learning. The first is
situation recognition, which provides information about the status, attributes and
dynamics of relevant elements within the studied network. Those elements may
include routers, DNS servers, FTP servers, mail servers, database servers, user
computers, mobile devices, domains/IPs being visited, file attachments, etc.
The second view is situation comprehension, which emphasizes the
understanding of network structure and critical assets, communication patterns,
user behavioral profiles, data and control flows, and so on. A variety of predictive
learning processes need to be conducted to generate situation comprehension.
For example, clustering is a common technique to group unlabeled entities
together based on their features. These entities might be files, IPs, services,
users, etc. In the cybersecurity realm, labeled data is rare. However, if it exists,
classification methods can be used to categorize files into malware families,
URLs into benign or malicious domains, and network traffic into normal or
abnormal patterns.
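The clustering step mentioned above can be sketched with k-means (Lloyd's algorithm). The text does not say which clustering method is meant, so k-means is an assumption here, as are the two toy features per entity (for instance, scaled traffic volume and number of distinct destinations). Centroids are seeded from the first `k` points to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters with Lloyd's algorithm.

    Centroids are initialised from the first k points so runs are repeatable.
    Returns (assignments, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


if __name__ == "__main__":
    # Hypothetical, already-scaled feature vectors for five devices.
    devices = [(1.0, 1.0), (1.2, 0.9), (9.0, 9.0), (9.1, 8.8), (0.9, 1.1)]
    print(kmeans(devices, k=2))
```

Because no labels are needed, the same routine applies to files, IPs, services or users once each is described by a numeric feature vector.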
Association rule learning is another important application of data science used to
gain situation comprehension. Although commonly used for identifying related
shopping items for decisions about marketing activities, association
rule learning is also very valuable for identifying network users' related behaviors,
relevant domain names, associated malware files, etc. Finally, probabilistic
graphical models are powerful tools to infer the inherent relationships within a set
of random variables. The random variables might be, for instance, date and time,
failed TCP connections, network traffic volume, and the pair of source IP and
destination IP. A cybersecurity company would use probabilistic graphical
modeling to learn the likelihood that observed traffic is abnormal. Well-known
graphical models include Bayesian networks, Markov chains and hidden Markov
models.
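The two basic measures behind association rule learning, support and confidence, are easy to compute by hand. The sketch below treats each host's set of visited domains as one "transaction"; the domain names and function names are made up for illustration, and a real system would use a dedicated algorithm such as Apriori to search the rule space rather than score rules one at a time.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


if __name__ == "__main__":
    # Hypothetical per-host sets of domains contacted in one day.
    sessions = [
        {"cdn.example", "update.example"},
        {"cdn.example", "update.example", "bad.example"},
        {"cdn.example"},
        {"bad.example"},
    ]
    print(support(sessions, {"cdn.example", "update.example"}))
    print(confidence(sessions, {"cdn.example"}, {"update.example"}))
```

A rule with high confidence but very low support rests on few observations, so both numbers are usually thresholded together.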
Learned knowledge would be of no use if it doesn't lead to decisions and actions.
Hence, the third view of situational awareness addresses the projection of the
situation into the future. The insights from the first two views of situational
awareness have provided us with a good knowledge base about what are considered
normal profiles and behaviors, which assets on a network are critical and what is
likely to happen given a set of observations, with quantitative assessment.
Situation projection involves evaluating the risks that the network is exposed
to, such as vulnerabilities in hosts, a particular IP's frequent visits to
malicious URLs, unusual listening ports and so on. The risk evaluation should be
conducted both periodically and in real time. Furthermore, situation projection
extends from data science into decision science by constructing an influence
diagram to represent the decision options and uncertainties related to mitigating
the calculated risks in the network environment.

Data Science in Action: Powering Security Solutions
ThreatTrack Security develops advanced cybersecurity solutions that expose,
analyze and eliminate the latest malicious threats. The product line of
ThreatTrack advanced threat defense includes ThreatSecure Network,
ThreatSecure Email, ThreatAnalyzer and ThreatIQ. ThreatSecure Network collects
network packet data at various vantage points on a network, enriches and
processes the data in real-time, and then stores the data in a private data center.
Depending on the size of the enterprise network, the volume of collected traffic
data may reach terabytes per day.
The streaming data is then aggregated over multiple facets, such as time, IP,
service types and protocols. The ThreatSecure Network web-based user
interface provides rich information for users to gain situation recognition. Any
exposure of a network device to known malicious IPs or malicious files is
detected in real time, and statistics over time are displayed on the Thression
(detailed threat sessions) dashboard. In addition, ThreatSecure Network can
show the traffic time series with a variety of filters that can be applied and can
display a diagram of the enterprise network showing the connections among
devices, the criticality of devices, and the services that were provided or
consumed in each connection.
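Aggregating streaming records over several facets, as described above, amounts to grouping by a composite key. The sketch below is a generic illustration, not ThreatSecure Network's implementation: the flow record fields (`ts`, `src_ip`, `service`, `protocol`, `bytes`) and the one-hour bucket are hypothetical.

```python
from collections import defaultdict


def aggregate_flows(flows, bucket_seconds=3600):
    """Sum bytes per (time bucket, source IP, service, protocol).

    Each flow is a dict with a Unix timestamp under "ts"; the bucket is the
    timestamp rounded down to a multiple of bucket_seconds.
    """
    totals = defaultdict(int)
    for flow in flows:
        bucket = int(flow["ts"] // bucket_seconds) * bucket_seconds
        key = (bucket, flow["src_ip"], flow["service"], flow["protocol"])
        totals[key] += flow["bytes"]
    return dict(totals)


if __name__ == "__main__":
    sample = [
        {"ts": 10, "src_ip": "10.0.0.5", "service": "dns", "protocol": "udp", "bytes": 100},
        {"ts": 20, "src_ip": "10.0.0.5", "service": "dns", "protocol": "udp", "bytes": 50},
        {"ts": 3700, "src_ip": "10.0.0.5", "service": "dns", "protocol": "udp", "bytes": 7},
    ]
    for key, total in sorted(aggregate_flows(sample).items()):
        print(key, total)
```

The per-bucket totals form the traffic time series, and dropping elements from the key gives coarser views such as bytes per IP.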
Next, we address situation comprehension. In this phase, value-added insights
are generated from the enriched data. We build a comprehensive and dynamic
user profile for each device on the network. The profile covers attributes like
properties (e.g., IP address, host name, MAC address), actions (volume and
frequency of network usage), interests (e.g., services being consumed) and
social relationships (pairs of source and destination IPs). Using the features
generated from the user profile, ThreatSecure Network clusters the devices on
the network based on their behavioral similarities. This device clustering provides
a valuable summary of the network structure and acts as a foundation for
detecting abnormal behavior. Identified threats can be classified into families
associated with more detailed information. The enriched network traffic data from
the situation recognition phase also enables the modeling of the control flow
among the network devices. For example, IP A uses SSH to connect to IP B, which
in turn accesses IP C to retrieve data. A Markov chain can be trained to model
the probabilistic flow of control.
Knowing what has happened is not enough; ThreatTrack must use this knowledge to
predict the future. In the situation projection phase, ThreatSecure Network
applies the learned models to new information in order to detect network
anomalies and uses Bayesian networks to connect the dots and infer the likelihood
of each specific risk, such as virus infiltration, spyware, keylogging, DDoS, etc.
This leads to the risk evaluation of the enterprise network security. Action
items can be identified to isolate assets at risk, escalate security
authentication or patch software vulnerabilities.
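The Markov chain over control flow mentioned above (IP A to IP B to IP C) can be sketched as follows. This is an illustrative first-order chain with add-alpha smoothing, not the product's actual model; the state names, the smoothing choice and the function names are assumptions.

```python
import math
from collections import defaultdict


def train_markov_chain(sequences, states, alpha=1.0):
    """Estimate transition probabilities from observed hop sequences.

    alpha > 0 is add-alpha (Laplace) smoothing, so transitions never seen in
    training keep a small non-zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    model = {}
    for s in states:
        total = sum(counts[s].values()) + alpha * len(states)
        model[s] = {t: (counts[s][t] + alpha) / total for t in states}
    return model


def log_likelihood(model, seq):
    """Log-probability of the transitions in seq; lower means more unusual."""
    return sum(math.log(model[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))


if __name__ == "__main__":
    history = [["A", "B", "C"]] * 3  # hypothetical observed control flows
    model = train_markov_chain(history, states=["A", "B", "C"])
    print(log_likelihood(model, ["A", "B", "C"]))
    print(log_likelihood(model, ["A", "C", "B"]))
```

Comparing the score of a new sequence against scores of historical ones gives a simple anomaly signal: a flow that skips the usual intermediate hop scores noticeably lower.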
Conclusion
This is the ideal time to marry data science with cybersecurity practices.
Cyberspace is constantly growing, producing the big data used to fuel data
science, the engine that powers the security solutions used to fight
cyberattackers. Without the application of data science, big data is no better
than an inert pile of coal. Data science is a rich discipline that can be used to
tackle all kinds of cybersecurity challenges, many of which are hard to resolve
using traditional methods (such as signature matching, rule systems, etc.). A
cybersecurity solution with data science at its core can fully leverage what IBM
dubbed "The Four Vs of Big Data" (Volume, Velocity, Variety and Veracity), and
translate that data into the full spectrum of cyber situational awareness. The
next generation of cybersecurity solutions must rely on situational awareness to
proactively defend against threats. The application of data science to
cybersecurity is still relatively new and there are still issues associated with
integrating data science into production. However, there is no doubt that in the
next few years, only those security solutions that are powered by data science
can stand up to the challenge of ever-worsening ferocious attacks and eventually
triumph.
References:
1. Lancope, "Don't stretch SIEM beyond its capabilities for contextual security analytics," March 4, 2015: https://www.lancope.com/blog/dontstretch-siembeyond-its-capabilities-contextual-security-analytics
2. Network World, "The two cornerstones of next-generation cybersecurity (part 1)," June 13, 2014: http://www.networkworld.com/article/2363311/ciscosubnet/the-two-cornerstones-of-next-generation-cybersecurity-part-1.html
3. CSO, "Next-generation cyber security for the future," Aug. 19, 2015: http://www.cso.com.au/article/582348/next-generation-cyber-securityfuture/
4. Forbes, "Envisioning the next generation cybersecurity professional," Nov. 5, 2015: http://www.forbes.com/sites/centurylink/2015/11/05/envisioning-the-nextgeneration-cybersecurity-professional/
5. SC Magazine, "Cybercrime-as-a-service the new criminal business model," Sept. 29, 2014: http://www.scmagazineuk.com/cybercrime-as-aservice-the-newcriminal-business-model/article/374124/
6. KDnuggets, "Data science and big data: Two very different beasts," July 2015: http://www.kdnuggets.com/2015/07/data-science-big-datadifferentbeasts.html
7. IBM Big Data & Analytics Hub, "The four Vs of big data": http://www.ibmbigdatahub.com/infographic/four-vs-big-data
8. Gigaom, "Notice to startups: You are doing data science wrong," Sept. 28, 2013: https://gigaom.com/2013/09/28/notice-to-startups-you-aredoing-datascience-wrong/

Source URL : https://www.linkedin.com/pulse/data-science-engine-power-nextgeneration-richard-xie
