Cybersecurity
The content of this blog post has been published as a white paper on the official ThreatTrack website. Please go ahead and download a PDF copy.
The high-profile data breaches of recent years paint a grim picture of our
cybersecurity reality. It is well recognized that conventional Security Information
and Event Management (SIEM) solutions and other legacy security products can't
keep up with today's quickly evolving offensive technologies, creating the need
for a new generation of cybersecurity. At the core of that next generation of
cyber-defense technologies will be ensuring that all cybersecurity stakeholders
have greater situational awareness about their network and their adversaries.
Cyber situational awareness is by no means a new concept. In the 2009 book
Cyber Situational Awareness: Issues and Research, the authors define at least
seven awareness aspects that cyber-defenders should have. They are:
1. Awareness of the current situation
2. Awareness of the impact of the attack
3. Awareness of how situations evolve
4. Awareness of adversary behavior
5. Awareness of why and how the current situation is caused
6. Awareness of the quality and trustworthiness of the collected situation-awareness information
7. Awareness of plausible futures of the current situation
The authors further generalize the view of cyber situational awareness into
three phases:
1. Situation recognition (including awareness aspects one and six)
2. Situation comprehension (including awareness aspects two, four and five)
3. Situation projection (including awareness aspects three and seven)
Gaining that awareness with legacy technologies is difficult. For large-scale
networks, there is an enormous number of nodes with complex branches
channeling gigantic amounts of data flow, and network environments and
application platforms are also usually heterogeneous. Meanwhile, the
cybercriminal industry is becoming increasingly service-based. The crime-as-a-service business model has fueled the trend toward higher integration and
automation in cyber-attacks, creating broader, easier access to advanced
malware and powering sophisticated, multi-stage cyber-attack campaigns.
How data science boosts situational awareness
Big data is not new to the cybersphere, but it remains misunderstood by many.
Big data is a reality when it comes to securing cyberspace, but not a solution
unto itself. Google search results for "cybersecurity big data" are mostly
misleading because many of the articles' authors confuse data science with big
data. Consider that billions of users are on the Internet every day browsing
social media websites, conducting business transactions, transferring pictures
and files, or searching for information. Big data focuses on collecting and
managing large amounts of data, in both quantity and variety, while data science
aims to extract the underlying knowledge and insights from that data.
Phishing is how a user at IT security company RSA was tricked into opening
a document with malicious code. In a well-crafted APT, a group of RSA employees
were sent a spreadsheet called "2011 Recruitment Plan" twice in two days. One
employee took the bait, even retrieving it from the email junk folder. A zero-day
exploit in the document swiped data from the company. As sophisticated as they
are, APTs still require something as simple as a mouse click to get going. And
even a security company can fall prey to phishing because of people's propensity
for clicking items in their email.
With advances in big data and data science technologies over the past decade,
full situational awareness in cybersecurity is becoming possible. On top of the
collected raw data, data science constructs the awareness needed to detect
undiscovered malicious actions, thereby becoming the engine that powers
next-generation cybersecurity solutions. Data science involves a wide range of
techniques to gain cyber situational awareness. Those techniques include data
fusion, data mining, feature engineering, predictive learning and visualization.
Data Fusion
Situational awareness is feasible only when there is a comprehensive data set
that describes the situation. The data may come from all kinds of devices on the
network, such as routers, servers, workstations and mobile devices, or from the
log files on a computer that record every activity happening on that machine.
This raw data is often unstructured, unorganized and uncleaned. Pre-processing
is required to structure and normalize the data in a meaningful way for further
processing. Data from multiple sources needs to be correlated and merged.
Fused data provides more realistic and less biased information about the entities
being monitored. The data fusion process can be viewed at three levels: the data
level, the feature level and the decision level. At the data level, raw data is
enriched by meaningfully connecting it to other data. For example, an IP address
would carry more information with whois queries and geo-locations attached. At
the feature level, the fusion of features provides a more comprehensive
description of an entity. At the decision level, insights are weighed across a
set of tools. A well-known example is VirusTotal, which provides scanning
reports from multiple antivirus agents.
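As a minimal sketch of data-level fusion, the snippet below correlates raw connection events by IP and attaches enrichment attributes. All IPs, lookup tables and field names are invented stand-ins; a real pipeline would query a GeoIP database and a whois service instead of in-memory dictionaries.

```python
# Hypothetical raw connection events (illustrative data only).
raw_events = [
    {"ip": "203.0.113.7", "bytes": 5120},
    {"ip": "203.0.113.7", "bytes": 880},
    {"ip": "198.51.100.2", "bytes": 64},
]

# Stand-ins for external enrichment sources (GeoIP, whois).
geo_lookup = {"203.0.113.7": "NL", "198.51.100.2": "US"}
whois_lookup = {"203.0.113.7": "ExampleHost BV", "198.51.100.2": "TestNet Inc"}

def fuse(events):
    """Correlate events by IP and merge them into enriched per-entity records."""
    fused = {}
    for ev in events:
        rec = fused.setdefault(ev["ip"], {
            "ip": ev["ip"],
            "total_bytes": 0,
            "event_count": 0,
            "country": geo_lookup.get(ev["ip"], "unknown"),
            "registrant": whois_lookup.get(ev["ip"], "unknown"),
        })
        rec["total_bytes"] += ev["bytes"]
        rec["event_count"] += 1
    return fused

profiles = fuse(raw_events)
print(profiles["203.0.113.7"])
# {'ip': '203.0.113.7', 'total_bytes': 6000, 'event_count': 2,
#  'country': 'NL', 'registrant': 'ExampleHost BV'}
```

The same pattern extends to any join key (domain, user, file hash): correlate first, then enrich, so downstream mining sees one coherent record per entity.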
Data Mining
The data fusion process generates a plethora of data stored in data servers. Data
mining is the next step: distinguishing signal from noise and identifying useful
information to discover knowledge in the data. But that's easier said than
done. The data to be processed is always high-dimensional, incomplete (though
ironically large), noisy, fuzzy and random. Rigorous and robust mathematical
approaches are needed to discover latent, regular and previously unknown (yet
useful and interpretable) insights.
There are generally two types of data mining: descriptive and predictive.
Descriptive mining uses statistics and visualization to describe the observed data
at an aggregated level. MapReduce is a powerful tool to generate descriptive
statistics about a big data set. In real-time processing of streaming data,
sketching is often used to obtain an approximation of the statistics.
The derived statistics can then be used for feature selection, clustering and
visualization. Predictive mining involves constructing clustering and/or
classification models for the purpose of understanding new data points by
clustering them into a known group or labeling them as a known category.
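To make the sketching idea concrete, below is a toy count-min sketch, one common sketching structure for estimating item frequencies in a stream with fixed memory. The table dimensions and hashing scheme are arbitrary choices for illustration, not prescriptions from the text.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts for a data stream in fixed memory."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-ish hash per row, derived from a salted digest.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is an upper-bounded estimate of the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

sketch = CountMinSketch()
stream = ["10.0.0.5"] * 90 + ["10.0.0.9"] * 10  # illustrative source IPs
for ip in stream:
    sketch.add(ip)
print(sketch.estimate("10.0.0.5"))  # close to the true count of 90
```

The estimate never undercounts, which is a useful property when the statistic feeds a threshold-based alert.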
Feature Engineering
Strictly speaking, feature engineering is a part of data mining. It is a process to
extract, enrich, aggregate and select significant features to represent the original
high-dimensional data set with as little information loss as possible. It is worth
singling out feature engineering because it may provide key indicators for cyber
situational awareness. Quantitative evaluation of relevance or dependency
between any two features is a must. Statistical methods like linear correlation,
Pearson's chi-squared test or Fisher's exact test are widely used. Mutual
information from information theory is also a powerful tool to compute the
relevance of two variables.
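Mutual information can be computed directly from empirical frequencies. The sketch below does so from scratch; the toy feature columns are invented to contrast a fully dependent pair with an independent one.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) = sum over observed (x, y) of p(x,y) * log2(p(x,y) / (p(x)p(y)))."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Invented features: 'size' perfectly determines the label,
# while 'weekday' is independent of it.
label   = ["mal", "mal", "ben", "ben"] * 25
size    = ["big", "big", "small", "small"] * 25
weekday = ["mon", "tue", "mon", "tue"] * 25

print(round(mutual_information(size, label), 3))     # 1.0 (fully dependent)
print(round(mutual_information(weekday, label), 3))  # 0.0 (independent)
```

A score of zero flags a feature that carries no information about the other variable and is therefore a candidate for removal.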
The ultimate goal of feature engineering is to reduce the dimensionality of the
raw data to a manageable level. There are three approaches to take: 1) combine
existing features, 2) select important individual features and 3) a combination of
the first two (first combine, then select). The methods for combining features
include Singular Value Decomposition (SVD), Principal Component Analysis (PCA),
which applies the same technique as SVD but to the covariance matrix,
t-distributed Stochastic Neighbor Embedding (t-SNE), self-organizing maps and
others. The methods for selecting a subset of features include forward selection,
backward selection, Minimum Redundancy Maximum Relevance (mRMR), feature
clustering based on feature relevance, etc.
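The selection route can be sketched as a greedy forward selection with an mRMR-flavored score: reward correlation with a target, penalize redundancy with features already chosen. The toy features below are invented; `f1` carries most of the target signal, `f2` duplicates it, and `f3` carries the remaining independent component.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def forward_select(features, target, k):
    """Greedily pick k features, scoring each candidate by relevance
    to the target minus its mean redundancy with features chosen so far."""
    chosen = []
    while len(chosen) < k:
        best, best_score = None, float("-inf")
        for name, col in features.items():
            if name in chosen:
                continue
            relevance = abs(pearson(col, target))
            redundancy = (sum(abs(pearson(col, features[c])) for c in chosen)
                          / len(chosen)) if chosen else 0.0
            if relevance - redundancy > best_score:
                best, best_score = name, relevance - redundancy
        chosen.append(best)
    return chosen

target = [3, 1, -1, -3, 3, 1, -1, -3]
features = {
    "f1": [1, 1, -1, -1, 1, 1, -1, -1],   # strong signal
    "f2": [1, 1, -1, -1, 1, 1, -1, -1],   # exact duplicate of f1
    "f3": [1, -1, 1, -1, 1, -1, 1, -1],   # independent residual signal
}
print(forward_select(features, target, 2))  # ['f1', 'f3'], not the duplicate
```

The duplicate is skipped because its redundancy penalty cancels its relevance, which is exactly the behavior mRMR-style selection is after.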
Predictive Learning
Predictive learning is an important process for gaining awareness and providing
predictive alerts against potential attacks. The three generalized views of
situational awareness provide a natural framework for applying predictive
learning. The first is
situation recognition, which provides information about the status, attributes and
dynamics of relevant elements within the studied network. Those elements may
include routers, DNS servers, FTP servers, mail servers, database servers, user
computers, mobile devices, domains/IPs being visited, file attachments, etc.
The second view is situation comprehension, which emphasizes the
understanding of network structure and critical assets, communication patterns,
user behavioral profiles, data and control flows, and so on. A variety of predictive
learning processes need to be conducted to generate situation comprehension.
For example, clustering is a common technique to group unlabeled entities
together based on their features. These entities might be files, IPs, services,
users, etc. In the cybersecurity realm, labeled data is rare. However, if it exists,
classification methods can be used to categorize files into malware families,
URLs into benign or malicious domains, and network traffic into normal or
abnormal patterns.
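Clustering of unlabeled entities can be illustrated with plain k-means. The per-host features below (connection rate, outbound volume) are invented for the example; real deployments would use far richer feature vectors.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute centroids, for a fixed number of rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, clusters

# Invented per-host features: (connections per minute, kB sent per minute).
quiet  = [(2, 1), (3, 2), (2, 2), (3, 1)]
chatty = [(40, 30), (42, 28), (41, 31), (39, 29)]
centroids, clusters = kmeans(quiet + chatty, k=2)
print(sorted(len(c) for c in clusters))  # [4, 4]: two behavior groups emerge
```

Once groups like these exist, an analyst can label whole clusters at once, or treat points far from every centroid as candidates for closer inspection.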
Association rule learning is another important application of data science used to
gain situation comprehension. Although commonly used for identifying related
shopping items to inform marketing decisions, association rule learning is also
very valuable for identifying network users' related behaviors, relevant domain
names, associated malware files, etc. Finally, probabilistic graphical models are
powerful tools to infer the inherent relationships within a set of random
variables. The random variables might be, for instance, date and time, failed TCP
connections, network traffic volume, or the pair of source IP and destination IP,
and a cybersecurity company would use probabilistic graphical modeling to learn
the likelihood that observed traffic is abnormal. Well-known graphical models
include Bayesian networks, Markov chains and hidden Markov models.
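One simple instance of such probabilistic modeling is a first-order Markov chain over event sequences: learn transition probabilities from normal activity, then score new sequences by their log-likelihood. The event names and training data below are illustrative, and the floor probability for unseen transitions is an arbitrary smoothing choice.

```python
from collections import defaultdict
from math import log

def learn_transitions(sequences):
    """Estimate first-order Markov transition probabilities
    from sequences of observed events (e.g. user action logs)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a, nexts in counts.items():
        total = sum(nexts.values())
        probs[a] = {b: c / total for b, c in nexts.items()}
    return probs

def log_likelihood(seq, probs, floor=1e-6):
    """Score a sequence; unseen transitions get a small floor
    probability, so unusual paths drag the score down sharply."""
    return sum(log(probs.get(a, {}).get(b, floor))
               for a, b in zip(seq, seq[1:]))

# Invented training data: typical login sessions.
normal = ([["login", "read_mail", "logout"]] * 50 +
          [["login", "read_mail", "read_mail", "logout"]] * 50)
probs = learn_transitions(normal)

typical = log_likelihood(["login", "read_mail", "logout"], probs)
odd     = log_likelihood(["login", "dump_db", "logout"], probs)
print(typical > odd)  # True: the never-seen 'dump_db' path scores far lower
```

Thresholding such scores is one way a learned model of "normal" turns into a concrete anomaly alert.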
Learned knowledge would be of no use if it doesn't lead to decisions and actions.
Hence, the third view of situation awareness addresses the projection of the
situation into the future. The insights from the first two views of situation
awareness have provided us with a good knowledge base about what are considered
normal profiles and behaviors, the critical assets on a network and what is
likely to happen