
Writing your own Spark Plugins
===================

This document describes how to create your own Spark plugin code to tackle a new
use case. All plugin code resides in the `conf` folder of dataShark.

To write a new plugin, we need, at a bare minimum, two files:


> 1. The **.conf file**
> 2. A code **.py file**

The .conf file defines a few necessary flags specific to the plugin, and the
.py file must implement the `load` function.

The .conf File
-------------------

This is what a standard .conf file looks like:
```
name = Use Case Name
type = streaming
code = code.py
enabled = true
training = training.log
output = elasticsearch
[log_filter]
[[include]]
some_key = ^regex pattern$
[[exclude]]
some_other_key = ^regex pattern$
[elasticsearch]
host = 10.0.0.1
port = 9200
index_name = logstash-index
doc_type = docs
pkey = sourceip
score = anomaly_score
title = Name for ES Document
debug = false
```

### Required Keys


- `name` : Name for the use case.
- `type` : Either **batch** or **streaming**. In *batch* mode, the use case is run just once on the provided data file. In *streaming* mode, a Kafka stream is passed to the code file to analyze.
- `enabled` : Set this to either **true** or **false** to enable or disable this use case.
- `code` : The .py file corresponding to this use case.
- `training` : The log file supplied as training data to train your model. Required only when `type = streaming`.
- `file` : The data file to use for *batch* processing. Required only when `type = batch` (see the batch example after this list).
- `output` : The output plugin to use. The available output plugins are listed [here](#output-plugins).
- `[type_of_plugin]` : The settings for the output plugin being used.
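
For reference, a minimal *batch* use case configuration might look like the following. The file names and values here are purely illustrative, not defaults:

```
name = Example Batch Use Case
type = batch
code = example.py
enabled = true
file = data.log
output = csv
[csv]
path = ExampleBatchUseCase.csv
```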

### Optional Keys


`[log_filter]` : This is used to filter the Kafka stream passed to your use case (see the example after this list). It
has the following two optional sub-sections:
- `[[include]]` : In this sub-section, each key-value pair filters the incoming log stream to include matching documents in the use case. The *key* is the name of a key in the JSON document in the Kafka stream. The *value* has to be a regex pattern that matches the content of that key.
- `[[exclude]]` : In this sub-section, each key-value pair filters the incoming log stream to exclude matching documents from the use case. The *key* is the name of a key in the JSON document in the Kafka stream. The *value* has to be a regex pattern that matches the content of that key.
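
For example, the following filter (the key names and patterns are illustrative) would keep only documents whose `eventtype` is exactly `login` while dropping any whose `sourceip` starts with `10.`:

```
[log_filter]
[[include]]
eventtype = ^login$
[[exclude]]
sourceip = ^10\.
```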

The .py File
---------------

The .py file is the brains of the system. This is where all the map-reduce and model
training happens. The user needs to implement a method named `load` in this .py
file. dataShark provides two flavors of the `load` function to implement, one for
streaming and one for batch processing. Following is the basic definition of the load
function for each type:

*For batch processing:*


```python
def load(batchData):
```
The data file provided as input in the .conf file is loaded and passed to the function
in the variable `batchData`, which is of type `PythonRDD`.
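
As an illustration, here is a minimal sketch of a batch-mode `load` function. The exact return value expected in batch mode is not spelled out here, so this sketch assumes an RDD of `(primary_key, anomaly_score, metadata)` tuples matching the streaming format described further below, and it assumes the data file contains plain-text log lines whose first whitespace-separated field is a source IP; the scoring is deliberately naive.

```python
# Minimal batch-mode sketch (illustrative assumptions, not dataShark defaults):
# each line in batchData is a plain-text log line whose first field is a source IP,
# and the function returns an RDD of (primary_key, anomaly_score, metadata) tuples.
def load(batchData):
    # Count events per source IP
    counts = batchData.map(lambda line: (line.split()[0], 1)) \
                      .reduceByKey(lambda a, b: a + b)

    # Use the raw event count as a naive anomaly score and attach extra metadata
    return counts.map(lambda kv: (kv[0], float(kv[1]), {"event_count": kv[1]}))
```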

*For stream processing:*


```python
def load(streamingData, trainingData, context):
```
`streamingData` is the Kafka stream being sent to the `load` function. It is of type
*DStream*.
`trainingData` is the training file loaded from the *training* key mentioned in the
.conf file. It is of type *PythonRDD*.
`context` is the Spark context loaded in the driver. It may be used, for example, to
create accumulators.

The function `load` is expected to return a processed *DStream*. Each record in the
RDDs of the DStream should be in the following format (this format is required for
compatibility with the output plugins):
`('primary_key', anomaly_score, {"some_metadata": "dictionary here"})`

*primary_key* is a string. It is the tagging metric by which the data was aggregated
for map-reduce and finally scored.
*anomaly_score* is of type float. It is the value used to define the deviation from
normal behavior.
*metadata* is of type dictionary. This is the extra data that needs to be inserted
into the Elasticsearch document or added to the CSV as extra columns.
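
Putting this together, here is a minimal sketch of a streaming `load` function. It assumes each element of the Kafka stream is a JSON string containing a `sourceip` key (as in the sample .conf above) and that the training file holds one such JSON document per line; the field name and the baseline-based scoring are illustrative only.

```python
import json

# Minimal streaming sketch (illustrative assumptions, not dataShark defaults):
# each element of streamingData is a JSON string with a "sourceip" key, and
# trainingData contains one such JSON document per line.
def load(streamingData, trainingData, context):
    # Baseline: mean number of events per source IP in the training data
    training_counts = trainingData.map(lambda line: (json.loads(line)["sourceip"], 1)) \
                                  .reduceByKey(lambda a, b: a + b)
    baseline = training_counts.values().mean()

    # Score each micro-batch: deviation of the per-IP count from the baseline
    counts = streamingData.map(lambda msg: (json.loads(msg)["sourceip"], 1)) \
                          .reduceByKey(lambda a, b: a + b)
    return counts.map(lambda kv: (kv[0], float(kv[1] - baseline),
                                  {"event_count": kv[1]}))
```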

-------------------

Output Plugins
=============
dataShark provides the following 3 output plugins out-of-the-box for storing data:

> 1. Elasticsearch
> 2. Syslog
> 3. CSV

Each of these plugins requires its own basic set of settings, described below.

### 1. Elasticsearch Output Plugin

The Elasticsearch output plugin lets you easily push JSON documents to your
Elasticsearch node, so users can build Kibana visualizations over the processed data.

Following is the basic template for configuring the Elasticsearch output plugin:

```text
output = elasticsearch
[elasticsearch]
host = 127.0.0.1
port = 9200
index_name = usecase
doc_type = spark-driver
pkey = source_ip
score = anomaly_score
title = Use Case
debug = false
```
All settings in the config are optional. Their default values are displayed in the
config above.

- `host` : Host IP or hostname of the ES server.
- `port` : Port number of the ES server.
- `index_name` : Name of the index to push documents to for this use case.
- `doc_type` : Document type name for this use case.
- `pkey` : Primary key field name to show in the ES document.
- `score` : Anomaly score field name to show in the ES document.
- `title` : The value of the title field in the ES document.
- `debug` : Set this to **true** to display each JSON record being pushed to ES on the console.

### 2. Syslog Output Plugin

The Syslog output plugin sends JSON documents to the specified syslog server IP
and port. Following is the sample configuration with default settings for the plugin
(all settings are optional):

```
output = syslog
[syslog]
host = 127.0.0.1
port = 514
pkey = source_ip
score = anomaly_score
title = Use Case Title
debug = false
```
The settings are similar to those of the Elasticsearch plugin.

### 3. CSV Output Plugin

The CSV output plugin writes and appends output from a Spark use case to a
specified CSV file. Following is the sample configuration with default settings of the
plugin (all settings are optional):

```
output = csv
[csv]
path = UseCase.csv
separator = ,
quote_char = '"'
title = Use Case
debug = false
```
