
User Guide

June 2014

Copyright Statement
Copyright 2000-2014 DataSift
All Rights Reserved

This document, as well as the software described herein, are
protected by copyright laws, and are proprietary to DataSift.

Disclaimer
The information in this document is subject to change without
notice and should not be construed as a commitment by DataSift.
DataSift assumes no responsibility for any errors that may appear
in this document.

Trademarks and Patents
DataSift and the DataSift logo are trademarks of DataSift and may
be registered in some jurisdictions. All other marks are the
trademarks of their respective owners.
Because of the nature of this material, numerous hardware and
software products are mentioned by name. In most, if not all
cases, these product names are claimed as trademarks by the
companies that manufacture the products. It is not the intent of
DataSift to claim these names or trademarks as its own.




Contents
1 DataSift Overview ........................................................................................................... 1
Social Media ............................................................................................................................................. 1
DataSift Platform ..................................................................................................................................... 3
Stage One: Aggregation ........................................................................................................ 3
Stage Two: Processing .......................................................................................................... 4
Stage Three: Delivery ............................................................................................................ 7
Historic Streams ...................................................................................................................................... 9
Billing ........................................................................................................................................................ 9
Registering an Account ......................................................................................................................... 10
Web Application Interface .................................................................................................................... 12
2 Configuring Sources ..................................................................................................... 15
Finding Sources ..................................................................................................................................... 15
Source Types .......................................................................................................................................... 15
Viewing Source Information ................................................................................................................ 18
Activating Sources ................................................................................................................................. 20
Source Activation Impact ..................................................................................................................... 22
3 Configuring Streams in Query Builder ....................................................................... 23
Query Builder ......................................................................................................................................... 23
Enabling Sources ................................................................................................................................... 24
Creating New Streams .......................................................................................................................... 25
Creating Simple Filters .......................................................................................................................... 26
Reviewing Filter Cost ............................................................................................................................. 29
Previewing Streams .............................................................................................................................. 31
Creating Multiple Filter Conditions ..................................................................................................... 33
Using Logical Operators ....................................................................................................................... 36
Embedding & Customizing Query Builder ......................................................................................... 37
4 Analyzing Interactions ................................................................................................. 39
Displaying Interaction Details .............................................................................................................. 39
Analyzing Interaction Details: Web Application ............................................................... 41
Analyzing Interaction Details: API ...................................................................................... 47
5 Writing Simple Filters in CSDL ..................................................................................... 49
Filtering Condition Elements ............................................................................................................... 49
Selecting Targets ................................................................................................................................... 49
Selecting Operators .............................................................................................................................. 51
Using Multiple Conditions .................................................................................................................... 57
Hints & Tips ............................................................................................................................................ 58
6 Configuring Streams: CSDL Web Application ........................................................... 59
Enabling Sources ................................................................................................................................... 59
Writing Filters with CSDL Editor........................................................................................................... 60
Validating Filters .................................................................................................................................... 67
Compiling Filters .................................................................................................................................... 68
Previewing Streams .............................................................................................................................. 69
7 Configuring Categorization ......................................................................................... 71
Configuring Tagging .............................................................................................................................. 72
Configuring Tag Namespaces .............................................................................................................. 74

Configuring Scoring ............................................................................................................................... 77
Configuring Cascading .......................................................................................................................... 80
Including Library Classifiers ................................................................................................................. 82
Billing ...................................................................................................................................................... 85
8 Configuring Streams: API ........................................................................................... 87
Enabling Sources ................................................................................................................................... 87
Making API Calls .................................................................................................................................... 88
Validating Filters .................................................................................................................................... 88
Compiling Filters .................................................................................................................................... 90
Previewing Streams .............................................................................................................................. 92
9 Configuring Stream Recording .................................................................................... 93
Data Destinations .................................................................................................................................. 93
Starting Record Tasks ........................................................................................................................... 94
Viewing Record Tasks ........................................................................................................................... 97
Pausing Record Tasks ........................................................................................................................... 98
Stopping Record Tasks ......................................................................................................................... 98
Exporting Record Task Data................................................................................................................. 99
Deleting Record Tasks ........................................................................................................................ 103
10 Configuring Historic Previews ................................................................................. 105
Historic Archive .................................................................................................................................... 105
Historic Preview ................................................................................................................................... 105
Report Types ........................................................................................................................................ 106
Historic Preview Billing ....................................................................................................................... 113
Configuring Historic Preview ............................................................................................................. 114
Downloading Reports ......................................................................................................................... 117
11 Configuring Historic Stream Recording ................................................................. 119
Historic Tasks ....................................................................................................................................... 119
Data Destinations ................................................................................................................................ 120
Starting Historic Tasks ........................................................................................................................ 121
Viewing Historic Tasks ........................................................................................................................ 124
Pausing Historic Tasks ........................................................................................................................ 125
Stopping Historic Tasks ...................................................................................................................... 125
Deleting Historic Tasks ....................................................................................................................... 126
12 Configuring Destinations: Amazon S3 .................................................................. 127
Configuring Amazon S3 for DataSift ................................................................................................. 127
Configuring DataSift for Amazon S3 ................................................................................................. 134
13 Configuring Destinations: Google BigQuery ........................................................ 139
Google BigQuery ................................................................................................................................. 139
Configuring Google BigQuery for DataSift ....................................................................................... 141
Configuring DataSift Web Application for Google BigQuery ......................................................... 149
Configuring DataSift API for Google BigQuery ................................................................................ 154
Querying Data in BigQuery ................................................................................................................ 155
Deleting Google Cloud Projects ......................................................................................................... 157
14 Configuring Push Delivery: API .............................................................................. 161
Push Delivery Workflow ..................................................................................................................... 161
Locating API Credentials ..................................................................................................................... 161
Validating Push Destinations ............................................................................................................. 162
Creating Push Subscriptions .............................................................................................................. 163

Checking Push Subscriptions ............................................................................................................. 166
Retrieving Push Subscription Logs .................................................................................................... 167
Pausing Push Subscriptions ............................................................................................................... 169
Resuming Push Subscriptions ........................................................................................................... 169
Stopping Push Subscriptions ............................................................................................................. 170
Deleting Push Subscriptions .............................................................................................................. 170
15 Configuring Destinations: MySQL ......................................................................... 171
Configuring Amazon RDS ................................................................................................................... 171
Configuring Databases ....................................................................................................................... 181
Configuring Database Tables............................................................................................................. 183
Configuring Mapping .......................................................................................................................... 185
Configuring DataSift Destination (API) ............................................................................................. 187
Configuring DataSift Destination (Web Application) ...................................................................... 192
16 Monitor Usage & Billing ........................................................................................... 195
Subscription Model ............................................................................................................................. 195
Modifying Billing Account Details ...................................................................................................... 196
Viewing Usage & Balance ................................................................................................................... 197
Setting Overage Limit ......................................................................................................................... 198
Viewing Usage Statistics ..................................................................................................................... 199
Viewing Current Streams ................................................................................................................... 203
Locating Pricing Information ............................................................................................................. 203
17 Locating Help ............................................................................................................ 205
Locating Documentation .................................................................................................................... 205
Viewing Platform Status ..................................................................................................................... 207
Viewing Known Issues ........................................................................................................................ 208
Forum Discussions .............................................................................................................................. 208
Subscribing to Updates ...................................................................................................................... 209
Attending Workshops ......................................................................................................................... 209
Submitting Support Requests............................................................................................................ 210
Viewing Screencasts ............................................................................................................................ 211
Attending Training ............................................................................................................................... 211



1 DataSift Overview
DataSift is a powerful cloud platform for extracting value from social networks, blogs
and forums. It achieves this by capturing, augmenting, filtering and delivering social
media interactions.
Social Media
Social media interactions come from a wide range of sources which include Twitter,
Facebook, blogs, message boards, social news, comments and forums.
Every day, millions of new social interactions are created and shared across these
networks. More than a billion people are sharing information, including comments
on brands, products and services.

Figure 1: Social Media Interaction Figures
Consumers across all age groups spend more time on social networks than on any
other category of site.
- 50% of people are recommending brands on social networks
- 90% of people are listening to recommendations on social networks
- 50% of people will see a company's social presence before the corporate web site
Before the rise of social media, interactions were one-to-one or within small groups,
whereas social interactions are typically many-to-many, with some users having
considerable influence.
Social interactions are real time, come from many different sources and may be seen by
millions of people.


Business Challenges
Many companies build a social network presence which is used for marketing and
advertising. However, they may not have the ability to manage vast quantities of social
network interactions which come from many different sources, have different formats
and are presented without any context.
Social interactions require filtering and analysis before they can be used to inform
business decisions.
Use Cases
Examples of how businesses can use social media data include:
- Social customer service
- Reputation management
  o Track conversations and content around a brand
  o Identify trends
- Competitive analysis
  o Measure share of social activity vs. the competition
- Marketing planning & measurement
  o Identify content which is generating most interest and engagement
- Identify social influencers
  o Identify and rank influencers of brand and competitor brands
- Identify social advocates


DataSift Platform
The DataSift platform uses a three-stage process to derive value from social media
interactions.
1. Aggregate interactions from multiple sources
2. Process the interactions by normalizing the structure, adding meta-data and
filtering based on user-defined parameters
3. Deliver a stream of filtered interactions to a destination
This allows relevant data, in a standard and enhanced format, to be integrated into
products and applications for analysis. Streams can be delivered from live updates or
from an archive of historic interactions.
Stage One: Aggregation
Interaction data is captured from a growing list of sources.
Twitter, IntenseDebate, Instagram, Topix, Tumblr, NewsCred, IMDb, Facebook, Google+,
Reddit, Videos, Sina Weibo, YouTube, Wikipedia, Blogs, WordPress, Bitly, DailyMotion,
Boards

DataSift has agreements with many social networks to gather all public interactions.
Billions of social interactions are captured at a rate of over 15,000 per second.
It is also possible to gather interactions from networks and sites which require a login.
These are called Managed Sources.
Managed Sources
Companies often have a social media presence on networks such as Facebook and
Google+ relating to their brand or products. To capture the interactions on these private
pages and sites, the company provides DataSift with revocable access tokens. DataSift
uses the tokens to collect interactions.
For simplicity of filtering, interactions from managed sources are aggregated with all
other sources before augmentation and filtering. The number of managed sources is
growing and includes the following:
- Facebook Pages
- Google+
- Instagram
- Yammer


Stage Two: Processing
Normalization
Normalization is the process of standardizing interaction format without losing any of
the information unique to particular networks. This greatly simplifies the job of
managing and analyzing the stream of filtered interactions.
Augmentation
Augmentation is the process of adding extra information to each interaction. The
augmented interaction carries contextual information which allows more sophisticated
filtering and analysis.
- Demographic & Gender
  o Interactions are augmented with demographic and gender information,
    allowing companies to build a dynamic picture of a market segment.
    EXAMPLE:
    Starbucks demographic and gender analysis:
    http://demo.datasift.com/coffee/
- Klout
  o A Klout score is a measure of an author's influence in the online
    community. The scale is from 1 to 100, with 100 being the most influential.
    Klout scores are provided by klout.com, who analyze 400 signals from
    eight networks every day.
- Language
  o A language detector recognizes over 140 languages and provides an
    ISO 639-1 language code for interactions.
  o Detected languages are provided with a confidence rating percentage.
- Link
  o Links in social interactions are often shortened and may be redirected.
    The DataSift platform resolves links to their final destination.
- Salience & Sentiment
  o The Salience augmentation lists the entities found in a post. Entities are
    typically proper nouns such as people, companies, products, and places.
  o Sentiment scoring allows the rating of positive or negative assertions.



Categorization
Categorization is the process of applying user-defined tags to interactions. These tags
can be used when processing and analyzing the output stream. Tags form a hierarchical
namespace; for example, tags could include "tag.device.name", "tag.device.type" and
"tag.device.type.model".
Scores can also be applied to tags and are cumulative: as interactions are matched by
filter conditions, tag scores can be incremented or decremented.
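For orientation, a tagging rule in CSDL wraps an ordinary filter condition, and a return
block defines the filter itself. The namespaced sketch below is illustrative only; the exact
syntax, including scoring, is covered in the Configuring Categorization section.

    tag.device.type "mobile"  { interaction.content contains_any "iphone,android" }
    tag.brand "starbucks"     { interaction.content contains "starbucks" }
    return { interaction.content contains_any "coffee,espresso,latte" }

Tags are applied to each interaction that the return block passes to the output stream.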
Filtering
Using more than 300 unique fields and advanced operators, one or more filters can be
defined to generate streams of relevant interactions. Two methods are available for
creating filters: a visual tool called Query Builder and a programming language called
Curated Stream Definition Language (CSDL).
Query Builder
Query Builder is a web application allowing users to create and edit filters without
having to learn CSDL. An instance of Query Builder is provided on the DataSift
dashboard.

Figure 2: Query Builder Example


Curated Stream Definition Language (CSDL)
CSDL is a simple programming language which allows a filter to be created using
targets, operators, arguments and logic. Custom tags and scores can be added to the
interactions which are passed by the filter to the output stream.
The following CSDL has the same effect as the Query Builder example.

Figure 3: CSDL Example
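Figure 3 is reproduced as a screenshot. As an illustrative sketch only (not the exact code
in the figure), a CSDL filter with the same general shape as the Query Builder example,
and as the Starbucks stream built later in this guide, might read:

    interaction.content contains "starbucks"
    AND klout.score > 40
    AND salience.content.sentiment > 0

Each line pairs a target with an operator and an argument, and logical operators join the
conditions.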
CSDL can be written and compiled using an editor on the DataSift dashboard or
submitted for compilation by Application Programming Interface (API) calls.


Stage Three: Delivery
The augmented and filtered interactions make up the output stream. The stream can be
delivered to pre-configured destinations or consumed in real time.
Destinations
The DataSift platform can be configured to send the output stream to a growing
number of destinations which are configured through the web application or API.

Figure 4: Example Destinations
Using Amazon Simple Storage Service (S3) as an example, the destination is configured
in the DataSift platform with user credentials, keys, storage container, folder and file
details. The amount and frequency of data delivery are configured to ensure none is lost.
Configuration is by web application or API.
Data delivery is buffered for up to one hour and delivery is guaranteed.


Streaming
Using the API, your stream of augmented and filtered interactions is available in real
time. A connection is made between your application and the DataSift platform using
HTTP or WebSockets.
Output Format
Each interaction in the stream is represented as attribute-value pairs in a JavaScript
Object Notation (JSON) object. Some attributes are the same for all interactions, while
others are specific to the interaction source. For example, a retweet count would not be
available in a Facebook JSON object.

Figure 5: Example Twitter JSON Object (excerpt)


Historic Streams
Interactions from many sources are archived, which allows filters to be run against
historic data for a user-defined period. It is possible to estimate data volume and job
completion time by running filters against a 1% or 10% sample of historic data. Data
delivery from the archive happens within minutes or hours.
Billing
There are on-demand and subscription payment models. Billing for platform usage is
based on the amount of processing required to generate the stream and a license fee
for each interaction in the stream.
Platform Fees
Platform fees are charged in Data Processing Units (DPUs). A single DPU is equivalent to
the processing that can be carried out in one hour by a single standardized processor.
Complex filters require more processing than simple filters, so each filter is
automatically assigned a DPU rating. Running a filter rated at 3 DPUs for 10 hours
results in 30 DPU hours of usage.
Data Fees
There is a charge for each interaction delivered in the stream from your filter. A
single stream may incur licensing charges for interactions from many sources. Each
source must be manually enabled and a license agreement must be signed.


Registering an Account
To register a DataSift platform account, open a web browser and go to
http://datasift.com. Click the Login link.

Figure 6: datasift.com Login link
Login and Register tabs are shown. Click the Register tab.

Figure 7: Register Tab
Complete the form fields to register a new DataSift account, or link a new DataSift
account to an existing account from any of the networks shown. Linking your DataSift
account allows login to the DataSift platform using single sign-on with another
network's credentials.

Figure 8: New Account Options


If completing the form, ensure the username only contains letters, numbers, periods
and underscores. Click the terms and conditions link to view the terms and conditions
in a new window. If you agree, select I agree and click the create button.

Figure 9: DataSift Account Form


Web Application Interface
When the new streams dialog has been completed or skipped, the web application
interface is displayed with the default settings explained below.
Account Details
The top of the page displays account information including a link to account settings,
number of unread notifications and license cost.

Figure 10: Account Details
NOTE:
This may look different when using a Pay-as-you-go (PAYG) account.
Tips
Tips are displayed at the top of each page until they are dismissed. Tips are used to
provide assistance and suggestions. A new account has a tip suggesting that the Twitter
data source be enabled.

Figure 11: Tips
Notifications
Clicking the notifications icon shows unread notifications from the platform, which may
include information about completed jobs or billing.

Figure 12: Notifications


Adding Connected Identities
A variety of other online identities can be linked to the DataSift account to provide
single sign-on using the other account's credentials.
Click Settings on the top navigation bar and click Identities. Add identities from the list
on the right.

Figure 13: Adding Connected Identities
Pre-Configured Streams

Figure 14: Streams Tab
The web application has multiple pages identified by the tabs shown. New accounts
start with the Streams page open.
Filters are configured in the Streams page. These filters can be used to create streams
of interactions which match the filter conditions.

Figure 15: Pre-Configured Streams
How to edit and use these filters is covered in later modules.

Dashboard

Figure 16: Dashboard Tab
The dashboard page displays the notifications mentioned previously along with a list of
configured stream filters.

Figure 17: Dashboard - Details
The lower half of the page shows API usage divided into the Sources & Augmentations
used, and the number of hours.

Figure 18: Dashboard - API Usage



2 Configuring Sources
Sources are the social networks, news sites, forums, comments, message boards, blogs
and other networks which provide input interactions for the DataSift platform. This
section explains the source types and how to find detailed information on each source.
It also demonstrates data source configuration.
Finding Sources
Data sources are listed in the DataSift web application. Log in to http://datasift.com
and select the Data Sources tab. New sources appear here automatically.

Figure 19: Data Sources Page
Source Types
Sources are divided into three types: Feeds, Managed Sources, and Augmentations.
Feeds
Feeds are the most common type. They are identified on the Data Sources page with
light blue tabs in the corner.

Figure 20: Example Feed Source

Some feeds are from sources which send interactions to the DataSift platform as they
occur. Twitter is a good example of this type. There is a firehose of interactions coming
from Twitter directly to the DataSift platform. Tumblr is another example of this type.
Other feeds are from sources which are constantly monitored by the DataSift platform.
In either case, interactions become input to the platform and available for filtering.
Use the links on the left to further filter the choice of sources.

Figure 21: Selecting Social Network Sources
Managed Sources
Companies often create a social media presence on various networks, such as
Facebook and Google+. Interactions on these pages and sites are not always public;
they are protected by login credentials.
Managed sources allow a customer to include interactions from closed pages or public
pages which require a login before they can be accessed. This allows all interactions
from feeds and managed sources to be filtered as a single stream.
The company acquires an access token from the source, which is used by the DataSift
platform to retrieve interactions. The token can be revoked at any time and interactions
from managed sources remain private. Managed Sources are available on the Managed
Sources page of the Data Sources tab.

Figure 22: Managed Sources Page

Augmentations
Interactions from data sources contain a lot of information specific to that data source
and the content of the interaction. However, further information can be retrieved or
calculated by the DataSift platform to enrich the interaction.
Augmentations are identified on the Data Sources page with purple tabs in the corner.

Figure 23: Example Augmentation Source
Information added by augmentations may include language, positive or negative
sentiment in a message, topics, trends, social media influence, gender, and many more.
Augmentations are applied to all applicable interactions in the same way, allowing a
single stream of interactions to be generated from the filter.

Figure 24: Augmentation of all Interactions
Many of the data feeds share common features such as the main text, the creation
time, the author, and frequently links to other online content. The Interaction
augmentation groups these features together, making it easier to write filters
across all the feeds.


Viewing Source Information
Each source has a page with a detailed description. From the Sources or Managed
Sources pages, click the source name or logo.

Figure 25: Open Source Description
In this example of the Facebook public source, information about the number of
interactions and a breakdown of types and languages is shown.

Figure 26: Example Source Description

Pricing
In each source description, the price for using the source is shown. This is the data
licensing price for each interaction returned by your filter from this source. It is in
addition to the Data Processing Unit (DPU) hours charged for running the filter.

Figure 27: Example Data License Pricing
Target Fields
A target is something which can be used in a filter. Each source may contain many tens
of targets. Under the source description, the source targets are shown. This can be used
when selecting sources to identify what fields to use in a filter.

Figure 28: Example Target Fields
Sample Definition
The sample definition is an example of how the target fields can be used in a Curated
Stream Definition Language (CSDL) filter. In this example, the CSDL matches Tweets
containing the words "San Francisco" which have been retweeted more than nine times,
by users with over 999 followers or with at least 10 times as many people following them
as they follow.

Figure 29: Twitter Sample Definition
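Figure 29 appears as a screenshot. A hedged reconstruction of the conditions described
above might read as follows; twitter.user.follower_ratio is assumed here to be the target
expressing the follower-to-friend ratio:

    twitter.text contains "San Francisco"
    AND twitter.retweet.count > 9
    AND (twitter.user.followers_count > 999
         OR twitter.user.follower_ratio >= 10)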


Example Output
When an interaction has been matched by a filter, the interaction data, any augmented
data and any extra tagging information added by the filter are sent to a destination or
streamed.
The example output shown on the source page is an output interaction formatted as a
JavaScript Object Notation (JSON) object.

Figure 30: Example Output
The actual attribute-value pairs in the JSON object will vary depending on which
augmentations are enabled.
Activating Sources
Some sources are automatically activated when a new account is registered. Others
require activation. Previous modules have covered Twitter as an example of an inactive
source which requires a license to be signed before it is activated. From the Data
Sources page, use the Activate and Deactivate buttons.

Figure 31: Source Activate and Deactivate Buttons
Some sources are restricted to accounts on premium subscriptions. These are displayed
with an Enquire button. The Enquire button opens a web form to request access to the
source.



Figure 32: Enquire Button on a Premium Source
NOTE:
When activated, the Demographics augmentation anonymizes all interactions.
Verifying Source State
The Data Sources page shows which sources are active or inactive. Sources with a gray
Deactivate button are already activated.

Figure 33: Verifying Source State
Activating Augmentations
Augmentation sources are activated in the same way as feed sources.
Activating Managed Sources
Managed sources can be defined multiple times; for example, a customer may have
more than one Google+ page to use as an interaction source. In this screenshot, two
Google+ pages are being used as sources. Defined instances are listed under My
Managed Sources.

Figure 34: Two Instances of a Managed Source


On the Managed Sources page, the Google+ source doesn't have an
Activate/Deactivate button. The + symbol is used to create another instance of the
managed source.

Figure 35: Managed Source Available for New Instances
Source Activation Impact
When a feed is activated, the interactions become available immediately to all running
filters. Any running filters which would match interactions from the new feed will then
produce those interactions in the output stream.
As soon as the new feed is activated, the data license fee for interactions from the new
feed is charged.



3 Configuring Streams in Query Builder
This section describes the Query Builder editor and walks through the process of
creating, validating and compiling filters with Query Builder.
Query Builder
Query Builder is a simple and powerful web application for creating and editing filters.
The alternative to Query Builder is writing filters using the Curated Stream Definition
Language (CSDL). An instance of Query Builder is provided on the DataSift dashboard.

Figure 36: Choosing Query Builder Editor
HINT:
Query Builder filters can be converted into CSDL with one click.


Enabling Sources
Creating filters in Query Builder allows interactions from a number of data sources to
be filtered. Before creating the filter, choose which sources are used and ensure they
are enabled. Sources are divided into three types shown below.
- Feeds: public sources of interactions (for example Twitter, Tumblr, Reddit)
- Managed Sources: sources of interactions which require authentication (for example
  Facebook Pages, Google+, Instagram)
- Augmentations: extra information retrieved or calculated about the interaction (for
  example Demographics, Sentiment, Language)



Enable and disable Sources in the Data Sources page of the dashboard.

Figure 37: Location of Data Sources Configuration
Refer to the training module on Configuring Sources for more detail on managing
sources.
Creating New Streams
Query Builder is an editor which creates conditions to filter data from the enabled data
sources. To use Query Builder, click the Create Stream button on the Streams page.

Figure 38: Creating a new Stream


Enter a Name for the new stream and, optionally, a Description. In this example, the
filter name is Starbucks Filter. There are two choices of editor: Query Builder and CSDL
Code Editor. Select Query Builder and click Start Editing.

Figure 39: Selecting Query Builder Editor
TIP
Multiple streams can have the same name so devise your own naming
convention.
Creating Simple Filters
Query Builder allows multiple filter conditions to be combined to generate the content
for an output stream. The new stream definition opens with no filter conditions. Click
Create New Filter to create the first filter condition.

Figure 40: Create New Filter Condition


Creating Filter One
A list of enabled sources is shown. Use the left and right arrows to scroll for more
sources. If a required source is not visible, it may not be enabled.

Figure 41: Choose Sources
When a source is chosen, the targets within the source are shown. In this example, the
filter condition will be applied to the TWEET message from the Twitter source.

Figure 42: Choose Targets


Depending on the source and target chosen, there may be more target layers to select
from. In the example of filtering on the message in a tweet, the text and an operator
must be selected.

Figure 43: Contains Words Condition
With this target, there are multiple operators to allow matching of text strings in a
variety of ways.

Figure 44: Text Operators
When entering multiple words for Contains or Contains words near, hit enter between
each word as shown below. Text strings are not case sensitive unless the Aa button is
selected.

Figure 45: Entering Multiple Text Strings


When the filter condition is complete, click Save and Preview. A summary of the new
filter condition is shown.

Figure 46: Summary of Filter Condition
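The words entered in this example are not reproduced in the text; as a point of
reference only, the one-click CSDL conversion mentioned earlier would turn a Contains
words condition on the Tweet message into something along the lines of the sketch
below (the words shown are illustrative):

    twitter.text contains_any "starbucks,frappuccino"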
Reviewing Filter Cost
Pricing is made up of a Data Processing Unit (DPU) cost and a data licensing cost per
interaction in the output stream. More complicated filter conditions consume more DPU
hours but may also reduce the data licensing cost by being more specific.
To review the cost of a filter, open the Streams page and click on the stream name.

Figure 47: Select Existing Stream


The cumulative DPU cost for the filter is shown on the left. At the bottom of the page,
a Stream Breakdown shows the DPU usage for the stream.

Figure 48: Review DPU Cost
The following example shows the DPU breakdown for multiple filter conditions.

Figure 49: DPU Breakdown for Multiple Filter Conditions
The cost of a DPU may change over time. DPUs and pricing are covered in a separate
training module.


Previewing Streams
The stream preview is used to review and fine-tune filters by identifying irrelevant
interactions in the output stream. The filter conditions can then be modified to exclude
them.
From the Streams page, click on the stream name to open the summary page. Click Live
Preview.

Figure 50: Live Preview
A summary of enabled sources is shown. Check that only the required sources for this
stream and any other running streams are in the list. Enabling more than the required
sources may unnecessarily increase the data licensing cost.

Figure 51: Sources Summary
NOTE:
DPU cost is incurred when using the live preview.



Click the play button at the bottom of the screen to start previewing interactions
matched by the filter conditions.

Figure 52: Play Button
Interactions with their augmentation icons and information appear as they are
matched.

Figure 53: Example Preview Interactions
NOTE:
The number of interactions sent per second is limited in preview mode.
The total number of interactions sent may also be limited.
To stop the preview, click on an interaction or click the pause button.

Figure 54: Pause Button


Creating Multiple Filter Conditions
Filtering becomes much more powerful when multiple conditions are combined. In this
example, two more filter conditions are added to the Starbucks stream. Click on the
stream name to open the stream summary.

Figure 55: Edit Stream
Click the Create New Filter button to add a new filter condition.

Figure 56: Add filter Conditions


Creating Filter Two
Add a filter condition for a Klout score over 40.

Figure 57: Klout over 40


Creating Filter Three
Add a third filter condition for any level of positive sentiment in interaction content.

Figure 58: Above Zero Sentiment


All three filters are now listed in the filter preview.

Figure 59: Three Filters in Filter Preview
Using Logical Operators
Above the filter descriptions are three options for the logic which should be applied to
the filter conditions. The default is ALL of the following.

Figure 60: Filter Condition Logic
If ALL of the following is selected, a logical AND is used between the conditions so all
three conditions must be matched in an input interaction for it to be sent to the output
stream.
If ANY of the following is selected, a logical OR is used between the conditions so if an
input interaction is matched by one or more conditions, the interaction is sent to the
output stream.

Enabling Advanced Logic
The ADVANCED selection expands the logic to show what is currently applied and
allows more complex logic to be defined. The current example shows logical ANDs
between each condition.

Figure 61: Advanced Logic Expanded
Grouping Conditions
If using different logical operators between conditions, make use of brackets to make it
clear which conditions are grouped. In this example, condition 1 must be matched along
with either of conditions 2 or 3. Conditions, operators and brackets can be dragged to
the required position.

Figure 62: Brackets and Dragging in Advanced Logic
Negating Conditions
Use the NOT operator to negate a condition or group of conditions. In this example, the
interaction is sent to the output stream if condition 1 is matched and neither condition
2 nor condition 3 is matched.

Figure 63: Negated Conditions
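Expressed in CSDL, which the ADVANCED view mirrors, the negated grouping described
above would look roughly like the sketch below, using the three Starbucks conditions as
stand-ins for conditions 1, 2 and 3:

    interaction.content contains "starbucks"
    AND NOT (klout.score > 40 OR salience.content.sentiment > 0)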
Embedding & Customizing Query Builder
Query Builder is available under an open source license as an independent,
embeddable module written in JavaScript. Anyone can add it to their own web page,
blog, or a web view in a desktop or mobile application.
Documentation and examples for embedding and customizing Query Builder are
available at http://dev.datasift.com/editor

4 Analyzing Interactions
Interactions contain data from the original source as well as augmentation data
provided by the DataSift Platform. This section shows how to analyze the details of an
output interaction using the Web Application.
Displaying Interaction Details
Interaction details are available from the live preview of a stream. From the Streams
page, click on a stream name to display the stream summary.
Click Live Preview.

Figure 64: Stream Summary Page
Review the summary of Live Preview sources to ensure the correct sources are enabled.

Figure 65: Live Preview Sources


Click the play button to start previewing interactions matched by the filter conditions.

Figure 66: Live Preview Play Button
Pause the preview when there are interactions to analyze by clicking the pause button.
Clicking an interaction also pauses the preview.

Figure 67: Live Preview Pause Button
Move the mouse pointer over the interaction to be analyzed. The interaction becomes
grayed and a bug icon appears.

Figure 68: Bug Icon on an Interaction
Click anywhere on the interaction to open the debug viewer.

Figure 69: Interaction Debug Viewer


Analyzing Interaction Details: Web Application
Interaction information is available as icons below the message and in a debug window.
Augmentation Icons
Some interactions have icons displayed under the message summary. These mostly
relate to information available from augmentations. If they are not visible, the necessary
augmentations may not be enabled.
The following example shows the user's avatar with a Klout score of 46, and the message
content has positive sentiment.

Figure 70: Augmentation Icons


Interaction Example
Interaction details are available for all interactions regardless of source. While each data
source or managed source has its own attributes, the interaction augmentation
provides a consistent set of attributes regardless of the source type.
Information about the author is separated from the message information.

Figure 71: Example Interaction Output
Not all attributes are available for every data source. For example, some data sources
may not have an avatar.
REFERENCE:
The interaction targets are documented here:
http://dev.datasift.com/docs/targets/common-interaction



Klout Example
With this augmentation enabled, each interaction which matches the filter conditions is
augmented with the author's Klout score. Klout is a value for the author's social media
influence. Values are on a scale of 1-100, with higher values being more influential.

Figure 72: Klout Score Interaction Output
REFERENCE:
Klout targets are documented here:
http://dev.datasift.com/docs/targets/augmentation-klout

Language Example
With language augmentation enabled, the language of the message is calculated along
with a value indicating the level of confidence that the language has been identified
correctly. In this example from a different Tweet, the DataSift platform is 100% sure the
message is in English.

Figure 73: Language Interaction Output
It may not be possible to calculate the language for every interaction, in which case the
language section is excluded from the output.
REFERENCE:
The language augmentation and targets are documented here:
http://dev.datasift.com/docs/targets/augmentation-language


Salience Example
If the Salience augmentation is enabled, it adds topics and sentiment to the interaction.
In this example from a different Tweet, two topics have been discovered which are both
companies. In each case, the sentiment of the message about the companies is
negative.

Figure 74: Salience Interaction Output
REFERENCE:
Salience documentation is available here:
http://dev.datasift.com/docs/targets/augmentation-salience



Twitter Example
Interactions matched in the Twitter feed contain attributes and values from Twitter. The
attributes, and the meaning of each attribute, are defined by Twitter and may change.

Figure 75: Twitter Interaction Output


More detailed information about the user is available under the user attribute.

Figure 76: Twitter User Details


Analyzing Interaction Details: API
When consuming interactions programmatically, the interactions arrive as JavaScript
Object Notation (JSON) objects. This is an open standard format providing data as text
attribute-value pairs.
The following example shows a single Tweet matched by a filter and delivered as a JSON
object in a stream from the DataSift platform:
{
    "interaction": {
        "id": "1e376bcd3c3caa80e07422ab947f4e52",
        "type": "twitter"
    },
    "twitter": {
        "created_at": "Mon, 06 Jan 2014 10:25:13 +0000",
        "filter_level": "medium",
        "id": "420138980948992000",
        "lang": "en",
        "source": "web",
        "text": "Test tweet message",
        "user": {
            "name": "Paul Smart",
            "description": "Test account",
            "statuses_count": 41,
            "followers_count": 2,
            "friends_count": 0,
            "screen_name": "DataSiftPaul",
            "profile_image_url": "http://abs.twimg.com/sticky/default_profile_2_normal.png",
            "profile_image_url_https": "https://abs.twimg.com/sticky/default_profile_2_normal.png",
            "lang": "en",
            "listed_count": 0,
            "id": 2268899964,
            "id_str": "2268899964",
            "geo_enabled": true,
            "verified": false,
            "favourites_count": 0,
            "created_at": "Mon, 30 Dec 2013 13:59:22 +0000"
        }
    }
}

Notice how the whole JSON object is surrounded by braces, and further braces divide
the object into blocks for interaction and twitter. Each augmentation adds an extra
block. Each interaction is from a single source, so only one source block is ever present.
Attributes appear with values. JSON allows multiple data types for the values. When an
attribute has no value, the attribute does not appear in the JSON object.

Generating JSON Object Stream
The PUSH API can be used to generate a stream of JSON objects for viewing. In this
example, the hash from a previously compiled filter is used to stream 20 interactions:
$ curl -sX POST https://api.datasift.com/v1/push/create \
-d name=pullsub -d hash=dbdf49e22102ed01e945f608ac05a57e \
-d output_type=pull -d output_params.format=json \
-H 'Authorization: paul:6cab930bdf40cf89e68f2ecad2c'

$ curl -sX POST https://api.datasift.com/v1/pull \
-d id=37c2d26b6596f163276ed8ee1b8cacf4 -d size=1 \
-H 'Authorization: paul:6cab930bdf40cf89e68f2ecad2c'

{
    "interaction": {
        "author": {
            "avatar": "http://pbs.twimg.com/profile_images/normal.jpeg",
            "id": 215741751,
            "language": "en",
            "link": "http://twitter.com/escamil61",
            "name": "\u2693\ufe0f cinthia escamill",
            "username": "escamil61"
        },

[output omitted]



5 Writing Simple Filters in CSDL
Curated Stream Definition Language (CSDL) is the language used to write filter
conditions on the DataSift platform. While CSDL can be used to create very complex
filters, this section covers the syntax and use of CSDL to create simple filters.
Filtering Condition Elements
Filters are made up of one or more filter conditions. Each filter condition usually has a
Target, an Operator and an Argument. Sometimes only the Target and Operator are
required. In the editor, these are color coded blue, red and green.

The example shown below is a filtering condition with all three elements.
interaction.content contains "starbucks"
Selecting Targets
A single interaction contains many individual attributes and values. For example, a
Reddit interaction contains an author name and a title. These attributes are called
targets and a filtering condition starts with the name of a target.
All the targets are listed in the developer documentation in the three groups: feeds,
augmentations and managed sources.
REFERENCE:
The targets available from each source are documented here:
http://dev.datasift.com/docs/targets

Target Data Types
A target contains data which has a data type. Some common data types are listed
below.
- string: a sequence of characters, usually alphanumeric.
  Examples: "datasift", "starbucks Frappuccino"
- int: an integer; a whole number without a fractional or decimal component.
  Examples: 4, 7294745
- array(int): a collection of integers.
  Example: [34, 42, 88, 1]
- float: a floating point number; a number with a decimal component.
  Examples: 4.123, 7294745.9984
- array(string): a collection of strings.
  Example: www.yahoo.com/news.htm, www.yahoo.com/sport.htm
- geo: a geographic region represented as a circle from a point with a radius, a
  rectangle, or a polygon.
  Examples: 51.4553,-0.9689:50 (circle);
  51.4911,-1.0617:51.4194,-0.8921 (rectangle);
  51.4615,-0.9864:51.4586,-0.9472:51.4466,-0.9412:51.4443,-0.9651:51.4445,-0.9831 (polygon)
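To illustrate how a geo argument is supplied to a filter condition, the sketch below uses
the geo_radius operator with the circle form shown above; the geo operators are
covered in the full operator documentation, so treat the exact operator name as an
assumption in this overview:

    twitter.geo geo_radius "51.4553,-0.9689:50"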

Target Examples
The following table shows some commonly used targets with their data type, an
example value, and a description of how it may be used.
Example Target         Data Type  Example Value              Description
interaction.content    string     "Just had a coffee with    The interaction targets are normalized from
                                  Dave at Starbucks"         the original data source. Content is the body
                                                             of the message.
facebook.type          string     "photo"                    The type of content posted in a Facebook
                                                             update.
twitter.geo            geo        51.4553,-0.9689:50         The location of the device used to send the
                                                             Tweet. This may not always be available.
links.domain           string     "bosch.com"                The domain name of the final destination of
                                                             a shared link.
twitter.retweet.count  int        100                        Number of times the interaction has been
                                                             retweeted.
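
As a quick illustration (a hedged sketch only; the operators shown here are explained in
the following sections), targets from the table above might appear in filter conditions
such as:
interaction.content contains "starbucks"
twitter.retweet.count >= 100
links.domain == "bosch.com"
Each line names a target, followed by an operator and, usually, an argument.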


Selecting Operators
The operator defines what comparison is made between the argument and the value of
the target. The choice of operator may affect the DPU cost of the filter condition.
Exists
The exists operator returns a true result if the named target exists in the input
interaction with any value. This is the only time a condition does not need an argument.
The example below shows the exists operator being used to identify interactions
which have a geographic value.
interaction.geo exists
WARNING:
Use the exists operator with great caution. It is likely to match a very large
number of interactions and will rapidly use data licensing credit.
String Operators
The == and != operators match interactions where the argument is exactly the same
string as the target value or is not the same string as the target value. The following
example matches interactions where the screen_name contains exactly the string
shown and nothing more.
twitter.user.screen_name == "datasift"
WARNING:
Use the != operator with great caution. It is likely to match a very large
number of interactions and will rapidly use data licensing credit.
For example, twitter.user.screen_name != "datasift" matches all
twitter interactions not from @datasift and all interactions from all other
active data sources.
The contains operator looks for a string anywhere in the value of a target. The
example shown below matches interactions if the content has the string "hertz rental"
anywhere in upper, lower, or mixed case.
interaction.content contains "hertz rental"
The following table shows the result for each example value.
Example Value Result
"Where can I find Hertz Rental in SFO?" True
"Where can I find hertz rental in SFO?" True
"Why do Hertz never have my rental car" False



The contains_any operator uses a comma-separated list of strings. The condition
returns a true result if any one of the strings is matched in the target value.
interaction.content contains_any "Hewlett-Packard, Hewlett Packard"
WARNING:
If two commas are used as shown in the example below, all interactions with a
space in them will be matched. This returns a very large number of
interactions.
interaction.content contains_any "thinkpad, ,lenovo"
The contains_near operator matches interactions when strings appear within a
specified number of words from each other. The strings cannot contain spaces.
interaction.content contains_near "deere, tractor:5"
The following table shows the result for each example value.
Example Value Result
I have bought a new John Deere Tractor True
Deere make the best tractor for me True
John Deere isn't going to be the right tractor for me False

The contains_all operator matches interactions which contain all of the strings in a
comma separated list.
interaction.content contains_all "deere, tractor"
The following table shows the result for each example value.
Example Value Result
I have bought a new John Deere Tractor True
Deere make the best tractor for me True
John Deere isn't going to be the right tractor for me True

The in operator matches interactions where the value is one of the listed strings or
integers. In the following example, the interaction is matched if the language is any one
of the three shown.
language.tag in "en, de, es"
Brackets must be used if matching integer data types.
twitter.user.id in [111111, 111112, 111113]


The previous string operators match whole words. To match strings which may be a
part of a word, use the substr operator.
interaction.content substr "gator"
The following table shows the result for each example value.
Example Value Result
I almost got caught by an alligator True
I need to drink some gatorade True
Case Sensitivity
Notice in the previous examples, the results were true regardless of the case used in the
string. By default, string operators are not case sensitive. To force case sensitivity, use
the case modifier with any string operator, as shown below.
interaction.content [case(true)] contains_near "FBI, CIA, NSA:15"
Wildcard Operators
Wildcards are used to match strings with the following two special characters:
Special Character  Description
?                  Match exactly one character in a string. Can be included multiple times.
                   "famous??" matches "famous12" and "famously".
*                  Match any string of 0 or more characters.
                   "famous*" matches "famous", "famous1", and "famous123456789ly".

twitter.text wildcard "colo*r"



Numeric Operators
The == and != operators match numeric values which are, or are not, the same as the
argument.
twitter.user.followers_count == 100

twitter.user.followers_count != 0
To compare values which are greater or less than the argument, use the >, <, >=, and <=
operators. These work with integers and floating point numbers.
twitter.retweet_count >= 100


Geographic Operators
Some interactions contain geographic information. There are operators which identify
which interaction values are within an area defined by the argument.
The geo_radius operator defines a point and a radius. The point is defined as a
latitude and longitude; the radius is defined in kilometers. The following example
matches twitter interactions with geographic values within a 50km radius of the
coordinates.
twitter.geo geo_radius "51.4553,-0.9689:50"
When using the web application, geographical areas can be selected on a map.

Figure 77: Web Application Radius Configuration
The geo_box operator uses two sets of coordinates to define the upper-left and lower-
right corners of a rectangle.
interaction.geo geo_box "51.5013,-1.0997:51.4193,-0.8669"


To define more complex areas, use the geo_polygon operator which accepts up to 32
points. Interactions with geographic values within the points are matched.
interaction.geo geo_polygon "51.5002,-1.0815:51.4827,-1.0107:51.4857,-0.9737:51.5079,-0.9647:51.4925,-0.8693:51.4210,-0.8954:51.4296,-1.0567"


Figure 78: Web Application Geo Polygon Configuration
URL Operators
Links are normalized by the DataSift platform to make URL matching easier. The
following are removed from URLs:
Protocol
Hostname
Query strings
Anchor tags
Names of common default pages
The url_in operator is used to match the argument with normalized URLs in the
target.
twitter.links url_in "http://apple.com, http://finance.yahoo.com"
HINT:
The links.normalized_url target is useful for filtering on normalized URLs.
links.normalized_url any "apple.com, finance.yahoo.com"


Negating Conditions
Logic can be added to negate the effect of a condition. In the following example,
interactions which do not contain the string 'data' are matched.
NOT twitter.user.description contains "data"
WARNING:
This also matches Twitter interactions which have no value in the
twitter.user.description target and all interactions from other active
data sources. Use with multiple conditions.
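
As an illustrative sketch (not taken from the original text), the negated condition can be
scoped so it only applies to Twitter interactions that actually carry a description, using
the logical operators described in the next section:
twitter.user.description exists
AND NOT twitter.user.description contains "data"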
Using Multiple Conditions
Most filters have more than one condition. When joining multiple conditions, a logical
operator is used.
Logical Operators
The AND operator is used when an interaction must match both conditions. In this
example, the content of a Tweet must include the word Starbucks and the language
must be English.
twitter.text contains "starbucks"
AND twitter.lang == "en"
However, the example above only matches the word Starbucks in Tweets, not Retweets.
The OR operator could be used to look for Starbucks in either. Note the use of
parentheses to group conditions.
(twitter.text contains "starbucks"
OR twitter.retweet.text contains "starbucks")
AND
(twitter.lang == "en"
OR twitter.retweet.lang == "en")
HINT:
Use interaction.content as a normalized form of twitter.text and
twitter.retweet.text
Use language.tag as a way of matching language across multiple data
sources.
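
Applying that hint, the Starbucks example above can be written more compactly. The
following is a hedged sketch using the normalized targets mentioned in the hint:
interaction.content contains "starbucks"
AND language.tag == "en"
This matches the word in both Tweets and Retweets, and matches the language across
all enabled data sources rather than Twitter alone.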


Hints & Tips
Twitter Mentions
When using the twitter.text or interaction.content targets, Twitter users
mentioned as @username in the text of a Tweet are removed from the text. Mentions
are available in the twitter.mentions and twitter.retweet.mentions targets
with the @ symbol removed.
twitter.text contains "starbucks"
AND twitter.mentions == "beyonce"
HINT:
Use interaction.raw_content to match mentions without the @ symbol
removed.
Hashtags
The # symbol is treated as punctuation by the DataSift platform so using '#datasift' as
an argument would match interactions with '#' and 'datasift' separated by any amount
of whitespace. Use the any operator to match hashtags in twitter targets or use the
hashtag targets to match hashtags in all interactions.
twitter.text any "#starbucks, #nero, #costa"

interaction.hashtags in "starbucks, nero, costa"



6 Configuring Streams - CSDL Web Application
Interactions which have been matched by your filter conditions are provided as an
output stream. This section covers how to write a filter in CSDL using the web application,
validate it, compile it and preview the stream.
Enabling Sources
Interactions from a number of data sources can be filtered with the Web Application
CSDL editor. Before creating the filter, choose which sources to receive data from and
ensure they are enabled. Sources are divided into three types shown below.
Source type      Description                                Examples
Feeds            Public sources of interactions             Twitter, Tumblr, Reddit
Managed Sources  Sources of interactions which require      Google+, Instagram, Yammer
                 authentication
Augmentations    Extra information retrieved or calculated  Demographics, Sentiment, Language
                 about the interaction

Enable and disable Sources in the Data Sources page of the dashboard.

Figure 79: Location of Data Sources Configuration
Refer to the training module on Configuring Sources for more detail on configuring
sources.


Writing Filters with CSDL Editor
From the Streams page, click Create Stream.

Figure 80: Creating New Stream
Enter a Name for the new stream along with an optional Description. Select CSDL
Code Editor before clicking Start Editing.

Figure 81: Stream Name and Editor Selection
Using Targets
The code editor opens with numbered lines. These line numbers are used to reference
which line has a problem if code validation fails.
As targets and operators are typed, the editor provides a list of possible completions.
Either continue typing or select one from the list. When a target is highlighted, a
description appears. The More link shows more information.

Figure 82: Target Auto-Completion and Hints


Using Lists
When using an operator which allows a list of arguments to be defined, the List button
becomes available. In this example, the in operator is used and the List button clicked.

Figure 83: List Configuration
The list editor opens in editing mode. Elements of the list are entered followed by
return. To edit any existing element, click on the text.

Figure 84: Editing Lists


In re-ordering mode, the list editor allows elements to be dragged and dropped into a
new order.

Figure 85: Manual Re-ordering
In deleting mode, click on an element to remove it from the list.

Figure 86: Deleting Elements


The Import button opens a dialogue box where comma separated value (CSV) files are
uploaded or pasted.

Figure 87: Importing Lists
Return to the CSDL editor to see a collapsed form of the list. Click the + symbol to
expand and collapse the list.

Figure 88: Collapsed List in Editor


Using Geo Operators
Notice that targets are automatically colored blue, operators are red and arguments are
green. When using geographic arguments, click the Geo Selection button to open a
map.

Figure 89: Coloring and Geo Button
The map allows coordinates and radius to be selected by clicking on a map rather than
typing latitudes and longitudes. Use the search box to locate a place then click once to
define the center of a geo_radius and click again to define the perimeter.

Figure 90: Geo Radius Configuration


To define an arbitrary shape with up to 32 points, use the geo_polygon operator. Click
the Geo Selection button and click for each point. Points may also be dragged to
adjust the polygon. In this example, the perimeter of London Heathrow Airport is
defined.

Figure 91: Geo Polygon Configuration


Using Versions
Every time the CSDL is saved, the editor saves it as a new version. Using the date drop-
down menu, it is possible to revert to previously saved versions of the code.

Figure 92: Code Versions
WARNING:
Different versions of the same filter have different stream hashes. Any
recording or consumption of a stream is tied to a particular stream hash.


Validating Filters
The validate button checks the CSDL syntax. Use this before saving.

Figure 93: Validating CSDL
If the code is free from errors, the following message is displayed:

Figure 94: Validation Pass
If there are problems with the CSDL, a descriptive message is displayed identifying the
line and character position where the problem was found, and the error. In this example,
geographic coordinates are missing from a geo_radius operator.

Figure 95: Validation Fail


Compiling Filters
When the code has passed validation, click Save & Close. This compiles the code and
saves it in the DataSift platform.

Figure 96: Saving and Closing
Every compiled filter is saved in the platform and referenced by a hash. To view the
hash for a filter, click Consume via API from the stream summary page.

Figure 97: Consume via API Button
Among other details, the Stream Hash is displayed. This is the unique hash which
references your filter. In later modules, this hash is used to reference the filter from the
API and to reference these filter conditions from within another filter.

Figure 98: Stream Hash
NOTE:
Remember to update any applications consuming a stream to the new hash
after each change. The old hash still references the previous version.

Previewing Streams
The stream preview is used to review and fine-tune filters by identifying irrelevant
interactions in the output stream. The filter conditions can then be modified to exclude
them.
From the Streams page, click on the stream name to open the summary page. Click Live
Preview.

Figure 99: Live Preview Button
A summary of enabled sources is shown. Check that only the required sources are in
the list. Enabling more than the required sources may unnecessarily increase the data
licensing cost.

Figure 100: Summary of Data Sources


Click the play button at the bottom of the screen to start previewing interactions
matched by the filter conditions.

Figure 101: Play Button
Interactions with their augmentation icons and information appear as they are
matched. To stop the preview, click the pause button.

Figure 102: Pause Button
NOTE:
The number of interactions sent per second is limited in preview mode.



7 Configuring Categorization
Categorization allows user-defined meta-data to be added to interactions based on
conditions.
While interactions may be filtered and sent to a destination based on one set of
conditions, another set of conditions can be used to assign tag strings and values to
those interactions. This extra information can greatly simplify the task of post-
processing interactions and opens the platform to machine learning where tags and
values are assigned based on programmable intelligence.
The assignment of simple tags has been present in the platform for some time. The
latest additions are tagging using namespaces, scoring and cascading tags. The new
additions are part of the DataSift VEDO feature.
Files which define tags and scores are also known as classifiers.


Configuring Tagging
Tags are user-defined strings of text added to output interactions based on the results
of conditions.
Writing Tag Conditions
Tags are defined using the tag keyword in CSDL. This is followed by one or more
conditions grouped in curly brackets. In this example, the tag "US" is being applied to
interactions which match the conditions.
tag "US" {
    twitter.user.location contains_any "usa, united states"
    or twitter.place.country contains_any "usa, united states"
    or twitter.place.country_code == "US"
}

The condition syntax used for tagging is exactly the same as the condition syntax used
for filtering but no filtering takes place. Multiple tags can be defined in a single CSDL file.
Filtering is a separate section of the CSDL file and must be identified using the return
keyword. In this example, the tag and return keywords are used in a single CSDL file.
tag "UK" {
    twitter.place.country_code == "UK"
    or twitter.user.location contains_any "uk, united kingdom"
}

tag "US" {
    twitter.place.country_code == "US"
    or twitter.user.location contains_any "usa, united states"
}

return {
    twitter.text contains "starbucks"
}

NOTE:
It is not possible to filter interactions based on tag values


Interpreting Tags in Output Fields
Interactions which match the filtering conditions are augmented with the appropriate
tags and the tag values are available in the output interaction.
In this example, the preview is used to view an interaction which matched a filtering
condition and tagging condition which applied the US tag.

Figure 103: Interaction Preview With Tags
Where multiple tags are added to the same interaction, they become elements of an
array named tags.
Tags in JSON
When consuming the filter output in a JSON format, the tags appear as attributes and
values.
" i nt er act i on" : {
" aut hor " : {
<output omitted>
" schema" : {
" ver si on" : 3
},
" sour ce" : " Twi t t er f or i Phone" ,
"tags": [
"US"
],
" t ype" : " t wi t t er "
},


Tags Mapped to Database Fields
A mapping file may be required to map interaction attributes to database fields. In the
case of MySQL databases, this is achieved with an .ini file written by hand or
constructed with the SQL Schema Mapper.
A list iterator is required to iterate multiple tag values.
REFERENCE:
http://dev.datasift.com/docs/push/connectors/ini/list-iterator

Configuring Tag Namespaces
The previous examples used a single string (e.g. UK or US) as a tag, which gives a flat
namespace. A tag tree may be better suited to more structured tagging requirements.
Writing Tag Tree Conditions
In this example, the assignment of US and UK tags is placed into a hierarchy with a
branch for country. The format uses periods between elements of the path.
tag.country "UK" {
    twitter.place.country_code == "UK"
    or twitter.user.location contains_any "uk, united kingdom"
}

tag.country "US" {
    twitter.place.country_code == "US"
    or twitter.user.location contains_any "usa, united states"
}

return {
    twitter.text contains "starbucks"
}



The following example assigns tags in a hierarchy to match device attributes:
tag
    device
        name
        os
        format
        manufacturer

tag.device.name "iPhone" {interaction.source contains "iPhone"}
tag.device.name "Android" {interaction.source contains_any "Android"}

tag.device.os "iOS" {interaction.source contains_any "iOS, iPhone, iPod, iPad"}
tag.device.os "Android" {interaction.source contains_any "Android"}

tag.device.format "Mobile" {interaction.source contains_any "iPhone, iPod, mobile web, phone, Blackberry"}
tag.device.format "Desktop" {interaction.source contains_any "web, Tweet Button, Twitter for Mac, Tweet for Web" and not interaction.source contains "mobile"}

tag.device.manufacturer "Apple" {interaction.source contains_any "iPhone, iPod, iPad, OS X, Mac"}
tag.device.manufacturer "HTC" {interaction.source contains "HTC"}

NOTE:
Values cannot be assigned at branches, only leaf nodes


Interpreting Tag Namespace in Output Fields
The tags are assigned using a tree structure in the output. The preview shows
tag_tree in the interaction augmentation, which is expanded to reveal the hierarchy
and values.

Figure 104: Tag Tree Preview Example
The JSON output shows a similar hierarchy in the interaction section.
" i nt er act i on" : {
" aut hor " : {
<output omitted>
" sour ce" : " Twi t t er f or Andr oi d" ,
"tag_tree": {
"device": {
"name": [
"Android"
],
"os": [
"Android"
]
}
},
<output omitted>



Configuring Scoring
Scoring extends tagging by associating a numeric value to an interaction which matches
a condition.
Writing Scoring Conditions
Scoring allows the numeric value to be incremented or decremented. This allows the
final value to be varied as multiple conditions are evaluated. In this example, the score
is assigned or incremented by different values depending on which group of conditions
is matched. The final score is an indication of the probability that the author was in the USA.
tag.country.US +10 {
    interaction.geo geo_polygon "48.80686346108517,-124.33593928813934:48.922499263758255, <output omitted>"
}
tag.country.US +5 {
    twitter.user.location contains_any "usa, united states"
    or twitter.place.country contains_any "usa, united states"
    or twitter.place.country_code == "US"
}
tag.country.US +2 {
    twitter.user.location contains_any "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida <output omitted>"
    or twitter.place.full_name contains_any "Alabama, Alaska, Arizona, <output omitted>"
    or twitter.user.location contains_any "Abilene, Akron, Albuquerque, Alexandria, Allentown, Amarillo, Anaheim, Anchorage"
    or twitter.place.full_name contains_any "Abilene, Akron, Albuquerque, Alexandria, Allentown, Amarillo, Anaheim, Anchorage"
}
tag.country.US +1 {
    twitter.user.time_zone contains_any "Alaska, Arizona, Atlantic Time (Canada), Central America, Central Time (US & Canada), Eastern Time (US & Canada), Mountain Time (US & Canada), Pacific Time (US & Canada)"
}
return {
    twitter.text contains "starbucks"
}



Interpreting Scores in Output Fields
In the example output, the interaction acquired a total score of 18.
10 for having geo parameters in the USA, another 5 for having a country code of "US" or
country of "United States", then a final 3 for matching a state or city name and the time
zone (not shown).

Figure 105: Example Tag Scoring in Preview


Decrementing & Incrementing Scores
Scores may be decremented as well as incremented. The following excerpt shows
tagging logic which assigns a value based on the likelihood that an interaction is a
customer service 'rave' or 'rant'.
tag.rave 0.793474 {interaction.content contains "great"}
tag.rave 0.611286 {interaction.content contains "thank you"}
tag.rave -0.001199 {interaction.content contains "cancelled?"}
tag.rave -0.001199 {interaction.content contains "been on hold"}
tag.rave -0.001199 {interaction.content contains "any way to"}
tag.rant 0.699983 {interaction.content contains "fail"}
tag.rant 0.529781 {interaction.content contains "never"}
tag.rant -0.001199 {interaction.content contains "you try to"}
tag.rant -0.001199 {interaction.content contains "you try"}
tag.rant -0.001199 {interaction.content contains "you for your"}

Machine Learning
Scoring opens the platform to the advantages of machine learning. A sample set of
interactions is classified or scored by a human and a machine determines what content
causes the human to classify or score in a particular way.
Classification conditions, tags and values are then generated by the machine to ensure
future interactions are classified in the same way a human would.
The previous example of rules scoring rants and raves was constructed by machine
learning.


Configuring Cascading
Tags defined in one CSDL file can be referenced from within other CSDL files; this is
known as cascading. This allows tags to be written once and used in multiple filters. The tags are
referenced by a hash which changes each time the tag file is modified.
Writing Re-usable Tag Definitions
A CSDL file containing just tags, which is to be referenced by another CSDL file, does not
need to have a return statement. In this example, the file only contains tag statements.
tag.country "UK" {
    twitter.place.country_code == "UK"
    or twitter.user.location contains_any "uk, united kingdom"
}

tag.country "US" {
    twitter.place.country_code == "US"
    or twitter.user.location contains_any "usa, united states"
}

When saved in the web application, the usual stream summary buttons are not
available because the file doesn't contain conditions to match interactions to be sent to
a destination.

Figure 106: Stream Summary for Tag File
The only actions available are to edit and share the CSDL or use the tags from within
another CSDL file.


Re-using Tag Definitions
The previously defined tags are used within another filter by using the tags statement
followed by the hash.
tags "fb0960ab37ef2ed4f42049a4a812d066"

return {
    interaction.content contains "starbucks"
}

NOTE:
The hash changes every time the tags file is modified
Multiple tag files may be referenced in a filter and tags can be cascaded multiple times.
In this example, two tag files are referenced in a filter file which also includes its own
tags.
tag.model "iPhone" {interaction.source contains "iPhone"}
tag.model "iPad" {interaction.source contains "iPad"}
tag.model "Blackberry" {interaction.source contains "Blackberry"}
Figure 107: Model Tags with hash 55a1eeab06858259a401bfc16b7771ce

tag.device.format "Mobile" {interaction.source contains_any "iPhone, Blackberry"}
tag.device.format "Tablet" {interaction.source contains "iPad"}
Figure 108: Device format tags with hash 2ff3a1745c92503d1a534228856ba4e4

tags "55a1eeab06858259a401bfc16b7771ce"
tags "2ff3a1745c92503d1a534228856ba4e4"

tag.device.model "Android" {interaction.source contains "Android"}

return {
    interaction.content contains "Venice" OR links.title contains "Venice"
}

Any tagging taking place in the file which contains the return block must come after
the imported tags.
WARNING:
Care should be taken to avoid overlapping tagging hierarchies.
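
For illustration only (a hypothetical sketch, not taken from the library), an overlap
occurs when an imported tags file and the local CSDL both write values into the same
branch of the namespace:
tags "55a1eeab06858259a401bfc16b7771ce"
tag.model "Android" {interaction.source contains "Android"}
Here the imported file (Figure 107) already defines tag.model values, so both sources
populate the same tag.model node; keeping each hierarchy distinct avoids this.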


Including Library Classifiers
A library of commonly-used classifiers is available for inclusion in filters. It includes
classifiers which just tag interactions or tag and score. Some are developed using
machine learning.
To locate the classifiers, click the Streams tab and select Library.

Figure 109: Library Page
Example 1 People vs. Organizations
People vs. Organizations is a machine-learned classifier that distinguishes between
authors who are individual people and those which are organizations.

Figure 110: People vs. Organizations Classifier
Clicking on the classifier opens a summary page which starts with a summary of the
classifier type, hash and which source it applies to. In this example, the classifier only
applies to Tweets. The Copy button copies the hash for pasting into a filter.

Figure 111: Example Classifier Summary


The classifier definition shows a summary of the tagging CSDL. The Copy to stream
button opens a new blank filter in the CSDL editor and adds the tags. The Copy button
places the tags in the copy buffer to be pasted into another filter.

Figure 112: Classifier Definition
At the bottom of the page are tabs showing examples of using this classifier in other
filters and the hierarchy of tags provided.

Figure 113: Tag Descriptions



Using Libraries
Locate the classifier hash from the classifier summary page, or copy the tags statement.

Figure 114: Tag Include Example
Edit a new or existing filter and paste the tags statement before any local tags
statements. Ensure the filtering conditions are enclosed in a return block.
tags "bfb5bc9a599aa04f91b8a1dc4ae44d45"

return {
    interaction.content contains_all "java, job"
}

The classifier is verified by checking preview output. This example shows tags and
scores are being applied.

Figure 115: Example of Scores Correctly Applied



Billing
Simple Tagging
Operators used inside a tag statement are normally charged at 10% of their usual DPU
cost.
For example, if the normal cost of a rule is 1 DPU, that same code inside a tag statement
costs 0.1 DPU.
Advanced Tagging (VEDO)
If you use tags with namespaces or scoring rules, or cascade tags from one filter to
another, the pricing is based on the combined cost of operators in the tagging logic and
in the filter definition.
The number of times each operator appears is counted and the overall cost calculated.
For example, if the contains operator is used nine times in tagging and used twice in
the filtering logic, the charge is for 11 uses of that operator.
Reviewing Processing Cost
The summary of any filter including tags displays the DPU cost.

Figure 116: DPU Cost of a Filter using VEDO


When VEDO features are used, the stream breakdown does not show DPU costs per
operator. It shows a summary of the tag statements showing which of these are within
the return block.

Figure 117: Stream Breakdown



8 Configuring Streams - API
Filters can be written programmatically through the Application Programming Interface
(API). The API is used for submitting, validating and compiling filters. This section
explains how to write and submit filters to the DataSift platform and preview the output
stream using the API.
Enabling Sources
Interactions are only sent to the filter if one or more sources are enabled. Sources cannot
be enabled or disabled programmatically; they must be configured from the web
application.
Log in to http://datasift.com and select the Data Sources tab. Ensure the required
sources are activated.

Figure 118: Enabling Data Sources
More information on data sources is available in the 'Configuring Sources' training
module.


Making API Calls
Calls to the DataSift Platform API are made using HTTPS requests. The calls are usually
made programmatically but for testing, it is possible to make the calls from the
command line with utilities such as curl.
There are two types of API available which use different URLs.
API Type       API Domain                    Description
REST API       https://api.datasift.com/v1/  Representational state transfer. Used for validating
                                             and compiling CSDL. Also used for configuring
                                             destinations, starting, stopping and querying jobs.
Streaming API  http://stream.datasift.com/   Used for real-time streaming of interaction streams
                                             which continue until stopped.

Validating Filters
The first example of using the API is to validate a simple piece of CSDL code. The CSDL
being used is shown below:
twitter.text contains "starbucks"
The CSDL can be passed as a query string in the URI. The following example shows the
validate endpoint being called with the CSDL as a query string:
https://api.datasift.com/v1/validate?csdl=twitter.text%20contains%20%22starbucks%22
Notice that spaces and double-quotes are URL-encoded, substituting ASCII percent-
encoded values of %20 and %22.


Using API Authentication
Every call to the REST API must be authenticated so user credentials may be added as
parameters in the query string. The credentials required are username and API key.
&username=<username>&api_key=<api_key>
The user name and API key are available from the web application dashboard.

Figure 119: Location of Username and API Key
The complete validation request with authentication is shown below:
https://api.datasift.com/v1/validate?csdl=twitter.text%20contains%20%22starbucks%22&username=paul&api_key=b366978a8ee4e36c9d2171ccee4e1234
This request could be made using the curl utility and a query string.
$ curl \
"https://api.datasift.com/v1/validate?csdl=twitter.text%20contains%2
0%22starbucks%22&username=paul&api_key=b366978a8ee4e36c9d2171ccee4e1
234"
The command is more readable if used with arguments rather than a query string.
$ curl -X POST https://api.datasift.com/v1/validate \
-d 'csdl=twitter.text contains "starbucks"' \
-H 'Authorization: paul:b366978a8ee4e36c9d2171ccee4e1234'


Validation Failure
All information returned from REST API calls is in JavaScript Object Notation (JSON). If
the CSDL is not correct, the validation fails and a JSON object is returned with an error
message.
{" er r or " : " We ar e unabl e t o par se t hi s st r eam. . .
<output omitted>
Validation Success
If the CSDL has passed validation, a JSON object is returned similar to the following
example.
{
" cr eat ed_at " : " 2013- 11- 07 16: 08: 03" ,
" dpu" : " 0. 1"
}
The timestamp is when the CSDL was first validated. The dpu value is the number of
Data Processing Unit hours used by the CSDL being validated if executed.
Compiling Filters
Filters are submitted to the DataSift platform for compilation. Compiled filters are
available to be run. Use the same syntax as the validate example, but call the compile
API endpoint.
https://api.datasift.com/v1/compile?csdl=twitter.text%20contains%20%22starbucks%22&username=paul&api_key=b366978a8ee4e36c9d2171ccee4e1234
In this example of making a compile request using curl, a JSON object is returned.
$ curl -X POST https://api.datasift.com/v1/compile \
-d 'csdl=twitter.text contains "starbucks"' \
-H 'Authorization: paul:b366978a8ee4e36c9d2171ccee4e1234'
{
" hash" : " bf d56316e55d8d480b89c7a653359e6d" ,
" cr eat ed_at " : " 2013- 11- 08 12: 37: 31" ,
" dpu" : " 0. 1"
}
The 'hash' returned in the JSON object is a unique reference for the compiled CSDL on
the platform.


Referencing Web Application Filters
There may be times when it is more convenient to use the CSDL editor in the web
application to create or modify filters. The CSDL editor provides color highlighting of
code, target and operator auto-completion and helpers for geo and list arguments.
After the filter has been written and saved in the web application, refer to the stream
summary page to see its hash. The filter is referenced by this hash in the API.

Figure 120: Hash Location in Stream Summary
NOTE:
Filters created in the API are not visible in the list of streams in the web
application.


Previewing Streams
There is no API endpoint equivalent to the live preview in the web application. To see a
few interactions from the filter, use the stream API endpoint. Specify the filter hash
returned from the compile call:
hash=bfd56316e55d8d480b89c7a653359e6d
Along with a count of the number of interactions to display:
count=2
Example curl command retrieving two interactions:
$ curl -X POST https://api.datasift.com/v1/stream \
-d 'hash=bfd56316e55d8d480b89c7a653359e6d' \
-d 'count=2' \
-H 'Authorization: paul:b366978a8ee4e36c9d2171ccee4e1234'



9 Configuring Stream Recording
So far, you have seen how to create multiple filter conditions and submit them for
compilation in the DataSift platform using either the web application or API. This section
explains how to record the stream for later analysis.
Data Destinations
Previewing the stream of interactions in the web interface is a good way to verify the
filter conditions are matching the desired interactions but the streamed data is not
stored anywhere.
To store the streamed data for later analysis, a destination is required. Destinations are
listed on the Data Destinations page of the web interface. Click on Browse
Destinations to see all available destinations.

Figure 121: Data Destinations Page
Notice that some destinations have a padlock icon; these deliver the data using more
secure protocols and authentication. Any destinations which are unavailable to the
account display an Inquire button to request access.
Click the + symbol to add one of these destinations. Configuration of each destination is
covered in separate training modules.
DataSift Storage
DataSift Storage is available to all accounts as a way of temporarily storing streams. The
data is held on the DataSift platform indefinitely. If no other destinations have been
configured, DataSift Storage is the default destination for recording tasks.


Starting Record Tasks
To start recording the stream from a filter, locate and click the filter name on the
Streams page.

Figure 122: Select Filter
The summary page opens with options to Use This Stream. To create a recording task
from this filter, click the Record Stream button.

Figure 123: Record Stream Button
In step one, the new recording task requires start and end times. Start times are
available up to two years in the future or the Start Now option starts the recording as
soon as the task is submitted.


End dates are also available up to two years in the future or use the Keep Running
option. Check the time zone has been correctly detected and give the task a name
before clicking Continue.

Figure 124: Recording Task Step 1
In step two, select which destination is used for the raw data stream. Use the settings
icon next to each destination to review the configuration. DataSift Storage is a free
temporary destination allowing download of the data or saving to an alternative
destination.

Figure 125: Recording Task Step 2
NOTE:
If no other destinations have been configured, DataSift Storage is used and the
preceding step is not displayed.


In step three, review the new recording summary and click Start Task.

Figure 126: Recording Task Confirmation


Viewing Record Tasks
To view queued, running and completed recording tasks, open the Tasks page. Tasks
are divided into Recordings and Historic Queries. The page opens showing all tasks,
including the previously configured recording.

Figure 127: Tasks Page
The percentage shown after Task running reflects where the recording is in the total
recording time. Click on the task name to view more information about the task. In this
example, 21 interactions have been recorded from the Starbucks filter and the stream
is still running.

Figure 128: Recording Task Details


Pausing Record Tasks
When recording to any destination except DataSift Storage, it is possible to pause data
delivery. This could be used when the destination is offline for maintenance.

Figure 129: Pause Delivery Button
Data is buffered for up to one hour. To resume delivery, click the Resume Delivery
button.

Figure 130: Resume Delivery Button
Stopping Record Tasks
To stop a recording task before the configured stop time, locate the task in the Tasks
page and click Stop Task. Data continues to be delivered until the buffer is empty or the
data in the buffer has expired.

Figure 131: Stop Task Button


Exporting Record Task Data
When a recording task using DataSift Storage is complete, the output interactions
must be exported. Exporting makes the raw data collected by the recording task
available in configurable formats and for specified time and date ranges.
Locate the completed recording in the Recordings or All Tasks tabs of the Tasks page.
Click Export Data.

Figure 132: Locate Export Data Button
Complete the Name field and select the Format from a choice of Comma-Separated
Values (CSV) and JavaScript Object Notation (JSON). Select start and end times and
choose the storage destination for the exported interactions.


By default, all output fields are exported. To make all fields available for individual
selection, click to clear the All check box. When the export configuration is complete,
click Create.

Figure 133: Export Dialogue
The task details are displayed with completion of the export shown as a percentage.

Figure 134: Export Completion


When the export is complete, a notification is emailed to the logged in user.

Figure 135: Export Complete Email Notification
Exporting Running Tasks
When using DataSift Storage as the destination for the raw data, it is possible to export
the stream to a chosen destination, in a desired format, while the task is running. Use
the Tasks page to locate the running record task and click Export Data.

Figure 136: Availability of Export on Running Tasks


Downloading Exported Data
The data is now available for download. All exports are listed on the Tasks page with
download links. Click a Download link to download the exported data.

Figure 137: Exported Data in Tasks Summary
Delete Exported Data
From the Tasks page, click on the task name to display the task summary. Click the
Delete link by each export to delete the exported data from DataSift Storage. This does
not delete downloaded copies.

Figure 138: Delete Links for Exported Data
NOTE:
Exports to DataSift Storage expire after two weeks


Deleting Record Tasks
To delete a recording task, open the Tasks page and click the Delete Task link.

Figure 139: Delete Task Link
Click OK to confirm the task deletion.

Figure 140: Delete Task Confirmation
WARNING:
Deleted data is not recoverable







10 Configuring Historic Previews
So far, you have seen how to create multiple filter conditions and submit them for
compilation in the DataSift platform using the web application. After completing this
module you will be able to query the historic archive for a preview of interactions which
match a filter.
Historic Archive
DataSift has an archive of interactions which is queried using the same filters used for
live interactions. The archive is many petabytes in size and growing rapidly.
Capturing of interactions to the archive started at different times for each source. Some
of the oldest interactions may not be fully augmented. The most complete archive is for
Twitter interactions.
Pre-December 2011
o Partially augmented Twitter
May 2012
o Facebook added
o More augmentations added
July 2012
o Fully augmented
August 2012
o Demographics added
November 2012
o Bitly & Newscred added
Aug-Dec 2013
o Tumblr & WordPress added

REFERENCE:
Full list of sources and earliest archived interactions
http://dev.datasift.com/docs/historics/archive/schema
Interactions are retrieved from the archive many times faster than real-time
interactions. Destinations must be configured to cope with a data stream which may be
100x faster than live streams.
Historic tasks are queued for execution. Notifications are sent when tasks are complete.
Historic Preview
Historic preview allows filters to be applied to a random 1% of the archive for any
duration between one hour and 30 days. One of several pre-configured reports is
created from the filtered interactions.
This is ideal for testing a filter before running a larger historic task on 100% of the
interactions. For analyzing trends, in most cases the results of the preview will be
sufficient.



Report Types
There are 5 built-in reports and the ability to configure a custom report.
1. Basic Preview
The basic preview report contains five charts.
Interaction Volume
1% of the volume of interactions over the
specified period of time.

Interaction Tags
When interactions have been tagged, the
volume of each tag applied is shown.

Interaction Content
A word cloud of words found in the
content of the interaction.


REFERENCE:
http://dev.datasift.com/docs/tags/applying-tags
Interaction Type
A pie chart showing the proportion of
interactions found from each source.

Language Tag
A pie chart of the languages found in
interactions matching the filter.




2 Twitter Preview
The Twitter preview only shows Twitter interactions from the filter but shows more
detailed Twitter information.
Twitter Text
A word cloud of the most commonly
occurring words in the Tweet text.

Twitter User Description
A word cloud built from the user's
description of themselves.

Language Tag
The proportions of each language
identified.

Twitter User Time Zone
The distribution of users' time zones in the
matching interactions.

Twitter ID
The volume of interactions in the preview.

Twitter User Language
The top languages found in Twitter text.


Twitter Hashtags
A pie chart of the most common hashtags
found in matching Tweets.






3 Links Preview
The links preview shows the volume of links along with Twitter card data and Facebook
OpenGraph data.
Links Code vs. Links Meta vs. Links Meta OpenGraph
The dark blue line is the number of links
found in all the interactions. When the links
are followed, the light blue line shows the
number of pages with Facebook
OpenGraph meta tags; the orange line
shows the number of pages with Twitter
Card meta tags.

Links Meta Language
When links are followed, this shows the value
of language meta tags used on the page.

Links Meta Keywords
The value of keyword meta tags used on
the linked pages.

Links Title
A word cloud showing the words found in
page titles of linked pages.





4 Natural Language Processing
Data of special value to natural language processing specialists, including sentiment for
the content and title plus entities found in the content and title.
Salience Content Sentiment
Salience is the positivity or negativity of
comments with neutral being a zero value.

Salience Title Sentiment
The positivity or negativity of interaction
titles.

Salience Content Topics
The products and topics being talked
about.

Salience Title Topics
Topics in the titles of interactions.




5 Demographics
This preview is only available on accounts with the Demographics augmentation
activated. The preview shows anonymized demographic information including gender,
age, location by city, state, and country, and profession and likes and interests.
Twitter Text
Word cloud of words found in the Tweets.

Demographic Age Range
A pie chart of the most common ages of
authors.

Twitter Hashtags
Top hashtags found in Tweets matched by
the filter.

Demographic First Language
A pie chart of the most common first
language among the authors.

Demographic Sex
The sex of the author.

Demographic Location (US State)
The top US state locations of the authors.


Demographic Location (City)
The top city locations of interaction
authors.

Demographic Professions
A pie chart of the top interaction author
professions.

Demographic Location (Country)
The top country locations of interaction
authors.

Demographic Likes & Interests
The most common likes and interests of
the interaction authors.




6 Custom Preview
A report is generated by selecting targets from a list. Targets with a numeric value allow
selection of Volume or Mean, Min, Max charts.

Targets with an array of string values allow selection of Volume or Top.

Targets which have free text values allow selection of Volume or Word Cloud.

Historic Preview Billing
Historic previews are billed at a fixed cost of 10 DPUs with an extra 2 DPUs added for
each day in the preview. The costs are the same for every report type, even custom
reports with lots of targets.
For example, a preview covering 5 days:
Preview cost:   10 DPU
Per-day cost:   5 x 2 DPU = 10 DPU
Total:          20 DPU
There are no data licensing costs for historic preview because the interactions are never
delivered, only charts and aggregated statistics about the interactions.
Previews which are cancelled before completion are not charged.


Configuring Historic Preview
To run a historic preview, first select the filter.

Figure 141: Select Filter
From the filter summary page, click Historic Preview.

Figure 142: Click Historic Preview
Only one preview is available for a filter at any one time. If a preview has already been
made, click the Overwrite Historic Preview button to allow a new preview to be
configured.

Figure 143: Overwrite an existing preview


Ensure the time zone is correct and select the start and end dates and times.

Figure 144: Select start and end times
The DPU cost is calculated automatically.

Figure 145: Processing Cost
Clicking "?" shows the calculation.

Figure 146: DPU Cost Calculation


Select one of the 6 report types.

Figure 147: Select Report Type
Wait for the report to build. This typically takes less than 5 minutes. The more days are
included in the preview, the longer the report takes to build.



Figure 148: Report Building
When the building is complete, the report is displayed, a notification is created and an
email is sent.



Figure 149: Report Complete Email


Downloading Reports
The whole report cannot be downloaded but each individual chart can.
Each chart has a download icon. Click this icon to open a download window.
Click Download to download a PNG image file.

Figure 150: Download PNG File








11 Configuring Historic Stream Recording
So far, you have seen how to create multiple filter conditions and submit them for
compilation in the DataSift platform using either the web application or API. This section
explains how to query the historic archive for all interactions which match a filter.
Historic Tasks
DataSift has an historic archive of interactions which is queried using the same filters
used for live interactions.
Capturing of interactions to the archive started at different times for each source. The
most complete archive is for Twitter interactions.
Pre-December 2011
o Partially augmented Twitter
May 2012
o Facebook added
o More augmentations added
July 2012
o Fully augmented
August 2012
o Demographics added
November 2012
o Bitly & Newscred added
Aug-Dec 2013
o Tumblr & WordPress added
REFERENCE:
Full list of sources and earliest archived interactions
http://dev.datasift.com/docs/historics/archive/schema
Historic tasks are queued and processed 100x faster than live streams.
Up to 31 days of archive can be queried in one task. To retrieve a longer duration, use
multiple historic tasks.


Data Destinations
To store the matching interactions for later analysis, a destination is required.
Destinations are listed on the Data Destinations page of the web interface. Click on
Browse Destinations to see all available destinations.

Figure 151: Data Destinations Page
Notice that some destinations have a padlock icon; these deliver the data using more
secure protocols and authentication. Any destinations which are unavailable to the
account display an Inquire button to request access.
Click the + symbol to add one of these destinations. Configuration of each destination is
covered in separate training modules.


Starting Historic Tasks
To start a historic task, locate and click the filter name on the Streams page.

Figure 152: Select Filter
The summary page opens with options to Use This Stream. To create a historic task
from this filter, click the Historic Query button.

Figure 153: Historic Query Button
In step one, the new historic task requires start and end times. Start times are available
back to the start of the archive in 2010, for a duration of up to 31 days after the start
date.
Check the time zone has been correctly detected and give the task a name before
clicking Continue.
Select the required data source and choose to query all of the archive (100%) or a 10%
sample of the interactions in the archive.



Figure 154: Historic Task Step 1
In step two, select which destination is used for the stream. Use the settings icon next to
each destination to review destination configuration.

Figure 155: Historic Task Step 2


In step three, review the New Historic Query summary and click Start Task.

Figure 156: Historic Task Confirmation


Viewing Historic Tasks
To view queued, running and completed historic tasks, open the Tasks page. Tasks are
divided into Recordings and Historic Queries. The page opens showing all tasks,
including the previously configured Historic Query.

Figure 157: Tasks Page
The task stays in a Task queued state until it reaches the top of the queue.
When running, the percentage shown after Task running reflects where the historic
query is in the total duration of the query. Click on the task name to view more
information about the task. Historic queries are split into chunks for processing and the
progress of each chunk can be seen.

Figure 158: Historic Task Details


Pausing Historic Tasks
It is possible to pause historic tasks in the queue and to pause data delivery. This could
be used when the destination is offline for maintenance.

Figure 159: Pause Button
To resume delivery, click the Resume Delivery button.

Figure 160: Resume Button
Stopping Historic Tasks
To stop a task while it is queued or during delivery, locate the task in the Tasks page
and click Stop Task.

Figure 161: Stop Task Button


Deleting Historic Tasks
When the historic task is complete, a notification is emailed to the logged in user.

Figure 162: Export Complete Email Notification
To delete a historic task, open the Tasks page and click the Delete Task link.

Figure 163: Delete Task Link
Click OK to confirm the export task deletion.

Figure 164: Delete Task Confirmation
WARNING:
Deleted tasks are not recoverable



12 Configuring Destinations
Amazon S3
Amazon Simple Storage Service (S3) is an online file storage web service. Amazon S3 is
available in the DataSift platform as a data destination. This section explains how to
configure Amazon S3 for the DataSift platform and how to configure the DataSift
platform to use Amazon S3.
Configuring Amazon S3 for DataSift
S3 is part of Amazon Web Services (AWS). To configure S3, either create a new AWS
account or log in using an existing AWS account.
Creating AWS Account
To create a new account, go to http://aws.amazon.com and click the Sign Up button.
Enter an email address and select I am a new user.

Figure 165: Creating AWS Account
Follow the instructions on the page to complete creation of a new account.
Click the Create account and continue button.
Credit card details are required to complete sign-up. At the time of writing, it is possible
to use S3 to store a small amount of data at no charge.


Add AWS services to the new account. The Add Now button adds all services, including
S3.

Figure 166: Add AWS Services
Complete identity verification by entering a phone number, waiting for a call and
entering the identification number shown on the page.

Figure 167: Identity Verification


Select an AWS support plan. It is possible to select a free plan.

Figure 168: Select Support Plan
When account configuration is complete, use the link to launch the AWS Management
Console.

Figure 169: Launch Console Link


Signing in to AWS Account
Go to http://aws.amazon.com and complete the authentication page.

Figure 170: AWS Sign In
When at the AWS console page, select the S3 service from the Storage and Content
Delivery group.

Figure 171: Selecting S3 Service


Creating Buckets
Storage is divided into buckets. Use the Create Bucket button to configure a bucket to
be used by the DataSift platform.

Figure 172: Creating a Bucket
Select a region in which the data is stored and provide a bucket name. Some regions
have restrictions on the characters which can be used in bucket names. Click Create to
continue.

Figure 173: Naming a Bucket


Creating Folders
The DataSift platform requires a bucket name and a folder name. Create a folder by
clicking Create Folder, entering a folder name and clicking the check mark.

Figure 174: Creating a Folder
NOTE:
If the folder specified in a DataSift subscription does not exist, the DataSift platform
will create it.
Locating Authorization Keys
The DataSift platform requires AWS security credentials to send data to S3. To locate
AWS security credentials, click on the username in the S3 console and click Security
Credentials.

Figure 175: Security Credentials Link in S3 Console


Either create and use AWS Identity and Access Management (IAM) users or click
Continue to Security Credentials. This example assumes IAM is not being used.

Figure 176: Continue to Security Credentials
NOTE:
AWS Best Practice is to use Identity and Access Management (IAM).
For simplicity, IAM is not used in this training module.
Expand the Access Keys heading and click Create New Access Key.

Figure 177: Create New Access Key
Copy the keys from the page or click Download Key File to download a CSV file of the
keys.

Figure 178: Copy or Download Access Key File (keys abbreviated in this example)


With the account configured with a bucket and folder, and a copy of the security
credentials available, the next step is to configure Amazon S3 as a destination in the
DataSift platform.
Configuring DataSift for Amazon S3
When Amazon S3 is configured in the web application, it remains available as a data
destination when creating or modifying tasks. When using the API to configure Amazon
as a destination, the S3 configuration information must be provided as each task is
defined.
Configuring Destination in Web Application
To configure Amazon S3 as a destination in the web application, open the Data
Destinations page, Browse Destinations and click the + symbol in the Amazon S3 box.

Figure 179: Adding Amazon S3 Destination


Complete the form with the information used when creating the Amazon S3 bucket and
folder.
WARNING:
If the combination of delivery frequency and max file size is not sufficient for
all the interactions in the stream, you may lose data. For example, the stream
may generate 15MB of data every 60 seconds but the frequency and max file
size of the destination may impose a limit of 10MB every 60 seconds.
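A quick way to sanity-check a configuration is to compare the stream's data rate with the maximum rate the destination accepts, which is the max file size divided by the delivery frequency. The following sketch illustrates the arithmetic only; the numbers mirror the example in the warning above.

# Maximum rate the destination accepts: max_size bytes every delivery_frequency seconds.
delivery_frequency = 60                   # seconds between deliveries
max_size = 10 * 1024 * 1024               # 10MB per delivery
stream_rate = 15 * 1024 * 1024 / 60.0     # observed stream rate: 15MB every 60 seconds

destination_rate = max_size / float(delivery_frequency)
if stream_rate > destination_rate:
    print("Warning: the stream produces data faster than the destination accepts; data may be lost.")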


Figure 180: Destination Details
Locate the Amazon S3 security credentials from the AWS console and enter them in the
Auth fields. Use the Test Connection button to ensure the destination is available and
the credentials are correct.


Click Create & Activate to continue.

Figure 181: Adding Keys and Testing Connection
The new destination appears under My Destinations on the Data Destinations page.
Amazon S3 can be configured multiple times with different buckets, folders and delivery
rates.

Figure 182: My Destinations
When creating new tasks in the web application, the destination is available to select.

Figure 183: New Destination Available to Tasks

Configuring Destination in API
Amazon S3 destinations configured in the web application cannot be referenced when
creating new tasks programmatically. The REST API is used to configure Amazon S3 as a
destination every time a new task is created. The following information is needed:
Name                 Description                                   Example
Bucket Name          The bucket name in S3                         siftersmithbucket1
Directory Name       The folder name in S3                         Starbucks1
Access Key           The S3 access key                             JHDSOWUREHDOSJDOA
Secret Key           The S3 secret key                             hy64fgHJ85T43erOP045Fcvfd
Delivery Frequency   How often to deliver data to S3               10 seconds
Max Delivery Size    How much to deliver each time (in bytes)      10485760
Username             DataSift platform username                    siftersmith
API Key              DataSift platform API key                     8a8ee4e36c9d2171ccee4eec55
                     (available in the web application)
Stream Hash          The hash for a compiled filter                bfd56316e55d8d480b89c7
                     (the filter can be compiled in the web
                     application or programmatically)

The push/validate endpoint of the REST API allows validation of S3 parameters. In
this example, the curl command is used to send the arguments in a query string:
$ curl \
"https://api.datasift.com/v1/push/validate?output_type=s3&output_par
ams.bucket=siftersmithbucket1&output_params.directory=Starbucks1&out
put_params.acl=private&output_params.auth.access_key=AKIAJILMG&outpu
t_params.auth.secret_key=H05GAejDyS&output_params.delivery_frequency
=10&output_params.max_size=10485760&username=siftersmith&api_key=b36
697c55"


It is more readable when using the -d option of the curl command:
$ curl -X POST 'https://api.datasift.com/v1/push/validate' \
-d 'output_type=s3' \
-d 'output_params.bucket=siftersmithbucket1' \
-d 'output_params.directory=interactions' \
-d 'output_params.acl=private' \
-d 'output_params.auth.access_key=AKIAJILMG' \
-d 'output_params.auth.secret_key=H05GAejDyS' \
-d 'output_params.delivery_frequency=10' \
-d 'output_params.max_size=10485760' \
-H 'Authorization: siftersmith:b36697c55'
If successful, the following JSON object is returned:
{
    "success": true,
    "message": "Validated successfully"
}
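The same validation can be performed from code rather than the shell. The following sketch assumes Python 3 and uses only the standard library, with the same placeholder credentials and parameters as the curl example above.

import json
import urllib.parse
import urllib.request

params = {
    'output_type': 's3',
    'output_params.bucket': 'siftersmithbucket1',
    'output_params.directory': 'interactions',
    'output_params.acl': 'private',
    'output_params.auth.access_key': 'AKIAJILMG',
    'output_params.auth.secret_key': 'H05GAejDyS',
    'output_params.delivery_frequency': '10',
    'output_params.max_size': '10485760',
}

request = urllib.request.Request(
    'https://api.datasift.com/v1/push/validate',
    data=urllib.parse.urlencode(params).encode(),
    headers={'Authorization': 'siftersmith:b36697c55'},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))   # expected to report success as in the JSON above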
Other training modules are available that show how to create, pause, resume, stop, and
monitor REST API delivery of streams to destinations.



13 Configuring Destinations
Google BigQuery
One of the strengths of the DataSift platform is the ease with which streams can be sent
to destinations. This section explains how to configure Google BigQuery to accept data
from the DataSift platform and configure tasks to send output streams to Google
BigQuery.
Google BigQuery can be used by analysis tools such as those provided by Tableau. This
destination is only available to Enterprise Edition customers.
REFERENCE:
DataSift documentation for Google BigQuery
http://dev.datasift.com/docs/push/connectors/bigquery
Google BigQuery
Google Cloud Platform is a Platform as a Service (PaaS). It comprises multiple services
for hosting applications, storing data, and computing. Google BigQuery is the Google
Cloud Platform service for querying billions of rows of data within tables.
Google BigQuery is a web service and REST API.
When using BigQuery as a data destination, batches of interactions are queued and
sent at 90 second intervals to BigQuery tables. Each interaction is a new table row.
The following example shows a Structured Query Language (SQL) query to a dataset
which could have billions of rows.

Figure 184: BigQuery Example


Google BigQuery Terminology
Google BigQuery uses the following terminology:
Projects
The top-level container for Google cloud services is a project. It stores information
about authentication and billing. Each project has a name, an ID, and a number.
Datasets
Datasets allow for organization and access control to multiple tables.
Tables
BigQuery data is held in tables along with a corresponding table schema to describe the
fields. When used with DataSift, the Schema is automatically derived from the JSON
objects in the stream.
Jobs
Jobs are actions which are executed by BigQuery. For example, to query a table for
particular records. Jobs are executed synchronously and may take a long time to
complete.


Configuring Google BigQuery for DataSift
Logging in to Google
Go to http://developers.google.com/console and log in using existing credentials or use
the Create an account link to create a new account.

Figure 185: Log in or Create an Account
To proceed, the Terms of Service box must be selected before clicking Continue.

Figure 186: Terms and Conditions


Creating a Project
Projects are the top-level container for all Google Cloud Platform services. The console
opens on a pre-configured default Project. To create a new project, click the API Project
link which returns to the top-level menu.

Figure 187: API Project Link
Click the CREATE PROJECT button.

Figure 188: Create Project


When using a new account or using an existing account with Google Cloud services for
the first time, SMS verification may be required. Click Continue and follow the
instructions to receive a code number on a mobile phone which must be entered into
the web page.

Figure 189: SMS Account Verification
When SMS verification is complete, enter a Project name. The Project IDs are
generated automatically. Click the refresh button in the Project ID field to create new
IDs. Click Create to continue.

Figure 190: Project name and ID


From the project overview page, make note of the Project ID and Project Number.
Both are required when configuring the DataSift platform to use BigQuery as a
destination.

Figure 191: New Project Overview
Enabling Billing
A small amount of data may be stored in a Cloud Datastore and used with BigQuery for
no charge. However, this is not available until billing information has been completed.
Complete billing information by clicking Settings.

Figure 192: Enabling Project Billing
NOTE:
Billing information must be entered for each project separately.


Complete the billing information form. When complete, the Enable billing button turns
into a Disable billing button.

Figure 193: Enabled Billing
Retrieving Authentication Details
The DataSift platform requires authentication credentials to send data to a BigQuery
Table. This is done with public and private keys. To generate the credentials and
public/private keys, click APIs & auth then Credentials from the project summary.
Then click the CREATE NEW CLIENT ID button.

Figure 194: Retrieving Credentials


Select Service account and click Create client ID.

Figure 195: Create Client ID
The new public and private keys are created. The private key is automatically
downloaded with a .p12 filename extension.

Figure 196: p12 File Downloaded
The private key password is also displayed. Make a note of the password as it is not
shown again. When ready, click Okay, got it.

Figure 197: Public/Private Key Generated


The new credentials are displayed. Make a note (or download the JSON object) of the
Client ID and Email address.

Figure 198: Service Account Credentials
The following credentials are required to configure BigQuery as a data destination:
Client ID
Email address
p12 key file
Configuring Datasets
From the project summary page, click the BigQuery link. A new browser window opens
for https://bigquery.cloud.google.com/.

Figure 199: Click BigQuery Link


The BigQuery console opens with example datasets. A new dataset must be configured
which is used by the DataSift platform when sending the stream of interactions. The
stream automatically creates a new table in the dataset.
Click the project menu and select Create new dataset.

Figure 200: Click Create new dataset
Enter a Dataset ID using only letters, numbers, and underscore.

Figure 201: Enter Dataset ID


Configuring DataSift Web Application for Google
BigQuery
The dataset and credential information created in the previous section are used to
configure the BigQuery destination in the DataSift platform. This section looks at the
web application configuration.
Configuring BigQuery Destination
Open the DataSift platform web application by logging in at datasift.com. From the Data
Destinations page, click the '+' symbol on the Google BigQuery tile.

Figure 202: Select Google BigQuery Destination
Complete the New Google BigQuery Destination form with the following information:
Label
o This name is only used in the web application. It is possible to define
multiple BigQuery destinations to different projects, datasets and tables.
Use this name to differentiate multiple instances.
Table ID
o The name of a table which is created automatically in the chosen dataset.
Whitespace is not permitted in table names.
Dataset ID
o The dataset in which to create a table. This must exist in BigQuery.
Project ID
o The project ID or project number.
Client ID
o The client ID generated by the creation of new service account
credentials.

Service account email address
o The email address generated by the creation of new service account
credentials.
P12 Key file
o The private key file which was automatically downloaded by creation of
new service account credentials.

Figure 203: New Google BigQuery Destination Form
The new destination appears in My Destinations. Notice how multiple instances of
BigQuery destinations are referenced by their label.

Figure 204: New Data Destination Saved

Configuring Stream Tasks Using BigQuery
From the Streams page, create or select a stream to use with the Google BigQuery
destination. A task is created using a live stream recording or historic data. In this
example, a live recording is used. Select Record Stream from the stream summary.

Figure 205: Click Record Stream
Configure start & end times and give the recording a name. Click Continue.

Figure 206: Select Task start and end times


Select a destination. The new BigQuery destination is available for selection. Click
Continue.

Figure 207: Select BigQuery Destination
Check the details and confirm by clicking Start Task.

Figure 208: Start Task


The stream sends interactions to Google BigQuery which automatically creates a table
using the name provided in the destination configuration. In this example, the
Starbucks_Table1 table has been created and the schema is displayed.
To view the table, expand the dataset and click on the table name.

Figure 209: New Table Schema
On the Table Details page, click the Details button to see the size of the table in bytes
and rows.

Figure 210: Table Details
If an end date and time were not specified, remember to Pause or Stop the recording
task in the DataSift web application when enough records have been received.

Configuring DataSift API for Google BigQuery
Google BigQuery is available as a data destination when using the push endpoints in
the REST API. When using Google BigQuery as a destination, the p12 key file must be
Base64 encoded and then URL encoded in order to remove URL-unsafe characters. This
can be done with the following p12tobigquery Python script:
import argparse
import base64
import urllib
import sys

# parse arguments
parser = argparse.ArgumentParser(description='Convert a .p12 file into a string a Google BigQuery Push connector can use.')
parser.add_argument('-f', required=True, action='store', dest='fin', help='the name of the .p12 file')
args = parser.parse_args()

with open(args.fin, 'r') as f:
    p12 = f.read()

sys.stdout.write(urllib.quote(base64.b64encode(p12)))
Figure 211: p12tobigquery Python Script
REFERENCE:
Link to Python script https://gist.github.com/paulsmart/8197435
In the following example, the push/create endpoint is used with an existing
stream hash. Command substitution is used to run the key file encoding script.
$ curl -X POST 'https://api.datasift.com/v1/push/create' \
-d 'name=googlebigquery' \
-d 'hash=c0e53815905869ac96aa80358' \
-d 'output_type=bigquery' \
-d 'output_params.project_id=617232322419' \
-d 'output_params.dataset_id=Dataset_Starbucks' \
-d 'output_params.table_id=Starbucks_Table1' \
-d 'output_params.auth.client_id=0000.apps.googleusercontent.com' \
-d 'output_params.auth.service_account=0000@gserviceaccount.com' \
-d "output_params.auth.key_file=`python ./p12tobigquery.py -f 7bdb72743e7da7605fef5c-privatekey.p12`" \
-H 'Authorization: datasift-user:your-datasift-api-key'

See push delivery training modules for more information on the push API endpoints.
REFERENCE:
REST API endpoints http://dev.datasift.com/docs/rest-api


Querying Data in BigQuery
From the Table Details page in the BigQuery console, click Query Table.

Figure 212: Click Query Table Button
An editor opens at the top of the page. BigQuery uses Structured Query Language (SQL)
to query the table. Use the SELECT clause to select fields from the interactions table.
The table name is already defined in the FROM clause. Use the LIMIT clause to limit the
number of records returned. Click RUN QUERY to run the query statement.

Figure 213: Example Query
NOTE:
Databases in BigQuery are append-only.

REFERENCE:
Query language documentation
https://developers.google.com/bigquery/query-reference


In this example, the first five records are returned showing only the author name. The
query took 1.2 seconds of elapsed time to complete.

Figure 214: Example Query Output
Use the Download as CSV and Save as Table buttons to download the output as a CSV
file or create a new table for further querying.
Query History
All queries are saved and available for editing or running by clicking Query History.

Figure 215: Recent Query History


Deleting Google Cloud Projects
Ensure the streaming task from DataSift is paused, stopped or deleted.

Figure 216: Paused Task
In the BigQuery project page, select Delete table from the table menu.

Figure 217: Delete Table


Select Delete dataset from the dataset menu.

Figure 218: Delete Dataset
Return to the project summary. From the settings page, click the Disable billing button.

Figure 219: Disable Billing


From the developer's console, select the project and click Delete. It is necessary to
confirm the deletion by typing the project ID when prompted.

Figure 220: Delete Project
The deletion is scheduled and may take several hours to complete.



14 Configuring Push Delivery API
Sending an output stream of filtered interactions to a destination is called a recording
task in the web interface. When using the REST API, it is called a push delivery
subscription. This section explains how to create push delivery subscriptions, monitor
them throughout their duration, and finally stop them.
Push Delivery Workflow
After authentication credentials have been located, the destination is validated and a
push delivery subscription created. While running, the subscription is monitored using
the log and get endpoints, or paused and resumed while data is buffered. Finally, it is
stopped and deleted.

Figure 221: Push Delivery Workflow
Locating API Credentials
Access to the REST API is controlled by Username and API Key. Both of these are
available in the web application.

Figure 222: Locating API Credentials (API Key truncated in example)


Validating Push Destinations
Every time a new push subscription is created in the REST API, the configuration details
of the destination are required. Using Amazon S3 as an example, the following
information is needed:
Name                 Description                                   Example
Bucket Name          The bucket name in S3                         siftersmithbucket1
Directory Name       The folder name in S3                         Starbucks1
Access Key           The S3 access key                             JHDSOWUREHDOSJDOA
Secret Key           The S3 secret key                             hy64fgHJ85T43erOP045Fcvfd
Delivery Frequency   How often to deliver data to S3               10 seconds
Max Delivery Size    How much to deliver each time                 10485760
Username             DataSift platform username                    siftersmith
API Key              DataSift platform API key                     8a8ee4e36c9d2171ccee4eec55
                     (available in the web application)

The validate endpoint of the REST API allows validation of the destination parameters.
In this example, the curl command is used to send the arguments in a query string:
$ curl \
"https://api.datasift.com/v1/push/validate?output_type=s3&output_par
ams.bucket=siftersmithbucket1&output_params.directory=Starbucks1&out
put_params.acl=private&output_params.auth.access_key=AKIAJILMG&outpu
t_params.auth.secret_key=H05GAejDyS&output_params.delivery_frequency
=10&output_params.max_size=10485760&username=siftersmith&api_key=b36
697c55"

It is more readable when using the -d option of the curl command:
$ curl -X POST 'https://api.datasift.com/v1/push/validate' \
-d 'output_type=s3' \
-d 'output_params.bucket=siftersmithbucket1' \
-d 'output_params.directory=interactions' \
-d 'output_params.acl=private' \
-d 'output_params.auth.access_key=AKIAJILMG' \
-d 'output_params.auth.secret_key=H05GAejDyS' \
-d 'output_params.delivery_frequency=10' \
-d 'output_params.max_size=10485760' \
-H 'Authorization: siftersmith:b36697c55'


If successful, the following JSON object is returned:
{
    "success": true,
    "message": "Validated successfully"
}

REFERENCE:
Documentation of the push/validate endpoint:
http://dev.datasift.com/docs/api/1/pushvalidate
Creating Push Subscriptions
The push/create endpoint uses the same syntax as the validate endpoint but also
requires a stream hash. The stream hash is taken from the web application or the JSON
response to an API compile request.

Name          Description                                       Example
Stream Hash   The hash for a compiled filter                    bfd56316e55d8d480b89c7
              (the filter is compiled in the web application
              or programmatically)

Locating Stream Hashes
In the web application go to the streams page and click Consume via API.

Figure 223: Consume via API Button


The data shown on the Consume this stream via API screen includes:
1. Stream Hash - A unique reference for the stream filter
2. API Key - An authentication key is currently required to make any API calls
3. Explore our Libraries - Examples of how to use the API in multiple languages

Figure 224: API Information in the Web Application
When compiling filters programmatically, the stream hash is returned in the JSON object
from a compile endpoint. In the following example, a simple CSDL filter is compiled
using curl and the hash is returned.
$ curl \
"https://api.datasift.com/v1/compile?csdl=twitter.text%20contains%20
%22Starbucks%22&username=siftersmith&api_key=b366978a8ee55"
{
    "hash": "bfd56316e55d8d480b89c73359e6d",
    "created_at": "2013-11-08 12:37:31",
    "dpu": "0.1"
}

REFERENCE:
Documentation of the compile endpoint:
http://dev.datasift.com/docs/api/1/compile


Writing push/create Calls
The push/create call is made using the destination information, API credentials, and
stream hash. The following example uses the curl utility to pass these arguments to
the API endpoint.
$ curl -X POST 'https://api.datasift.com/v1/push/create' \
-d 'name=myamazonsubscription' \
-d 'hash=bfd56316e55d8d480b89c73359e6d' \
-d 'output_type=s3' \
-d 'output_params.bucket=siftersmithbucket1' \
-d 'output_params.directory=interactions' \
-d 'output_params.acl=private' \
-d 'output_params.auth.access_key=AKIAJILMG' \
-d 'output_params.auth.secret_key=H05GAejDyS' \
-d 'output_params.delivery_frequency=60' \
-d 'output_params.max_size=10485760' \
-H 'Authorization: paulsmart:b36697c55'

If the push/create call is successful, then a JSON object similar to the one shown is
received:
{
    "id": "3a9d78afc28d8c71262d1d5f4e280c9f",
    "output_type": "s3",
    "name": "myamazonsubscription",
    "created_at": 1384183625,
    "hash": "bfd56316e55d8d480b89c73359e6d",
    "hash_type": "stream",
    "output_params": {
        "bucket": "siftersmithbucket1",
        "directory": "interactions",
        "acl": "private",
        "delivery_frequency": 60,
        "max_size": 10485760
    },
    "status": "active",
    "last_request": null,
    "last_success": null,
    "remaining_bytes": null,
    "lost_data": false,
    "start": 1384183625,
    "end": 0
}

REFERENCE:
Documentation of the push/create endpoint, including how to use scheduled
start: http://dev.datasift.com/docs/api/1/pushcreate

Checking Push Subscriptions
The push/get endpoint is used to check the status of a push delivery subscription. With
no arguments, the status of all subscriptions is returned in the JSON object.
$ curl -X POST 'https://api.datasift.com/v1/push/get' \
-H 'Authorization: siftersmith:b36697c55'
{
    "subscriptions": [
        {
            "id": "3a9d78afc28d8c71262d1d5f4e280c9f",
            "output_type": "s3",
            "name": "myamazonsubscription",
            "created_at": 1384183625,
            "user_id": 28619,
            "hash": "bfd56316e55d8d480b89c73359e6d",
            "hash_type": "stream",
            "output_params": {
                "bucket": "siftersmithbucket1",
                "directory": "interactions",
                "acl": "private",
                "delivery_frequency": 60,
                "max_size": 10485760
            },
            "status": "active",
            "last_request": 1384183657,
            "last_success": 1384183658,
            "remaining_bytes": null,
            "lost_data": false,
            "start": 1384183625,
            "end": null
        }
    ],
    "count": 1
}

NOTE:
The remaining_bytes field (number of bytes buffered ready to send) is always
null when calling push/get on all subscriptions. Specify an individual
subscription to see a remaining_bytes value.



Notice the following two attributes have changed since the push/create call:

Attribute       Description                                     Value when created   Value now
LAST_REQUEST    The time of the most recent push delivery      null                 1384183657
                request sent to the associated data
                destination. A Unix timestamp.
LAST_SUCCESS    The time of the most recent successful         null                 1384183658
                delivery. A Unix timestamp.

To get the status of a particular subscription, use the id argument:
$ curl -X POST 'https://api.datasift.com/v1/push/get' \
-d 'id=d468655cfe5f93741ddcd30bb309a8c7' \
-H 'Authorization: datasift-user:your-datasift-api-key'
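The same request can be made from Python; the following sketch also reads the remaining_bytes field described in the note above. It assumes Python 3, uses the placeholder id and credentials from the curl example, and the exact shape of the single-subscription response is an assumption based on the examples in this section.

import json
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({'id': 'd468655cfe5f93741ddcd30bb309a8c7'}).encode()
request = urllib.request.Request(
    'https://api.datasift.com/v1/push/get',
    data=data,
    headers={'Authorization': 'datasift-user:your-datasift-api-key'},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

# A request for a single id is assumed to return the subscription object itself;
# fall back to the "subscriptions" list form if present. remaining_bytes is only
# populated when an individual subscription is requested.
subscription = result.get('subscriptions', [result])[0]
print(subscription.get('status'), subscription.get('remaining_bytes'))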

REFERENCE:
Documentation of the push/get endpoint:
http://dev.datasift.com/docs/api/1/pushget
Retrieving Push Subscription Logs
Push subscriptions may return messages advising the delivery is complete or an error
has occurred. The push/log endpoint is the single most important endpoint to retrieve
these messages and troubleshoot subscription problems. The minimum information
required is just the API credentials. This retrieves log information for all subscriptions.
$ curl -X POST 'https://api.datasift.com/v1/push/log' \
-H 'Authorization: datasift-user:your-datasift-api-key'


Example Output:
{
    "success": true,
    "count": 4,
    "log_entries": [
        {
            "subscription_id": "4b7ce39a5292b96ccd98f69324b0dc99",
            "success": true,
            "request_time": 1344859261,
            "message": "The delivery has completed"
        },
        {
            "subscription_id": "13ba92f6784da5e60b82f532f43c7d17",
            "success": false,
            "request_time": 1344855061,
            "message": "The delivery was paused for too long"
        },
        {
            "subscription_id": "4e097f46ef0dd2e8e3f25f84dddda775",
            "success": false,
            "request_time": 1344630221,
            "message": "Stopped due to too many failed delivery attempts"
        },
        {
            "subscription_id": "4e097f46ef0dd2e8e3f25f84dddda775",
            "success": false,
            "request_time": 1344630221,
            "message": "The endpoint returned a 500 internal server error"
        }
    ]
}

Subscription status changes are also provided:
"message": "The status has changed to: finished"

The following message advises that data is not being consumed fast enough and is
being lost:
"message": "Some data remained in the queue for too long and so was expired (consumer too slow)"
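Because these log messages are the main way to spot delivery problems, it can be useful to poll push/log periodically and surface any failed entries. The following sketch is a minimal illustration in Python 3; the polling interval and the credential placeholder are arbitrary.

import json
import time
import urllib.request

def fetch_log_entries():
    # POST with an empty body returns log entries for all subscriptions.
    request = urllib.request.Request(
        'https://api.datasift.com/v1/push/log',
        data=b'',
        headers={'Authorization': 'datasift-user:your-datasift-api-key'},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get('log_entries', [])

while True:
    for entry in fetch_log_entries():
        if not entry.get('success'):
            print(entry['subscription_id'], entry['message'])
    time.sleep(300)   # check every five minutes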



Pausing Push Subscriptions
It is possible to pause a push subscription for up to one hour. This may be required if
the destination requires scheduled downtime. The data is buffered and delivered when
the push subscription is resumed. Use the push/pause endpoint with a subscription id
as the argument.
$ curl -X POST 'https://api.datasift.com/v1/push/pause' \
-d 'id=d468225f93741ddcd30bb309a8c7' \
-H 'Authorization: datasift-user:your-datasift-api-key'

The returned JSON object includes the following status:
"status": "paused",

REFERENCE:
Documentation on the push/pause endpoint:
http://dev.datasift.com/docs/api/1/pushpause
Resuming Push Subscriptions
Paused subscriptions should be resumed as quickly as possible to prevent buffered
data expiring and being lost. To resume a paused push subscription, use the
push/resume endpoint with the id of the push subscription as an argument.
$ curl -X POST 'https://api.datasift.com/v1/push/resume' \
-d 'id=d468225f93741ddcd30bb309a8c7' \
-H 'Authorization: datasift-user:your-datasift-api-key'

The returned JSON object includes the following status:
"status": "active",

REFERENCE:
Documentation on the push/resume endpoint:
http://dev.datasift.com/docs/api/1/pushresume


Stopping Push Subscriptions
To stop an active push subscription, use the push/stop endpoint with the id of the
push subscription as an argument.
$ curl -X POST 'https://api.datasift.com/v1/push/stop' \
-d 'id=d468225f93741ddcd30bb309a8c7' \
-H 'Authorization: datasift-user:your-datasift-api-key'

Stopped subscriptions cannot be restarted or resumed. The returned JSON object
includes the following status:
"status": "finishing",

REFERENCE:
Documentation on the push/stop endpoint:
http://dev.datasift.com/docs/api/1/pushstop
Deleting Push Subscriptions
Deleting is not necessary as finished push subscriptions are automatically deleted after
two weeks.
If immediate deletion is required, use the push/delete endpoint with the id of the
push subscription as an argument. Any undelivered data in the buffer will not be
delivered but will still be charged for. To avoid data loss, stop the push subscription and
use the push/get endpoint to ensure the status equals delivered.
$ curl -X POST 'https://api.datasift.com/v1/push/delete' \
-d 'id=d468225f93741ddcd30bb309a8c7' \
-H 'Authorization: datasift-user:your-datasift-api-key'

There is no JSON object returned. The response is an HTTP 204 status code. Deleted
subscriptions cannot be recovered and all logs are deleted immediately.
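One way to follow this advice from code is to stop the subscription, poll push/get until delivery has finished, and only then call push/delete. The following sketch assumes Python 3; the helper function is illustrative, and the status values checked for ('finished' and 'delivered') are based on the guidance in this section and may need adjusting against the live API.

import json
import time
import urllib.parse
import urllib.request

API = 'https://api.datasift.com/v1'
HEADERS = {'Authorization': 'datasift-user:your-datasift-api-key'}

def call(endpoint, **params):
    # POST to a push endpoint and return the decoded JSON response (empty dict for HTTP 204).
    data = urllib.parse.urlencode(params).encode()
    request = urllib.request.Request('%s/%s' % (API, endpoint), data=data, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        body = response.read()
        return json.loads(body) if body else {}

subscription_id = 'd468225f93741ddcd30bb309a8c7'
call('push/stop', id=subscription_id)

# Wait for buffered data to drain before deleting, so nothing undelivered is lost.
while call('push/get', id=subscription_id).get('status') not in ('finished', 'delivered'):
    time.sleep(10)

call('push/delete', id=subscription_id)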
REFERENCE:
Documentation on the push/delete endpoint:
http://dev.datasift.com/docs/api/1/pushdelete



15 Configuring Destinations
MySQL
One of the strengths of the DataSift platform is the ease with which streams can be sent
to destinations. This section explains how to configure a MySQL instance in Amazon
Web Services to accept data from the DataSift platform.
MySQL is used directly by some analysis tools such as those provided by Tableau. This
destination is only available to Enterprise Edition customers.
REFERENCE:
DataSift documentation for MySQL
http://dev.datasift.com/docs/push/connectors/mysql
Configuring Amazon RDS
Amazon RDS is a cloud relational database service which allows configuration of a
MySQL database. Open the Amazon AWS Console and click RDS.
https://console.aws.amazon.com/console/

Figure 225: Click RDS

Creating RDS Instance
RDS allows multiple database instances of different types. To create a new instance
using MySQL, click the Launch a DB Instance button.

Figure 226: Launch an RDS Instance


From the list of database types, click Select next to MySQL.

Figure 227: Select MySQL
Choose if this instance will be used for production purposes. This example assumes it
will not be used for production.

Figure 228: Select production purpose
Choose the appropriate parameters for your database. This example uses the latest DB
engine version on a micro instance class with 10GB of storage.


Enter the database instance identifier (the instance name), username, and password, and
make a note of these details as they will be required later.

Figure 229: DB Instance Details
The RDS instance is created with one database. More databases can be created within
the RDS instance later.
Enter a name for the first database and choose the port. 3306 is the default port and
would only need to be changed if this port was blocked by a firewall.
The VPC is a Virtual Private Cloud and can be thought of as the virtual network with a
firewall to which the database is connected. Multiple VPCs can be configured with
different firewall rules. In this example, the Default VPC is used for the subnet group
and security group.
NOTE:
Rules to allow database queries through the VPC firewall will be added later.


Ensure the instance is Publicly Accessible.

Figure 230: Additional Config
Select automatic backups and a maintenance window if required. In this example,
automated backups are disabled.

Figure 231: Select Management Options


Review the configuration and click Launch DB Instance.

Figure 232: Review


The following confirmation is displayed.

Figure 233: Creation Confirmation
Allow up to 10 minutes for the new RDS instance to be created. In this example, the
Status is still creating.

Figure 234: Instance List


Click on the arrow at the start of the row to see the database details.

Figure 235: Instance Details
Configuring Network Security
The default VPC security group is blocking access to the database instance. To add a
rule allowing access, click the name of the security group in the instance details page.

Figure 236: Security Group
The Security Groups page in the EC2 dashboard opens with a filter for the security
group being used for the database instance.


Click the Inbound tab, followed by the Edit button to create a new rule which allows
traffic from the Internet.

Figure 237: Security Group Inbound Rules
Click Add Rule

Figure 238: Add inbound rule
Select MySQL in the Type field and specify the permitted source of traffic for the database
instance. If access should be permitted from anywhere then select Anywhere
(0.0.0.0/0). Otherwise use CIDR notation to specify an individual IP address (e.g.
52.127.88.102/32) or network (e.g. 52.0.0.0/16).

Figure 239: Configure inbound rule

Verifying Network Security Configuration
Locate the database endpoint and port number from the instance details.

Figure 240: Instance Endpoint
Use the mysql command line utility to verify access is available. This example is being
run from an Amazon EC2 instance, attempting to connect to the Amazon RDS instance.
$ mysql -h ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com \
-P 3306 -u training --password=xxxxxxxx
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 15187
Server version: 5.6.13 MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> quit



Configuring Databases
Listing Databases
The new RDS instance has several pre-created databases, including the database
named during instance configuration.
Use show databases; to show all databases in an RDS instance. In this example,
MyDatabase is the database created by the Amazon RDS configuration wizard.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+

The only database available for use as a destination is MyDatabase. The others are
internal databases used by the database server.
Creating Databases
Use the create database command to create more databases. Each database can be
used as a separate DataSift destination.
mysql> create database banana;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| banana             |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+



Deleting Databases
The drop database command is used to delete a database. In this example, a
database called banana is deleted.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| banana             |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+
7 rows in set (0.00 sec)

mysql> drop database banana;
Query OK, 0 rows affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+



Configuring Database Tables
Before the DataSift platform sends interactions to the database, a table must be created
with the appropriate fields. The table and fields are created with SQL commands.
The interaction fields are mapped to the database fields with a mapping file (also known
as an .ini file).
The configuration of a table depends on the information which is stored. It is possible to
create a simple database table which only stores one attribute of each interaction, or a
more complex one which stores a large number of attributes.
Listing Database Tables
Use the show databases; command to list the databases, then the use command to
make one database the focus of the following commands.
Use the show tables; command to list all tables in the database which is currently in
use.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| MyDatabase         |
| apple              |
| innodb             |
| mysql              |
| performance_schema |
| tmp                |
+--------------------+

mysql> use MyDatabase;
Database changed

mysql> show tables;
+----------------------+
| Tables_in_MyDatabase |
+----------------------+
| hashtags             |
| mentions             |
| twitter              |
+----------------------+



Creating Tables CLI
The create table command is used to create new tables. The following example shows
how to create a minimal table with only two fields.
mysql> create database banana;
Query OK, 1 row affected (0.00 sec)

mysql> use banana
Database changed

mysql> CREATE TABLE twitter ( interaction_id VARCHAR(64) PRIMARY KEY, content TEXT DEFAULT NULL );
Query OK, 0 rows affected (0.08 sec)

mysql> show tables;
+------------------+
| Tables_in_banana |
+------------------+
| twitter          |
+------------------+
1 row in set (0.00 sec)

There are many examples of the SQL commands required to create tables for
interactions in a DataSift GitHub repository.
Link
https://github.com/datasift/push-schemas
An excerpt of an SQL schema which is used to create a table for Twitter data is shown
below.
Example files:
https://github.com/datasift/push-schemas/blob/master/sources/twitter/mysql.sql
CREATE TABLE twitter (
    interaction_id VARCHAR(64) PRIMARY KEY,
    interaction_type VARCHAR(64) NOT NULL,
    <output omitted>
    geo_latitude DOUBLE DEFAULT NULL,
    geo_longitude DOUBLE DEFAULT NULL,
    content TEXT DEFAULT NULL,
    twitter_lang VARCHAR(64) DEFAULT NULL,
    <output omitted>
)



Configuring Mapping
The DataSift platform sends interactions which contain many fields to the MySQL
database. It is necessary to map the interaction fields to the database fields using a
mapping file.
Example mapping files are available on DataSift's GitHub repository.
Example files:
https://github.com/datasift/push-schemas/blob/master/sources/twitter/mapping.ini
In this example, a Twitter mapping file is being used. An excerpt of the file is shown
below. It contains mappings of fields to interaction attributes and iterators which
process arrays of information from the interaction.
[twitter]
interaction_id = interaction.id
interaction_type = interaction.type
created_at = interaction.created_at (data_type: datetime, transform: datetime)
author_name = interaction.author.name
author_username = interaction.author.username
<output omitted>
twitter_id = twitter.id
geo_latitude = interaction.geo.latitude
geo_longitude = interaction.geo.longitude
content = interaction.content
<output omitted>
retweet_count = twitter.retweet.count

[hashtags :iter = list_iterator(interaction.hashtags)]
interaction_id = interaction.id
interaction_type = interaction.type
created_at = interaction.created_at (data_type: datetime, transform: datetime)
hashtag = :iter._value

[mentions :iter = list_iterator(interaction.mentions)]
interaction_id = interaction.id
interaction_type = interaction.type
created_at = interaction.created_at (data_type: datetime, transform: datetime, transform: datetime)
mention = :iter._value

Documentation:
INI Files http://dev.datasift.com/docs/push/connectors/ini
Download and save the example mapping file.

Encoding Mapping Files
The contents of the file can be used with the DataSift API in clear text, or base64
encoded.
When encoded, the mapping file has no whitespace which makes it easier to handle.
The following command is an example of encoding a mapping file and saving the result
to a new file.
$ base64 -w 0 twittermapping.ini > twittermapping64.ini

NOTE:
The encoded file does not end with a CR or LF



Configuring DataSift Destination (API)
The first stage in using MySQL as a destination is to create the subscription using the
push/create endpoint.
The push/create endpoint requires several parameters:

push/create parameter          Description/Example
name                           A user-specified name to identify the subscription
hash                           The hash of a filter
output_type                    mysql
output_params.host             The endpoint URL from the Amazon RDS instance
                               details page (excluding the port number)
output_params.port             The port number; usually 3306
output_params.schema           The clear text or base64 encoded mapping file
output_params.database         The database name. An RDS instance can have
                               multiple databases
output_params.auth.username    The username used when creating the RDS instance
output_params.auth.password    The password used when creating the RDS instance
authorization                  API authorization credentials

Creating a Subscription
The following example shows the push/create endpoint being used to create a
subscription with an Amazon RDS MySQL database destination.
$ curl -sX POST https://api.datasift.com/v1/push/create \
-d name=mysql \
-d hash=dbdf49e22102ed01e945f608ac05a57e \
-d output_type=mysql \
-d output_params.host=ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com \
-d output_params.port=3306 \
-d "output_params.schema=<base64-encoded mapping file, output omitted>" \
-d output_params.database=MyDatabase \
-d output_params.auth.username=training \
-d output_params.auth.password=xxxxxxxx \
-H 'Authorization: paulsmart:84768663b04a62ac7d4ac43'



If successful, a JSON object is returned with a status of active.
{
    "created_at": 1397565784,
    "end": 0,
    "hash": "dbdf49e22102ed01e945f608ac05a57e",
    "hash_type": "stream",
    "id": "4bb94a4b68a1ec3e9d63af61802201cb",
    "last_request": null,
    "last_success": null,
    "lost_data": false,
    "name": "mysql",
    "output_params": {
        "database": "MyDatabase",
        "host": "ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com",
        "port": 3306,
        "schema": "<base64-encoded mapping file, output omitted>"
    },
    "output_type": "mysql",
    "remaining_bytes": null,
    "start": 1397565784,
    "status": "active",
    "user_id": 28619
}


Documentation
http://dev.datasift.com/docs/push/connectors/mysql
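When the subscription is created from code rather than the shell, the mapping file can be read and base64 encoded inline before being posted. The following sketch assumes Python 3, a local twittermapping.ini file, and the same placeholder host, filter hash, and credentials as the example above.

import base64
import json
import urllib.parse
import urllib.request

with open('twittermapping.ini', 'rb') as f:
    schema = base64.b64encode(f.read()).decode()   # a single-line string with no trailing newline

params = {
    'name': 'mysql',
    'hash': 'dbdf49e22102ed01e945f608ac05a57e',
    'output_type': 'mysql',
    'output_params.host': 'ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com',
    'output_params.port': '3306',
    'output_params.schema': schema,
    'output_params.database': 'MyDatabase',
    'output_params.auth.username': 'training',
    'output_params.auth.password': 'xxxxxxxx',
}

request = urllib.request.Request(
    'https://api.datasift.com/v1/push/create',
    data=urllib.parse.urlencode(params).encode(),
    headers={'Authorization': 'paulsmart:84768663b04a62ac7d4ac43'},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response).get('status'))   # "active" on success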


Monitoring a Subscription
Use the push/get endpoint to monitor running or paused subscriptions. This example
displays the status of all active subscriptions. Use the id parameter to specify individual
subscriptions.
$ curl -sX POST https://api.datasift.com/v1/push/get \
-H 'Authorization: paulsmart:84768663b04a62ac7d4ac'
{
    "count": 1,
    "subscriptions": [
        {
            "created_at": 1397565784,
            "end": null,
            "hash": "dbdf49e22102ed01e945f608ac05a57e",
            "hash_type": "stream",
            "id": "4bb94a4b68a1ec3e9d63af61802201cb",
            "last_request": 1397566364,
            "last_success": 1397566360,
            "lost_data": false,
            "name": "mysql",
            "output_params": {
                "database": "MyDatabase",
                "host": "ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com",
                "port": 3306,
                "schema": "<base64-encoded mapping file, output omitted>"
            },
            "output_type": "mysql",
            "remaining_bytes": null,
            "start": 1397565784,
            "status": "active",
            "user_id": 28619
        }
    ]
}

Documentation
http://dev.datasift.com/docs/api/1/pushget


Stopping a Subscription
Use the push/stop endpoint with the subscription ID to stop a running subscription.
$ curl -sX POST https://api.datasift.com/v1/push/stop \
-d id=4bb94a4b68a1ec3e9d63af61802201cb \
-H 'Authorization: paulsmart:84768663b04a62ac7d4a'
{
    "created_at": 1397565784,
    "end": 1397566402,
    "hash": "dbdf49e22102ed01e945f608ac05a57e",
    "hash_type": "stream",
    "id": "4bb94a4b68a1ec3e9d63af61802201cb",
    "last_request": 1397566400,
    "last_success": 1397566401,
    "lost_data": false,
    "name": "mysql",
    "output_params": {
        "database": "MyDatabase",
        "host": "ds-instance1.cu1rwk85ieme.us-west-2.rds.amazonaws.com",
        "port": 3306,
        "schema": "<base64-encoded mapping file, output omitted>"
    },
    "output_type": "mysql",
    "remaining_bytes": null,
    "start": 1397565784,
    "status": "finishing",
    "user_id": 28619
}

Documentation:
http://dev.datasift.com/docs/api/1/pushstop


Deleting a Subscription
Use the push/delete endpoint with an ID to delete the specified subscription. No JSON
object is returned, just an HTTP 204 status code.
$ curl -sX POST -w 'Status: %{http_code}\n' \
https://api.datasift.com/v1/push/delete \
-d id=4bb94a4b68a1ec3e9d63af61802201cb \
-H 'Authorization: paulsmart:84768663b04a62ac7d4ac4350cb'

Status: 204

Documentation:
http://dev.datasift.com/docs/api/1/pushdelete



Configuring DataSift Destination (Web Application)
The details of the MySQL database are entered in the DataSift platform once and then
used many times.
Adding MySQL Destination
Login to the DataSift web application and open the Data Destinations page. From the
Browse Destinations page, click the + on the MySQL destination.

Figure 241: Add new MySQL Destination
Complete the form using the following information.
Field           Description
Label           This name is used to refer to this destination in the web application
Host            A URL for the database instance. In Amazon RDS, this is called the endpoint.
Port            Port number; 3306 is usual
Database Name   The database name (MyDatabase in previous examples)
Mappings        Select an interaction field to database field mapping file (usually ending .ini)
Username        The MySQL user with permissions to create
Password        Corresponding password


After entering the fields, click Test Connection to check that the DataSift platform can
connect to and authenticate with the database instance.

Figure 242: MySQL Destination setup form
When the connection test is OK, click Create & Activate.
The new destination is saved and available for all future recordings and historic tasks.

Figure 243: Listing of New MySQL Destination


16 Monitor Usage & Billing
It is important to be aware of how billing is designed and to monitor usage of the
platform in order to optimize costs and prevent unexpected bills. This module describes the
billing model, shows how to monitor usage in detail, and explains how to set usage limits.
Subscription Model
The subscription billing model is a monthly package that includes a pre-set number of
DPUs which are credited at the beginning of each month and consumed during the
course of that month.
Delivered interactions also have a data license cost. This is separate from DPU costs and is
not included in the DPU allowance.
Overage limit
In the event that more DPU Hours are consumed than are included in the monthly
package, the user is charged overage at the on-demand price for DPU Hours.
The user may define a limit to control potential overspend. The limit is applied to DPU
and Data License costs.
The user is notified at 80% and 100% of the limit. At 100%, processing of filters and
delivery of interactions is stopped.
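As a simple illustration of how these figures combine, the following sketch estimates a monthly overage from DPU usage, the package allowance, an on-demand DPU price, and the data license cost. All of the prices and usage figures are hypothetical; actual prices are shown in the web application and documentation.

# Hypothetical figures for illustration only.
dpu_allowance = 20000        # DPUs included in the monthly package
dpu_used = 20300             # DPUs consumed this month
dpu_price = 0.10             # hypothetical on-demand price per DPU
data_license_cost = 42.50    # hypothetical data license charges this month
overage_limit = 100.00       # user-defined limit

dpu_overage = max(0, dpu_used - dpu_allowance) * dpu_price
overage = dpu_overage + data_license_cost

print("Overage so far: $%.2f of the $%.2f limit" % (overage, overage_limit))
if overage >= overage_limit:
    print("100% reached: processing and delivery are stopped")
elif overage >= 0.8 * overage_limit:
    print("80% reached: the user is notified")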


Modifying Billing Account Details
Billing account details must be kept up-to-date. To modify the billing account details,
click Settings in the header of any web application page.

Figure 244: Settings Link
Click the Billing Account Details link and click Edit.

Figure 245: Edit Billing Account Details


Viewing Usage & Balance
To open the billing page, either click on the billing information in the header or click the
Billing tab.

Figure 246: Open Billing Page
Subscription Type
The top of the Billing Overview page includes the subscription type and credit. In this
example, it is a Professional level account which receives a 20,000 DPU allowance every
month.

Figure 247: Subscription Type
The 'Next billing cycle' date is when the allowance is reset to 20,000 DPU.
DPU Usage
The remaining DPU allowance in this billing period is displayed along with the DPU used
in the last 24 hours and the number of days until the next billing cycle.

Data Cost
For most sources and augmentations, there is a cost for each interaction delivered. The
Data Cost shows the running cost this billing cycle, the cost incurred in the last 24 hours,
and the number of days remaining in the billing cycle.


Setting Overage Limit
Overage is the sum of DPU cost over the allowance, and the data cost incurred in this
billing cycle. A limit can be set to prevent excessive bills.
Viewing Current Limit
The current limit is shown on the Billing Overview page. In this example it is set to
$100.

Figure 248: Viewing Cost Limit
At 80% of the overage limit, a notification is sent to the user. Tasks continue to be
processed.

Figure 249: Close to Limit Notification
At 100% of the limit another notification is sent along with an email. All running tasks
are stopped and new tasks cannot be started.

Figure 250: Reached Limit Notification


Setting New Limit
To set a new limit, click the Set Limit link on the Billing Overview page.

Figure 251: Set Limit Link
Alternatively, from the account settings, click on Billing Account Details. Select the new
limit or enter a custom limit, and click Save Limit.

Figure 252: Setting Cost Limit
Viewing Usage Statistics
Usage statistics are available for the past 7 days. From the Billing tab, click Usage
Statistics. Statistics are divided into Public Sources and Managed Sources. Use the
drop-down menu to select individual days from the past week.

Figure 253: Viewing Usage Statistics

Viewing Cost by DPU & Data License
The first chart shows the DPU and Data License costs. In this example, the DPU cost does not appear, most likely because usage is still covered by the monthly DPU allowance.

Figure 254: Cost in DPU
Viewing DPU Usage
This chart shows DPU usage and a running total for the past week. In this example, over
100,000 DPUs have been used.

Figure 255: DPU Usage
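The running total shown in the chart is simply the cumulative sum of each day's DPU usage. A minimal sketch with invented daily figures, chosen so that the 7-day total exceeds 100,000 as in the example above:

```python
# Invented daily DPU figures for illustration; the chart plots their running total.
from itertools import accumulate

daily_dpu = [12000, 15500, 9800, 14200, 16700, 18300, 15100]
running_total = list(accumulate(daily_dpu))

print(running_total)      # cumulative total after each day
print(running_total[-1])  # 101600 -- total DPUs used over the past 7 days
```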


Viewing Data Volume
This chart shows the volume of data received each day for the past 7 days. The total is
broken down into Sources and Augmentations.

Figure 256: Data Volume
Viewing Connected Hours
The number of connected hours is displayed, using different colors for the types of connection. In this example, all connections were streaming. This number may exceed 24 hours per day if multiple simultaneous connections are made.

Figure 257: Connected Hours


Viewing Historic Hours
The number of hours covered by historic queries is shown in the final chart.

Figure 258: Historic Hours
Viewing Managed Source Statistics
The DPU usage for each type of managed source is reported in a dedicated chart. Click
on Managed Sources to reveal this chart.

Figure 259: Managed Source Usage Statistics


Viewing Current Streams
To see an overview of currently running streams, go to the Billing page and click
Currently Used Streams. This example shows three streams with their type, start time
and DPU.

Figure 260: Viewing Current Streams
Click on a stream name to open the task summary.
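The DPU cost of an individual stream can also be queried through the API. The sketch below assumes a /v1/dpu endpoint that takes a stream hash, as described in the API documentation; the hash and credentials shown are placeholders.

```python
# Minimal sketch: querying the DPU breakdown of a single stream.
# The /dpu endpoint and its 'hash' parameter are assumptions based on the
# API documentation -- check the current reference before use.
import requests

API_ROOT = "https://api.datasift.com/v1"
headers = {"Authorization": "your_username:your_api_key"}  # placeholder credentials

stream_hash = "2459b03a13577579bca76471778a5c3d"  # placeholder stream hash
response = requests.get(f"{API_ROOT}/dpu", headers=headers, params={"hash": stream_hash})
print(response.json())  # per-operator DPU breakdown and the total for the stream
```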
Locating Pricing Information
Data License prices are available in the web application. From the Data Sources page,
click on any source or augmentation to see the price per 1,000 interactions.

Figure 261: Data License Pricing
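Because Data License prices are quoted per 1,000 delivered interactions, the cost of a delivery is straightforward to estimate. The price and volume below are invented for illustration and are not actual DataSift prices.

```python
# Hypothetical data license calculation -- the price here is illustrative only.
price_per_1000 = 0.10             # assumed license price per 1,000 interactions (USD)
interactions_delivered = 250_000  # interactions delivered this billing cycle

data_license_cost = interactions_delivered / 1000 * price_per_1000
print(f"Estimated data license cost: ${data_license_cost:.2f}")  # $25.00
```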
Data Processing prices are detailed in the documentation, along with a description of billing and an FAQ.
LINK:
Understanding Billing: http://dev.datasift.com/docs/billing
Billing FAQ: http://dev.datasift.com/docs/getting-started/billingfaq



17 Locating Help
This module reviews all the resources available to you as you become more experienced with the DataSift platform.
Locating Documentation
Hundreds of pages of documentation are available on the developer web site, along with a user guide.
Documentation includes:
Targets for all sources and augmentations
API endpoints
CSDL programming
Destinations
Billing
Historics
FAQs
Best Practice

LINK:
http://dev.datasift.com/docs





Figure 262: Documentation Pages

User Guide
Documentation of the platform and web application is available as a PDF user guide.


Figure 263: User Guide

Viewing Platform Status
The status.datasift.com site displays live status and latency information for all parts of the DataSift platform.

Figure 264: Status Web Site
Click on individual issues to see the status.

Figure 265: Individual Issue Status


Viewing Known Issues
The DataSift issue tracking system (dev.datasift.com/issues) is used to report and give
updates on bugs found in the platform. Before reporting any new issue, be sure to
check the issues list.
Existing issues are automatically linked to the documentation pages and related
discussion threads.

Figure 266: Issue Tracking
Forum Discussions
The discussion forums (dev.datasift.com/discussions) are an ideal place to ask questions
about any part of the platform and receive advice from more experienced users.

Figure 267: Discussion Forum

Subscribing to Updates
Platform status and known-issue updates are provided on Twitter. Follow these accounts to receive updates:
@datasiftapi
@datasiftdev
The developer blogs contain useful information on how to achieve particular goals. Use the following link for an RSS feed:
http://dev.datasift.com/blog/feed

Attending Workshops
Remote workshops covering new features are delivered regularly. Check datasift.com/learning for the latest schedule.

Figure 268: Workshop Schedule



Submitting Support Requests
The support site (support.datasift.com) is where support requests are raised. Be sure to
check known issues and platform status before submitting a request.
Solutions to requests are often available in the forums or documentation.
When raising a support request, pay careful attention to selecting the correct priority level:
Urgent
o A major component of DataSift is unavailable, and status.datasift.com is not reporting any irregularities with the service in question
High
o A major component of DataSift is not behaving normally. This issue is causing considerable business impact
Normal
o Minor issues using DataSift. Workarounds are generally available. This issue is causing minor business impact
Low
o No immediate business impact
The support site also allows historical requests to be viewed.

Figure 269: View Historical Requests


Viewing Screencasts
DataSift has a YouTube channel with playlists of training screencasts and workshop
recordings.
Training Playlist
o https://www.youtube.com/playlist?list=PLzM6Pg1YzDVoInlRyMiw6AQHisWwiDtZQ
Workshop Recordings
o https://www.youtube.com/playlist?list=PLzM6Pg1YzDVqTcakqPqFrMBKsje0R9cSO

Figure 270: DataSift YouTube Channel
Attending Training
Further instructor-led learning is available, covering the platform fundamentals using the API and advanced filtering. The two courses and their modules are listed below.
DataSift Fundamentals (API)
o DataSift Overview
o Configure Sources & Augmentations
o Write Simple CSDL Filters
o Analyze Interactions
o Configure Destinations
o Configure Live Push Subscriptions
o Configure Live Pull Subscriptions
o Configure Historic Previews
o Configure Historic Recordings
o Billing and Usage Monitoring
o Locating Help
Advanced Filtering
o Case Sensitivity and Tokenization
o Importing Filters
o Tagging
o Cascading Tags
o Scoring
o Using Classifier Libraries
o Wildcards and Regular Expressions
o CSDL Optimization


