
Running head: WEB MINING 1

WEB MINING

Hal Hagood

u08a1

(Instructions) Use the Kaggle ISIS tweets data set to identify patterns and types of communication that ISIS uses on Twitter. This data set is downloadable from the site linked in Resources.

Using this data, conduct a cluster analysis to categorize tweets into common clusters, and use a decision tree to explain and interpret each cluster. Include any visualizations, screenshots, and explanations needed to make and support your created clusters and their interpretation.

Once the clusters are created, summarize the data by user, and analyze the allocation of the top 10 users to each cluster. What type of tweets do they commonly post? What percent of the time does each user post each tweet category?

Include any additional analysis that you completed and found useful for understanding what types of tweets these ISIS-related or pro-ISIS users post.

“We scraped over 17,000 tweets from 100+ pro-ISIS fanboys from all over the world since the November 2015 Paris Attacks. We are working with content producers and influencers to develop effective counter-messaging measures against violent extremists at home and abroad. In order to maximize our impact, we need assistance in quickly analyzing message frames” (Kaggle, 2017).

“The dataset includes the following:

Name

Username

Description

Location

Number of followers at the time the tweet was downloaded

Number of statuses by the user when the tweet was downloaded

Date and timestamp of the tweet

The tweet itself” (Kaggle, 2017)



Conducts exploratory analysis of text documents on the Internet to extract key topics or themes
and provides supporting examples

The methodology for this report concerns the collection and analysis of data from Internet sources. With suitable tools and algorithms, data can be gathered from many different sites. The data set analyzed here consists of tweets related to ISIS, and the analysis was performed in SAS Enterprise Miner.

The csv file and the xlsx file were each imported into SAS Enterprise Miner with the File Import node, and two separate analyses were run for comparison. All settings were left at their defaults except the maximum number of columns to import, which was set to 50000. Next came the Text Parsing node, which analyzes the imported documents and breaks them down into usable terms, for example by grouping synonyms and reducing words to their core forms. The Text Filter node followed; as the name suggests, it can be used to remove unwanted or unneeded terms. For this exercise, spell checking was set to Yes and the terms to view and terms to display were set to “All”; everything else remained at its default.

The Text Cluster node was added next. With it, SAS Enterprise Miner groups the documents into clusters and reports the descriptive terms for each cluster. The Text Cluster node was connected to the Text Filter node with all settings left at their defaults, which yielded 22 clusters from the ISIS csv data set and 21 from the xlsx file.
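To make the idea of document clustering concrete, the sketch below groups a few short texts with a plain k-means-style loop over raw term counts, using cosine similarity. This is a swapped-in technique chosen for brevity, not a reproduction of the Text Cluster node, whose internals are not described in this report. The seed choice, function names, and sample texts are assumptions.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as a bag of lowercase whitespace-separated terms."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(vectors, seeds, iterations=10):
    """Assign each vector to its most similar centroid, then rebuild centroids."""
    centroids = [Counter(vectors[i]) for i in seeds]
    labels = []
    for _ in range(iterations):
        labels = [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        new_centroids = []
        for c in range(len(centroids)):
            # Summed counts stand in for the mean; cosine ignores vector length.
            total = Counter()
            for v, lab in zip(vectors, labels):
                if lab == c:
                    total.update(v)
            new_centroids.append(total if total else centroids[c])
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return labels

# Made-up sample texts, not taken from the Kaggle data set.
docs = [
    "airstrike hits northern city",
    "prayer for our brothers",
    "airstrike near northern countryside",
    "brothers join the prayer",
]
labels = cluster([vectorize(d) for d in docs], seeds=[0, 1])
print(labels)
```

With real tweets, the number of clusters and the starting seeds strongly affect the result, which is one reason the csv and xlsx runs in this report can produce different cluster counts.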

The Text Topic node came next. This node enables exploration of the document collection by automatically associating terms and documents according to both discovered and user-defined topics, which correspond to main themes or ideas. The number of single-term topics was set to 5 and the number of multi-term topics to 20. These topics are explored in more detail later in this paper.

The Decision Tree node was the final modeling node added. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments. These segments form an inverted tree that originates with a root node at the top. The object of analysis is reflected in this root node as a simple, one-dimensional display in the decision tree interface. This was done for both the csv and xlsx data sets. A Link Analysis node was added after each Decision Tree to provide deeper analysis and visualization.
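The splitting idea can be sketched in a few lines of Python. The example searches for the single term whose presence or absence best separates documents by cluster label, scoring each candidate by weighted Gini impurity. Gini is used here only as one common criterion; the criterion Enterprise Miner applied in this report is not stated, and the function names and sample rows are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure segment)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (term, weighted impurity) for the best term-presence split.

    rows is a list of term sets, one per document; ties keep the earlier term.
    """
    best = None
    for f in features:
        left = [lab for row, lab in zip(rows, labels) if f in row]
        right = [lab for row, lab in zip(rows, labels) if f not in row]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (f, score)
    return best

# Made-up documents (as term sets) with hypothetical cluster labels.
rows = [
    {"airstrike", "northern"},
    {"airstrike", "city"},
    {"prayer", "brothers"},
    {"prayer", "city"},
]
cluster_ids = [0, 0, 1, 1]
split = best_split(rows, cluster_ids, ["city", "airstrike", "prayer"])
print(split)
```

A full tree repeats this search inside each resulting segment until the segments are pure or too small to split further.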

Tweets.csv

Text Parsing

Text Filter

Terms

Text Cluster

Clusters

Text Topic

Decision Tree

(Tweets and subject matter)

TextCluster_SVD16

Link Analysis

(All)

(Username)

Tweets.xlsx

Text Parsing

Text Filter

Terms

Text Cluster

Clusters

Text Topic

Decision Tree

(Users)

Followers

Link Analysis

(Location)

Summarizes the findings of an exploratory analysis of text documents on the Internet to extract key topics
or themes and provides supporting examples.

Concept linking (ISIS)



The top five topics are: aleppo, rebel, rt, northern, countryside (100 terms, 671 documents); amaqagency, fighter, islamicstate, hit Iraq (95 terms, 412 documents); russia, syria, palmyra, airstrike, rt (93 terms, 640 documents); Allah, brother, protect, accept, rt (92 terms, 591 documents); and rt https, nidalgazaui, sparksofirhabi3, sparksofirhabi5 (87 terms, 2865 documents). The remaining topics, with their term and document counts, can be seen in the screenshot above.

The data set includes the following fields: name, username, description, location, number of followers at the time the tweet was downloaded, number of statuses by the user when the tweet was downloaded, date and timestamp of the tweet, and the tweet itself. A clustering cannot be judged right or wrong, only by whether it makes sense; cluster analysis is a combination of art and science.
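Once every tweet carries a cluster label, the per-user allocation asked for in the assignment reduces to counting. The sketch below is one way to tabulate, for the most active users, the percentage of their tweets that falls in each cluster. The record layout, function name, and sample values are assumptions; they are not output from Enterprise Miner.

```python
from collections import Counter, defaultdict

def cluster_share_by_user(records, top_n=10):
    """Return {user: {cluster: percent}} for the top_n most active users.

    records is an iterable of (username, cluster_id) pairs, one per tweet.
    """
    per_user = defaultdict(Counter)
    for user, cluster_id in records:
        per_user[user][cluster_id] += 1
    # Rank users by tweet count, breaking ties alphabetically.
    top = sorted(per_user, key=lambda u: (-sum(per_user[u].values()), u))[:top_n]
    return {
        u: {c: 100.0 * n / sum(per_user[u].values())
            for c, n in per_user[u].items()}
        for u in top
    }

# Made-up (username, cluster) pairs, not taken from the Kaggle data set.
records = [
    ("user_a", 1), ("user_a", 1), ("user_a", 2), ("user_a", 3),
    ("user_b", 2), ("user_b", 2),
    ("user_c", 1),
]
shares = cluster_share_by_user(records, top_n=2)
print(shares)
```

Each user's percentages sum to 100, so the rows can be compared directly even when users differ widely in how often they post.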

Network Cluster Analysis shows who the major users in the pro-ISIS Twitter network are. A Keyword Analysis shows which keywords from the name, username, description, location, and tweet fields were most commonly used by so-called ISIS fanboys. Examples include baqiyah, dabiq, wilayat, and amaq. Data Categorization of Links shows which websites pro-ISIS followers link to. The categories include Mainstream Media, Alt Media, Jihadist Websites, Image Upload, and Video Upload.
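Link categorization of this kind can be sketched as a lookup from each link's host name to a category. The mapping below uses placeholder example domains and is purely hypothetical; how the links were actually classified is not described in this report. Links in real tweets are often shortened, so they would need to be resolved to their final address before a lookup like this is meaningful.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical host-to-category mapping built from placeholder domains.
DOMAIN_CATEGORY = {
    "news.example.com": "Mainstream Media",
    "images.example.org": "Image Upload",
    "video.example.net": "Video Upload",
}

def categorize_links(tweets):
    """Count the links found in tweets by the category of their host name."""
    counts = Counter()
    for tweet in tweets:
        for url in re.findall(r"https?://\S+", tweet):
            host = urlparse(url).netloc.lower()
            counts[DOMAIN_CATEGORY.get(host, "Uncategorized")] += 1
    return counts

# Made-up sample texts, not taken from the Kaggle data set.
tweets = [
    "Report: https://news.example.com/story1",
    "pic https://images.example.org/a.png and https://news.example.com/story2",
    "see https://unknown.example/x",
]
link_counts = categorize_links(tweets)
print(link_counts.most_common())
```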

Sentiment Analysis shows which clergy pro-ISIS fanboys quote the most and which they dislike the most. Examples of clergy they like the most: “Anwar Awlaki”, “Ahmad Jibril”, “Ibn Taymiyyah”, “Abdul Wahhab”. Examples of clergy they dislike the most: “Hamza Yusuf”, “Suhaib Webb”, “Yaser Qadhi”, “Nouman Ali Khan”, “Yaqoubi”.



INITIAL ANALYSIS AND FILTRATION

(Top 10)

(Kaggle, 2017)

References

Kaggle. (2017). ISIS tweet network analysis. Retrieved August 12, 2017, from https://www.kaggle.com/ggospodinov/tweet-analysis2

SAS. (2017). Text mining and analysis: Practical methods, examples, and case studies using SAS. Chapter 6 - Clustering and topic extraction. Retrieved August 27, 2017, from http://viewer.books24x7.com/assetviewer.aspx?bookid=59026&chunkid=342485391&resume=yes&resumebookmarkid=7038ed2b-478b-e711-a9c3-00505686029c#
