April 2, 2016
Title Consistent Clustering for Text Data
Version 0.1.1
Maintainer Tyler Rinker <tyler.rinker@gmail.com>
Description Optimized, consistent tools for clustering text data.
Depends R (>= 3.2.3)
Imports dplyr, fastcluster, gofastr, graphics, Matrix, mclust,
methods, rNMF, skmeans, slam, stats, termco, textshape, tm
Suggests testthat
Date 2016-04-02
License GPL-2
LazyData TRUE
Roxygen list(wrap = FALSE)
RoxygenNote 5.0.1
Author Tyler Rinker [aut, cre]
RemoteType local
RemoteUrl C:\Users\Tyler\GitHub\clustext
RemoteUsername trinker
RemoteRepo clustext
R topics documented:
approx_k
assignments
assign_cluster
as_topic
categorize
clustext
compare
cosine_distance
data_store
get_documents
get_dtm
get_removed
get_terms
get_text
hierarchical_cluster
jaccard_distance
kmeans_cluster
nmf_cluster
plot.hierarchical_cluster
presidential_debates_2012
print.assign_cluster
print.as_topic
print.compare
print.data_store
print.get_documents
print.get_terms
skmeans_cluster
summary.assign_cluster
write_cluster_text
Index
approx_k
Description
Can & Ozkarahan (1990) formula for approximating the number of clusters for a text matrix:
(m * n)/t, where m and n are the dimensions of the matrix (rows and columns) and t is the
number of non-zero elements in the matrix.
Usage
approx_k(x, verbose = TRUE)
## S3 method for class 'TermDocumentMatrix'
approx_k(x, verbose = TRUE)
## S3 method for class 'DocumentTermMatrix'
approx_k(x, verbose = TRUE)
Arguments
x
A matrix.
verbose
Value
Returns an integer.
References
Can, F., Ozkarahan, E. A. (1990). Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems 15 (4): 483.
doi:10.1145/99935.99938.
Examples
library(gofastr)
library(dplyr)
presidential_debates_2012 %>%
with(q_dtm(dialogue)) %>%
approx_k()
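The formula in the Description can be tried by hand on a small dense matrix. The following is a minimal base R sketch, not the package's implementation: `approx_k_sketch` is a hypothetical helper, and rounding the ratio to a whole number is an assumption made here so that the result is an integer.

```r
# Hypothetical helper (not part of clustext): approximate the number of
# clusters for a dense document-term matrix as (m * n) / t.
approx_k_sketch <- function(x) {
  m <- nrow(x)        # number of documents (rows)
  n <- ncol(x)        # number of terms (columns)
  t <- sum(x != 0)    # number of non-zero cells
  # Rounding to a whole number is an assumption of this sketch
  as.integer(round((m * n) / t))
}

# Toy matrix: 4 documents, 5 terms, 7 non-zero cells
dtm <- matrix(
  c(1, 0, 2, 0, 0,
    0, 1, 0, 0, 3,
    1, 1, 0, 0, 0,
    0, 0, 0, 4, 0),
  nrow = 4, byrow = TRUE
)

k <- approx_k_sketch(dtm)   # (4 * 5) / 7 is about 2.86
k
```

Sparser matrices give a larger ratio and therefore a larger suggested k.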
assignments
Topic Assignments
Description
A dataset containing a list of topic assignments by various clustering algorithms. Assignments
correspond to the rows (minus empty rows) of the presidential_debates_2012 data set.
Usage
data(assignments)
Format
A list with 3 elements
assign_cluster
Description
Assign clusters to documents/text elements.
Usage
assign_cluster(x, k = approx_k(get_dtm(x)), h = NULL, ...)
## S3 method for class 'hierarchical_cluster'
assign_cluster(x, k = approx_k(get_dtm(x)),
h = NULL, ...)
## S3 method for class 'kmeans_cluster'
assign_cluster(x, ...)
## S3 method for class 'skmeans_cluster'
assign_cluster(x, ...)
## S3 method for class 'nmf_cluster'
assign_cluster(x, ...)
Arguments
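x
A xxx_cluster object.
k
The number of clusters (can supply h instead). Defaults to use approx_k of the
DocumentTermMatrix produced by data_store.
h
The height at which to cut the dendrogram (determines number of clusters). If
this argument is supplied k is ignored.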
...
ignored.
Value
Returns an assign_cluster object; a named vector of cluster assignments with documents as
names. The object also contains the original data_store object and a join function. join is a
function (a closure) that captures information about the assign_cluster that makes rejoining to
the original data set simple. The user simply supplies the original data set as an argument to join
(attributes(FROM_ASSIGN_CLUSTER)$join(ORIGINAL_DATA)).
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
hierarchical_cluster(x) %>%
plot(h=.7, lwd=2)
hierarchical_cluster(x) %>%
assign_cluster(h=.7)
hierarchical_cluster(x, method="complete") %>%
plot(k=6)
hierarchical_cluster(x) %>%
assign_cluster(k=6)
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster()
ca2 <- assign_cluster(x2, k = 55)
summary(ca2)
## add to original data
attributes(ca2)$join(presidential_debates_2012)
## split text into clusters
get_text(ca2)
## Kmeans Algorithm
kmeans_cluster(x, k=6) %>%
assign_cluster()
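The Value section describes join as a closure that carries enough information to reattach cluster assignments to the original data. The base R sketch below only illustrates that closure pattern. `make_join` and its arguments are hypothetical names, and the handling of removed rows (filled with NA) is an assumption, not the package's documented behavior.

```r
# Hypothetical constructor (not the clustext implementation): returns a
# closure that remembers the assignments and which rows were dropped.
make_join <- function(assignments, removed, n_original) {
  function(original_data) {
    stopifnot(nrow(original_data) == n_original)
    # Start with NA for every original row (assumed treatment of removed rows)
    cluster <- rep(NA_integer_, n_original)
    kept <- setdiff(seq_len(n_original), removed)
    cluster[kept] <- assignments
    cbind(original_data, cluster = cluster)
  }
}

original <- data.frame(
  id = 1:5,
  text = c("a b", "", "c d", "a c", "b d"),
  stringsAsFactors = FALSE
)

# Row 2 is empty, so suppose it was removed before clustering
join <- make_join(assignments = c(1L, 2L, 1L, 2L), removed = 2L, n_original = 5L)
joined <- join(original)
joined
```

Because the closure captures the removed indices, the caller only has to pass the original data set, which mirrors the attributes(FROM_ASSIGN_CLUSTER)$join(ORIGINAL_DATA) usage shown above.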
as_topic
Description
View important terms as a comma separated string (a topic).
Usage
as_topic(x, max.n = 8, sort = TRUE, ...)
## S3 method for class 'get_terms'
as_topic(x, max.n = 8, sort = TRUE, ...)
Arguments
x
A get_terms object.
max.n
sort
logical. If TRUE the cluster topics are sorted by size (number of documents)
otherwise the topics are sorted by cluster number.
...
ignored.
Value
Returns a data.frame of "cluster", "count", and "terms". Pretty prints as clusters, number of
documents, and associated important terms.
Examples
library(dplyr)
myfit5 <- presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
textshape::combine() %>%
filter(person %in% c("ROMNEY", "OBAMA")) %>%
with(data_store(dialogue, stopwords = tm::stopwords("english"), min.char = 3)) %>%
hierarchical_cluster()
ca5 <- assign_cluster(myfit5, k = 50)
get_terms(ca5, .4) %>%
as_topic()
categorize
Description
Merge clusters, categories, and the original data back together.
Usage
categorize(data, assign.cluster, cluster.key)
Arguments
data
assign.cluster
An assign_cluster object.
cluster.key
Value
Returns a data.frame key of clusters and categories.
See Also
write_cluster_text, read_cluster_text
Examples
library(dplyr)
## Assign Clusters
ca <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 7)
## Write Cluster Text for Human Categorization
write_cluster_text(ca)
write_cluster_text(ca, n.sample=10)
write_cluster_text(ca, lead=" -", n.sample=10)
## Read Human Coded Categories Back In
categories_file <- system.file("additional/foo_turk.txt", package = "clustext")
readLines(categories_file)
(categories_key <- read_cluster_text(categories_file))
## Add Categories Back to Original Data Set
categorize(
data = presidential_debates_2012,
assign.cluster = ca,
cluster.key = categories_key
)
clustext
Description
Optimized, consistent tools for clustering text data.
compare
Description
An Adjusted Rand Index comparison of the assignments between different clustering algorithms.
Usage
compare(...)
Arguments
...
Value
Returns a pairwise comparison matrix of Adjusted Rand Indices between the algorithms. Higher
Adjusted Rand Index scores indicate higher cluster assignment agreement.
References
http://faculty.washington.edu/kayee/pca/supp.pdf
Examples
compare(
assignments$hierarchical_assignment,
assignments$kmeans_assignment,
assignments$skmeans_assignment,
assignments$nmf_assignment
)
## Understanding the ARI
set.seed(10)
w <- sample(1:10, 40, TRUE)
x <- 11-w
set.seed(20)
y <- sample(1:10, 40, TRUE)
set.seed(50)
z <- sample(1:10, 40, TRUE)
data.frame(w, x, y, z)
library(mclust)
mclust::adjustedRandIndex(w, x)
mclust::adjustedRandIndex(x, y)
mclust::adjustedRandIndex(x, z)
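To make the index itself less opaque, here is a minimal base R sketch of the standard Adjusted Rand Index computed from a contingency table. `adjusted_rand_sketch` is a hypothetical helper written for illustration. It is not the mclust or clustext code, and it does not guard against degenerate inputs such as a single cluster.

```r
# Hypothetical helper: Adjusted Rand Index from two assignment vectors.
adjusted_rand_sketch <- function(a, b) {
  tab <- table(a, b)                               # contingency table
  n <- sum(tab)
  sum_ij <- sum(choose(as.vector(tab), 2))         # pairs together in both
  sum_a <- sum(choose(rowSums(tab), 2))            # pairs together in a
  sum_b <- sum(choose(colSums(tab), 2))            # pairs together in b
  expected <- sum_a * sum_b / choose(n, 2)         # agreement expected by chance
  maximum <- (sum_a + sum_b) / 2                   # largest attainable index
  (sum_ij - expected) / (maximum - expected)
}

# Same partition under different labels: perfect agreement
ari_same <- adjusted_rand_sketch(c(1, 1, 2, 2, 3, 3), c(3, 3, 1, 1, 2, 2))

# Every pair grouped by one vector is split by the other
ari_cross <- adjusted_rand_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))

c(same = ari_same, crossed = ari_cross)
```

The first case shows why w and x = 11 - w in the example above agree perfectly: the index depends only on which items are grouped together, not on the labels.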
cosine_distance
Description
Utilizes the slam package to efficiently calculate cosine distance on large sparse matrices.
Usage
cosine_distance(x, ...)
## S3 method for class 'DocumentTermMatrix'
cosine_distance(x, ...)
## S3 method for class 'TermDocumentMatrix'
cosine_distance(x, ...)
Arguments
x
...
Value
Returns a cosine distance object of class "dist".
Author(s)
Michael Andrec and Tyler Rinker <tyler.rinker@gmail.com>.
References
http://stackoverflow.com/a/29755756/1000343
Examples
library(gofastr)
library(dplyr)
out <- presidential_debates_2012 %>%
with(q_dtm(dialogue)) %>%
cosine_distance()
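For intuition about what is being computed, the following base R sketch derives cosine distance for a small dense matrix. It assumes documents are rows and that no row is entirely zero. `cosine_distance_sketch` is a hypothetical name, and it uses dense matrix operations rather than the sparse slam routines the package relies on.

```r
# Hypothetical dense version of cosine distance between rows.
cosine_distance_sketch <- function(x) {
  x <- as.matrix(x)
  cross <- tcrossprod(x)             # dot products between all row pairs
  norms <- sqrt(diag(cross))         # Euclidean length of each row
  sim <- cross / outer(norms, norms) # cosine similarity
  as.dist(1 - sim)                   # dissimilarity as a "dist" object
}

dtm <- matrix(
  c(1, 1, 0,
    2, 2, 0,
    0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("a", "b", "c"))
)

d <- cosine_distance_sketch(dtm)
d
```

doc1 and doc2 point in the same direction, so their distance is (numerically) zero, while doc3 shares no terms with either and sits at distance 1.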
data_store
Description
A data structure which stores the text, DocumentTermMatrix, and information regarding removed
text elements which cannot be handled by the hierarchical_cluster function. This structure is
required because it documents important meta information, including removed elements, required
by other clustext functions. If the user wishes to combine documents (say by a common grouping
variable) it is recommended this be handled by combine prior to using data_store.
Usage
data_store(text, doc.names, min.term.freq = 1, min.doc.len = 1,
stopwords = tm::stopwords("english"), min.char = 3, max.char = NULL,
stem = FALSE, denumber = TRUE)
Arguments
text
A character vector.
doc.names
min.term.freq
min.doc.len
The minimum number of words a document must contain to be included in the data structure (otherwise it is stored as a removed element).
stopwords
min.char
max.char
stem
denumber
Value
Returns a list containing:
dtm A tf-idf weighted DocumentTermMatrix
text The text vector with unanalyzable elements removed
removed The indices of the removed text elements, i.e., documents not meeting min.doc.len
n.nonsparse The number of non-zero elements
Examples
data_store(presidential_debates_2012[["dialogue"]])
## Use `combine` to merge text prior to `data_store`
library(textshape)
library(dplyr)
dat <- presidential_debates_2012 %>%
dplyr::select(person, time, dialogue) %>%
textshape::combine()
## Elements in `ds` correspond to `dat` grouping vars
(ds <- with(dat, data_store(dialogue)))
dplyr::select(dat, -3)
## Add row names
(ds2 <- with(dat, data_store(dialogue, paste(person, time, sep = "_"))))
rownames(ds2[["dtm"]])
## Get a DocumentTermMatrix
get_dtm(ds2)
get_documents
Description
Get the documents associated with each of the k clusters.
Usage
get_documents(x, ...)
## S3 method for class 'assign_cluster'
get_documents(x, ...)
Arguments
x
An assign_cluster object.
...
ignored.
Value
Returns a list of vectors of document names.
Examples
library(dplyr)
mydocuments1 <- presidential_debates_2012 %>%
with(data_store(dialogue, paste(person, time, sep="-"))) %>%
hierarchical_cluster() %>%
assign_cluster(k = 6) %>%
get_documents()
mydocuments1
mydocuments2 <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 55) %>%
get_documents()
mydocuments2
get_dtm
Description
Extract the DocumentTermMatrix supplied to/produced by a hierarchical_cluster object.
Usage
get_dtm(x, ...)
## S3 method for class 'data_store'
get_dtm(x, ...)
## S3 method for class 'hierarchical_cluster'
get_dtm(x, ...)
## S3 method for class 'kmeans_cluster'
get_dtm(x, ...)
## S3 method for class 'skmeans_cluster'
get_dtm(x, ...)
## S3 method for class 'nmf_cluster'
get_dtm(x, ...)
Arguments
x
A hierarchical_cluster object.
...
ignored.
Value
Returns a DocumentTermMatrix.
Examples
library(dplyr)
presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
get_dtm()
get_removed
Description
Extract the text supplied to the hierarchical_cluster object.
Usage
get_removed(x, ...)
## S3 method for class 'hierarchical_cluster'
get_removed(x, ...)
## S3 method for class 'kmeans_cluster'
get_removed(x, ...)
## S3 method for class 'skmeans_cluster'
get_removed(x, ...)
## S3 method for class 'nmf_cluster'
get_removed(x, ...)
## S3 method for class 'data_store'
get_removed(x, ...)
Arguments
x
A hierarchical_cluster object.
...
ignored.
Value
Returns a vector of text strings.
Examples
library(dplyr)
presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
get_removed()
get_terms
Description
Get the weighted terms (weighted either by tf-idf or by the weights returned from the model, then
min/max scaled) associated with each of the k clusters.
Usage
get_terms(x, min.weight = 0.6, nrow = NULL, ...)
## S3 method for class 'assign_cluster_hierarchical'
get_terms(x, min.weight = 0.6,
nrow = NULL, ...)
## S3 method for class 'assign_cluster_kmeans'
get_terms(x, min.weight = 0.6, nrow = NULL,
...)
## S3 method for class 'assign_cluster_skmeans'
get_terms(x, min.weight = 0.6, nrow = NULL,
...)
## S3 method for class 'assign_cluster_nmf'
get_terms(x, min.weight = 0.6, nrow = NULL,
...)
Arguments
x
An assign_cluster object.
min.weight
nrow
...
ignored.
Value
Returns a list of data.frames of top weighted terms.
Examples
library(dplyr)
library(textshape)
myterms <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 55) %>%
get_terms()
myterms
textshape::bind_list(myterms[!sapply(myterms, is.null)], "Topic")
## Not run:
library(ggplot2)
library(gridExtra)
library(dplyr)
library(textshape)
library(wordcloud)
max.n <- max(textshape::bind_list(myterms)[["n"]])
myplots <- Map(function(x, y){
x %>%
mutate(term = factor(term, levels = rev(term))) %>%
ggplot(aes(term, weight=n)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0),limits=c(0, max.n)) +
ggtitle(sprintf("Topic: %s", y)) +
coord_flip()
}, myterms, names(myterms))
myplots[["ncol"]] <- 10
do.call(gridExtra::grid.arrange, myplots[!sapply(myplots, is.null)])
##wordclouds
par(mfrow=c(5, 11), mar=c(0, 4, 0, 0))
Map(function(x, y){
wordcloud::wordcloud(x[[1]], x[[2]], scale=c(1,.25),min.freq=1)
mtext(sprintf("Topic: %s", y), col = "blue", cex=.55, padj = 1.5)
}, myterms, names(myterms))
## End(Not run)
get_text
Description
Extract the text supplied to the hierarchical_cluster object.
Usage
get_text(x, ...)
## S3 method for class 'hierarchical_cluster'
get_text(x, ...)
## S3 method for class 'kmeans_cluster'
get_text(x, ...)
## S3 method for class 'nmf_cluster'
get_text(x, ...)
Arguments
x
A hierarchical_cluster object.
...
ignored.
Value
Returns a vector or list of text strings.
Examples
library(dplyr)
presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
get_text() %>%
head()
hierarchical_cluster
Description
Fit a hierarchical cluster to text data. Prior to distance measures being calculated the tf-idf (see
weightTfIdf) is applied to the DocumentTermMatrix. Cosine dissimilarity is used to generate the
distance matrix supplied to hclust. method defaults to "ward.D2". A faster cosine dissimilarity
calculation is used under the hood (see cosine_distance). Additionally, hclust is used to quickly
calculate the fit. Essentially, this is a wrapper function optimized for clustering text data.
Usage
hierarchical_cluster(x, distance = "cosine", method = "ward.D2", ...)
## S3 method for class 'data_store'
hierarchical_cluster(x, distance = "cosine",
method = "ward.D", ...)
Arguments
x
distance
method
The agglomeration method to be used. This must be (an unambiguous abbreviation of) one of "single", "complete", "average", "mcquitty", "ward.D",
"ward.D2", "centroid", or "median".
...
ignored.
Value
Returns an object of class "hclust".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
hierarchical_cluster(x) %>%
plot(k=4)
hierarchical_cluster(x) %>%
plot(h=.7, lwd=2)
hierarchical_cluster(x) %>%
assign_cluster(h=.7)
hierarchical_cluster(x, method="complete") %>%
plot(k=6)
hierarchical_cluster(x) %>%
assign_cluster(k=6)
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- hierarchical_cluster(x2)
plot(myfit2)
plot(myfit2, 55)
assign_cluster(myfit2, k = 55)
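The steps named in the Description (tf-idf weighting, cosine dissimilarity, Ward clustering, then cutting the tree) can be walked through on a toy count matrix in base R. This is a sketch under stated assumptions: the tf-idf here is length-normalized term frequency times log2(N / df), stats::hclust stands in for whichever hclust the package calls, and cutree plays the role of assign_cluster. None of this is the package's own code.

```r
# Toy counts: docs 1-2 use terms a and b, docs 3-4 use terms c and d
counts <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 2, 1,
    0, 0, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("a", "b", "c", "d"))
)

# tf-idf (assumed form): tf normalized by document length, times log2(N / df)
tf <- counts / rowSums(counts)
idf <- log2(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine dissimilarity between documents
cross <- tcrossprod(tfidf)
norms <- sqrt(diag(cross))
d <- as.dist(1 - cross / outer(norms, norms))

# Ward clustering on the dissimilarities, then cut into k = 2 groups
fit <- hclust(d, method = "ward.D2")
clusters <- cutree(fit, k = 2)
clusters
```

Documents that share vocabulary end up close in cosine terms (distance 0.2 here) and are merged first, so cutting at k = 2 separates the two vocabularies.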
jaccard_distance
Description
Utilizes the slam package to efficiently calculate jaccard distance on large sparse matrices.
Usage
jaccard_distance(x, ...)
## S3 method for class 'DocumentTermMatrix'
jaccard_distance(x, ...)
## S3 method for class 'TermDocumentMatrix'
jaccard_distance(x, ...)
Arguments
x
...
ignored.
Value
Returns a jaccard distance object of class "dist".
Author(s)
user41844 of StackOverflow, Dmitriy Selivanov, and Tyler Rinker <tyler.rinker@gmail.com>.
References
http://stackoverflow.com/a/36373333/1000343
http://stats.stackexchange.com/a/89947/7482
Examples
library(gofastr)
library(dplyr)
out <- presidential_debates_2012 %>%
with(q_dtm(dialogue)) %>%
jaccard_distance()
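As a point of reference, Jaccard distance between two documents is one minus the ratio of shared terms to all terms used by either document. The base R sketch below computes it for a small dense matrix. `jaccard_distance_sketch` is a hypothetical name, treating any positive count as presence is an assumption, and it uses dense operations rather than slam.

```r
# Hypothetical dense version of Jaccard distance between rows.
jaccard_distance_sketch <- function(x) {
  b <- (as.matrix(x) > 0) * 1             # presence/absence of each term
  inter <- tcrossprod(b)                  # shared terms for every row pair
  sizes <- rowSums(b)                     # distinct terms per document
  uni <- outer(sizes, sizes, `+`) - inter # terms used by either document
  as.dist(1 - inter / uni)
}

dtm <- matrix(
  c(1, 2, 0, 0,
    0, 1, 3, 0,
    0, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("a", "b", "c", "d"))
)

d <- jaccard_distance_sketch(dtm)
d
```

doc1 and doc2 share one of three distinct terms, giving a distance of 2/3, while doc3 shares nothing with doc1 and sits at distance 1.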
kmeans_cluster
Description
Fit a kmeans cluster to text data. Prior to distance measures being calculated the tf-idf (see weightTfIdf)
is applied to the DocumentTermMatrix.
Usage
kmeans_cluster(x, k, ...)
## S3 method for class 'data_store'
kmeans_cluster(x, k, ...)
Arguments
x
...
Value
Returns an object of class "kmeans".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
## 6 topic model
kmeans_cluster(x, k=6)
kmeans_cluster(x, k=6) %>%
assign_cluster()
kmeans_cluster(x, k=6) %>%
assign_cluster() %>%
summary()
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- kmeans_cluster(x2, 55)
assign_cluster(myfit2)
assign_cluster(myfit2) %>%
summary()
nmf_cluster
Description
Fit a robust non-negative matrix factorization cluster to text data via rnmf. Prior to distance measures being calculated the tf-idf (see weightTfIdf) is applied to the DocumentTermMatrix.
Usage
nmf_cluster(x, k = k, ...)
## S3 method for class 'data_store'
nmf_cluster(x, k, ...)
Arguments
x
k
...
Value
Returns an object of class "hclust".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
## 6 topic model
model6 <- nmf_cluster(x, k=6)
model6 %>%
assign_cluster()
model6 %>%
assign_cluster() %>%
summary()
## Not run:
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- nmf_cluster(x2, 55)
assign_cluster(myfit2)
assign_cluster(myfit2) %>%
summary()
## End(Not run)
plot.hierarchical_cluster
Plots a hierarchical_cluster Object
Description
Plots a hierarchical_cluster object
Usage
## S3 method for class 'hierarchical_cluster'
plot(x, k = approx_k(get_dtm(x)), h = NULL,
color = "red", ...)
Arguments
x
A hierarchical_cluster object.
k
The number of clusters (can supply h instead). Defaults to use approx_k of the
DocumentTermMatrix produced by data_store. Boxes are drawn around
the clusters.
h
The height at which to cut the dendrograms (determines number of clusters). If
this argument is supplied k is ignored. A line is drawn showing the cut point on
the dendrogram.
color
The color to make the cluster boxes (k) or line (h).
...
Other arguments passed to rect.hclust or abline.
presidential_debates_2012
2012 U.S. Presidential Debates
Description
A dataset containing a cleaned version of all three presidential debates for the 2012 election.
Usage
data(presidential_debates_2012)
Format
A data frame with 2912 rows and 4 variables
Details
print.assign_cluster
Description
Prints an assign_cluster object
Usage
## S3 method for class 'assign_cluster'
print(x, ...)
Arguments
x
An assign_cluster object.
...
ignored.
print.as_topic
Description
Prints an as_topic object
Usage
## S3 method for class 'as_topic'
print(x, ...)
Arguments
x
An as_topic object.
...
ignored.
print.compare
Description
Prints a compare object.
Usage
## S3 method for class 'compare'
print(x, digits = 3, ...)
Arguments
x
digits
...
print.data_store
Description
Prints a data_store object
Usage
## S3 method for class 'data_store'
print(x, ...)
Arguments
x
A data_store object.
...
ignored.
print.get_documents
Description
Prints a get_documents object
Usage
## S3 method for class 'get_documents'
print(x, ...)
Arguments
x
A get_documents object.
...
ignored.
print.get_terms
Description
Prints a get_terms object
Usage
## S3 method for class 'get_terms'
print(x, ...)
Arguments
x
A get_terms object.
...
ignored.
skmeans_cluster
Description
Fit a skmean cluster to text data. Prior to distance measures being calculated the tf-idf (see weightTfIdf)
is applied to the DocumentTermMatrix. Cosine dissimilarity is used to generate the distance matrix
supplied to skmeans.
Usage
skmeans_cluster(x, k, ...)
## S3 method for class 'data_store'
skmeans_cluster(x, k, ...)
Arguments
x
k
...
Value
Returns an object of class "skmean".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
## 6 topic model
myfit1 <- skmeans_cluster(x, k=6)
myfit1 %>%
assign_cluster()
myfit1 %>%
assign_cluster() %>%
summary()
## Not run:
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- skmeans_cluster(x2, 55)
assign_cluster(myfit2)
assign_cluster(myfit2) %>%
summary()
## End(Not run)
summary.assign_cluster
Summary of an assign_cluster Object
Description
Summary of an assign_cluster object
Usage
## S3 method for class 'assign_cluster'
summary(object, plot = TRUE, print = TRUE, ...)
Arguments
object
An assign_cluster object.
plot
...
ignored.
write_cluster_text
Description
Write cluster text from get_text(assign_cluster(myfit)) to an external file for categorization. After the file has been written with write_cluster_text a human coder can assign categories
to each cluster. Simply write the category after the Cluster #:. To set a cluster category equal to
another, simply write an equal sign followed by the other cluster to set as the same category (e.g.,
Cluster 10: =5 to set cluster #10 the same as cluster #5). See readLines(system.file("additional/foo_turk.txt"
for an example.
Usage
write_cluster_text(x, path, n.sample = NULL, lead = " * ", ...)
read_cluster_text(path, ...)
Arguments
x
An assign_cluster object.
path
n.sample
The length to limit the sample to (default gives all text in the cluster). Setting
this to an integer uses this as the number to randomly sample from.
lead
...
ignored.
See Also
categorize
Examples
library(dplyr)
## Assign Clusters
ca <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 7)
## Write Cluster Text for Human Categorization
write_cluster_text(ca)
write_cluster_text(ca, n.sample=10)
write_cluster_text(ca, lead=" -", n.sample=10)
Index
Topic cosine
cosine_distance
Topic data
data_store
Topic datasets
assignments
presidential_debates_2012
Topic dissimilarity
cosine_distance
jaccard_distance
Topic jaccard
jaccard_distance
Topic structure
data_store
abline
approx_k
as_topic
assign_cluster
assignments
categorize
clustext
clustext-package (clustext)
combine
compare
cosine_distance
data.frame
data_store
DocumentTermMatrix
get_documents
get_dtm
get_removed
get_terms
get_text
hclust
hierarchical_cluster
jaccard_distance
kmeans
kmeans_cluster
nmf_cluster
package-clustext (clustext)
plot.hierarchical_cluster
presidential_debates_2012
print.as_topic
print.assign_cluster
print.compare
print.data_store
print.get_documents
print.get_terms
read_cluster_text
read_cluster_text (write_cluster_text)
rect.hclust
rnmf
skmeans
skmeans_cluster
summary.assign_cluster
TermDocumentMatrix
vector
weightTfIdf
write_cluster_text