Clustext

Package clustext
April 2, 2016
Title Consistent Clustering for Text Data
Version 0.1.1
Maintainer Tyler Rinker <tyler.rinker@gmail.com>
Description Optimized, consistent tools for clustering text data.
Depends R (>= 3.2.3)
Imports dplyr, fastcluster, gofastr, graphics, Matrix, mclust,
methods, rNMF, skmeans, slam, stats, termco, textshape, tm
Suggests testthat
Date 2016-04-02
License GPL-2
LazyData TRUE
Roxygen list(wrap = FALSE)
RoxygenNote 5.0.1
Author Tyler Rinker [aut, cre]
RemoteType local
RemoteUrl C:\Users\Tyler\GitHub\clustext
RemoteUsername trinker
RemoteRepo clustext
R topics documented:
approx_k
assignments
assign_cluster
as_topic
categorize
clustext
compare
cosine_distance
data_store
get_documents
get_dtm
get_removed
get_terms
get_text
hierarchical_cluster
jaccard_distance
kmeans_cluster
nmf_cluster
plot.hierarchical_cluster
presidential_debates_2012
print.assign_cluster
print.as_topic
print.compare
print.data_store
print.get_documents
print.get_terms
skmeans_cluster
summary.assign_cluster
write_cluster_text
approx_k
Approximate Number of Clusters for a Text Matrix
Description
Can & Ozkarahan's (1990) formula for approximating the number of clusters for a text matrix: (m * n)/t, where m and n are the dimensions of the matrix and t is the number of non-zero elements in matrix A.
Usage
approx_k(x, verbose = TRUE)
## S3 method for class 'TermDocumentMatrix'
approx_k(x, verbose = TRUE)
## S3 method for class 'DocumentTermMatrix'
approx_k(x, verbose = TRUE)
Arguments
x
A matrix.
verbose
logical. If TRUE the k determination is printed.
Value
Returns an integer.
References
Can, F., Ozkarahan, E. A. (1990). Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems 15 (4): 483.
doi:10.1145/99935.99938.
Examples
library(gofastr)
library(dplyr)
presidential_debates_2012 %>%
    with(q_dtm(dialogue)) %>%
    approx_k()
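The formula in the Description can be checked by hand on a small dense matrix. The following is a base R sketch only: approx_k_sketch and the toy matrix are hypothetical, and rounding to the nearest integer is an assumption made here so that the result is an integer, as the Value section states.

```r
# Hypothetical helper illustrating (m * n)/t; not the package's implementation.
approx_k_sketch <- function(mat) {
    m <- nrow(mat)          # number of rows
    n <- ncol(mat)          # number of columns
    t <- sum(mat != 0)      # number of non-zero cells
    round((m * n) / t)      # rounding is an assumption of this sketch
}

# Made-up 3 x 4 count matrix with 5 non-zero cells
toy <- matrix(
    c(1, 0, 2, 0,
      0, 3, 0, 0,
      1, 0, 0, 4),
    nrow = 3, byrow = TRUE
)

k <- approx_k_sketch(toy)   # (3 * 4) / 5 = 2.4, which rounds to 2
k
```

Sparser matrices give a larger estimate, since t shrinks while m * n stays fixed.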
assignments
Topic Assignments
Description
A dataset containing a list of topic assignments by various clustering algorithms. Assignments
correspond to the rows (minus empty rows) of the presidential_debates_2012 data set.
Usage
data(assignments)
Format
A list with 3 elements
assign_cluster
Assign Clusters to Documents/Text Elements
Description
Assign clusters to documents/text elements.
Usage
assign_cluster(x, k = approx_k(get_dtm(x)), h = NULL, ...)
## S3 method for class 'hierarchical_cluster'
assign_cluster(x, k = approx_k(get_dtm(x)),
  h = NULL, ...)
## S3 method for class 'kmeans_cluster'
assign_cluster(x, ...)
## S3 method for class 'skmeans_cluster'
assign_cluster(x, ...)
## S3 method for class 'nmf_cluster'
assign_cluster(x, ...)
Arguments
x
An xxx_cluster object.
k
The number of clusters (can supply h instead). Defaults to use approx_k of the DocumentTermMatrix produced by data_store.
h
The height at which to cut the dendrogram (determines number of clusters). If this argument is supplied k is ignored.
...
ignored.
Value
Returns an assign_cluster object: a named vector of cluster assignments with documents as names. The object also contains the original data_store object and a join function. join is a function (a closure) that captures information about the assign_cluster object, which makes rejoining to the original data set simple. The user simply supplies the original data set as an argument to join (attributes(FROM_ASSIGN_CLUSTER)$join(ORIGINAL_DATA)).
Examples
library(dplyr)

x <- with(
    presidential_debates_2012,
    data_store(dialogue, paste(person, time, sep = "_"))
)

hierarchical_cluster(x) %>%
    plot(h=.7, lwd=2)

hierarchical_cluster(x) %>%
    assign_cluster(h=.7)

hierarchical_cluster(x, method="complete") %>%
    plot(k=6)

hierarchical_cluster(x) %>%
    assign_cluster(k=6)

x2 <- presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster()

ca2 <- assign_cluster(x2, k = 55)
summary(ca2)

## add to original data
attributes(ca2)$join(presidential_debates_2012)

## split text into clusters
get_text(ca2)

## Kmeans Algorithm
kmeans_cluster(x, k=6) %>%
    assign_cluster()

x3 <- presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    kmeans_cluster(55)

ca3 <- assign_cluster(x3)
summary(ca3)

## split text into clusters
get_text(ca3)
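The join closure described under Value can be pictured with a short base R sketch. make_join and its arguments are hypothetical; the sketch only illustrates how a closure can capture the indices of removed elements so that assignments line up with the rows of the original data. The package's own join may differ in its details.

```r
# Hypothetical constructor illustrating the closure idea; not clustext's code.
make_join <- function(clusters, removed) {
    # `clusters` and `removed` are captured by the returned function
    function(data) {
        out <- rep(NA_integer_, nrow(data))
        kept <- setdiff(seq_len(nrow(data)), removed)
        stopifnot(length(kept) == length(clusters))
        out[kept] <- clusters        # removed rows keep NA
        data[["cluster"]] <- out
        data
    }
}

# Made-up data: the second element is empty and was dropped before clustering
original <- data.frame(
    text = c("taxes and jobs", "", "energy policy", "jobs plan"),
    stringsAsFactors = FALSE
)

join <- make_join(clusters = c(1L, 2L, 1L), removed = 2L)
joined <- join(original)
joined
```

Because the removed indices travel with the closure, the caller only has to supply the original data set.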
as_topic
Convert get_terms to Topics
Description
View important terms as a comma separated string (a topic).
Usage
as_topic(x, max.n = 8, sort = TRUE, ...)
## S3 method for class 'get_terms'
as_topic(x, max.n = 8, sort = TRUE, ...)
Arguments
x
A get_terms object.
max.n
The max number of words to show before truncation.
sort
logical. If TRUE the cluster topics are sorted by size (number of documents)
otherwise the topics are sorted by cluster number.
...
ignored.
Value
Returns a data.frame of "cluster", "count", and "terms". Pretty prints as clusters, number of
documents, and associated important terms.
Examples
library(dplyr)

myfit5 <- presidential_debates_2012 %>%
    mutate(tot = gsub("\\..+$", "", tot)) %>%
    textshape::combine() %>%
    filter(person %in% c("ROMNEY", "OBAMA")) %>%
    with(data_store(dialogue, stopwords = tm::stopwords("english"), min.char = 3)) %>%
    hierarchical_cluster()

ca5 <- assign_cluster(myfit5, k = 50)

get_terms(ca5, .4) %>%
    as_topic()

get_terms(ca5, .4) %>%
    as_topic(sort=FALSE)

get_terms(ca5, .4) %>%
    as_topic()
categorize
Merge Clusters & Cluster Categories Back to Original Data
Description
Merge clusters, categories, and the original data back together.
Usage
categorize(data, assign.cluster, cluster.key)
Arguments
data
A data set that was fit with a cluster model.
assign.cluster
An assign_cluster object.
cluster.key
A key of clusters and categories, such as the one returned by read_cluster_text (see Examples).
Value
Returns a data.frame key of clusters and categories.
See Also
write_cluster_text, read_cluster_text
Examples
library(dplyr)

## Assign Clusters
ca <- presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster() %>%
    assign_cluster(k = 7)

## Write Cluster Text for Human Categorization
write_cluster_text(ca)
write_cluster_text(ca, n.sample=10)
write_cluster_text(ca, lead=" -", n.sample=10)

## Read Human Coded Categories Back In
categories_file <- system.file("additional/foo_turk.txt", package = "clustext")
readLines(categories_file)
(categories_key <- read_cluster_text(categories_file))

## Add Categories Back to Original Data Set
categorize(
    data = presidential_debates_2012,
    assign.cluster = ca,
    cluster.key = categories_key
)
clustext
Consistent Clustering for Text Data
Description
Optimized, consistent tools for clustering text data.
compare
Adjusted Rand Index Comparison Between Algorithms
Description
An Adjusted Rand Index comparison of the assignments between different clustering algorithms.
Usage
compare(...)
Arguments
...
A series of outputs from assign_cluster for various cluster algorithms.
Value
Returns a pair-wise comparison matrix of Adjusted Rand Indices between algorithms. Higher Adjusted Rand Index scores indicate higher cluster assignment agreement.
References
http://faculty.washington.edu/kayee/pca/supp.pdf
Examples
compare(
assignments$hierarchical_assignment,
assignments$kmeans_assignment,
assignments$skmeans_assignment,
assignments$nmf_assignment
)
## Understanding the ARI
set.seed(10)
w <- sample(1:10, 40, TRUE)
x <- 11-w
set.seed(20)
y <- sample(1:10, 40, TRUE)
set.seed(50)
z <- sample(1:10, 40, TRUE)
data.frame(w, x, y, z)
library(mclust)
mclust::adjustedRandIndex(w, x)
mclust::adjustedRandIndex(x, y)
mclust::adjustedRandIndex(x, z)
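For readers who want to see what the Adjusted Rand Index measures, the following base R sketch computes it from the contingency table of two label vectors. ari_sketch is a hypothetical helper written for illustration, and the label vectors are made up; the Examples above use mclust::adjustedRandIndex instead.

```r
# Hypothetical helper: Adjusted Rand Index from a contingency table.
ari_sketch <- function(a, b) {
    tab <- table(a, b)
    pairs <- function(v) sum(choose(v, 2))     # number of pairs within counts
    index <- pairs(tab)                        # pairs placed together by both
    row_pairs <- pairs(rowSums(tab))
    col_pairs <- pairs(colSums(tab))
    expected <- row_pairs * col_pairs / choose(length(a), 2)
    maximum <- (row_pairs + col_pairs) / 2
    (index - expected) / (maximum - expected)
}

# The same partition under different labels agrees perfectly
w <- c(1, 1, 2, 2, 3, 3)
relabelled <- 4 - w
same <- ari_sketch(w, relabelled)

# Two partitions that never keep a pair together score below zero
crossed <- ari_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
c(same = same, crossed = crossed)
```

The index depends only on which elements are grouped together, not on the label values, which is why w and x <- 11-w in the Examples agree perfectly. The sketch returns NaN when both vectors put every element in a single cluster.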
cosine_distance
Optimized Computation of Cosine Distance
Description
Utilizes the slam package to efficiently calculate cosine distance on large sparse matrices.
Usage
cosine_distance(x, ...)
Arguments
x
A data type (e.g., DocumentTermMatrix or TermDocumentMatrix).
...
ignored.
Value
Returns a cosine distance object of class "dist".
Author(s)
Michael Andrec and Tyler Rinker <tyler.rinker@gmail.com>.
References
http://stackoverflow.com/a/29755756/1000343
Examples
library(gofastr)
library(dplyr)
out <- presidential_debates_2012 %>%
    with(q_dtm(dialogue)) %>%
    cosine_distance()
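As a point of reference for what is being computed, cosine distance on a small dense matrix can be sketched in base R as one minus the cosine similarity between rows. cosine_distance_sketch and the toy matrix are hypothetical, and treating rows as documents is an assumption; the package function works on sparse matrices through slam.

```r
# Hypothetical dense-matrix illustration; rows are assumed to be documents.
cosine_distance_sketch <- function(mat) {
    norms <- sqrt(rowSums(mat^2))                       # row lengths
    similarity <- tcrossprod(mat) / outer(norms, norms) # pairwise cosine
    as.dist(1 - similarity)
}

# Made-up term counts: a and b point the same way, c is orthogonal to both
docs <- rbind(
    a = c(1, 0, 1),
    b = c(2, 0, 2),
    c = c(0, 3, 0)
)

d <- cosine_distance_sketch(docs)
d
```

Documents a and b differ only in length, so their distance is 0, while c shares no terms with either and sits at distance 1.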
data_store
Data Structure for clustext
Description
A data structure which stores the text, the DocumentTermMatrix, and information regarding removed text elements which cannot be handled by the hierarchical_cluster function. This structure is required because it documents important meta information, including removed elements, required by other clustext functions. If the user wishes to combine documents (say by a common grouping variable) it is recommended this be handled by combine prior to using data_store.
Usage
data_store(text, doc.names, min.term.freq = 1, min.doc.len = 1,
stopwords = tm::stopwords("english"), min.char = 3, max.char = NULL,
stem = FALSE, denumber = TRUE)
Arguments
text
A character vector.
doc.names
An optional vector of document names corresponding to the length of text.
min.term.freq
The minimum times a term must appear to be included in the DocumentTermMatrix.
min.doc.len
The minimum number of words a document must contain to be included in the data structure (otherwise it is stored as a removed element).
stopwords
A vector of stopwords to remove.
min.char
The minimum number of characters a word must have to be retained.
max.char
The maximum number of characters a word may have to be retained.
stem
Logical. If TRUE the stopwords will be stemmed.
denumber
Logical. If TRUE numbers will be excluded.
Value
Returns a list containing:
dtm A tf-idf weighted DocumentTermMatrix
text The text vector with unanalyzable elements removed
removed The indices of the removed text elements, i.e., documents not meeting min.doc.len
n.nonsparse The length of the non-zero elements
Examples
data_store(presidential_debates_2012[["dialogue"]])
## Use `combine` to merge text prior to `data_store`
library(textshape)
library(dplyr)
dat <- presidential_debates_2012 %>%
    dplyr::select(person, time, dialogue) %>%
    textshape::combine()
## Elements in `ds` correspond to `dat` grouping vars
(ds <- with(dat, data_store(dialogue)))
dplyr::select(dat, -3)
## Add row names
(ds2 <- with(dat, data_store(dialogue, paste(person, time, sep = "_"))))
rownames(ds2[["dtm"]])
## Get a DocumentTermMatrix
get_dtm(ds2)
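The dtm element is described above as tf-idf weighted. As a rough illustration of that weighting on a small dense count matrix, the sketch below assumes a length-normalized term frequency and a base 2 logarithm for the inverse document frequency. The exact options data_store uses are not stated in this manual, and the counts are made up.

```r
# Made-up document-by-term counts
counts <- rbind(
    doc1 = c(jobs = 2, taxes = 0, energy = 0),
    doc2 = c(jobs = 1, taxes = 1, energy = 0),
    doc3 = c(jobs = 1, taxes = 0, energy = 2)
)

# Term frequency: each count divided by its document's length (assumed)
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(documents / documents containing the term)
idf <- log2(nrow(counts) / colSums(counts > 0))

# tf-idf: scale each term column by its idf
tfidf <- sweep(tf, 2, idf, `*`)
tfidf
```

A term that occurs in every document, such as jobs here, receives a weight of zero, so it does not contribute to the distances computed later.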
get_documents
Get Documents Based on Cluster Assignment in assign_cluster
Description
Get the documents associated with each of the k clusters.
Usage
get_documents(x, ...)
## S3 method for class 'assign_cluster'
get_documents(x, ...)
Arguments
x
An assign_cluster object.
...
ignored.
Value
Returns a list of vectors of document names.
Examples
library(dplyr)
mydocuments1 <- presidential_debates_2012 %>%
    with(data_store(dialogue, paste(person, time, sep="-"))) %>%
    hierarchical_cluster() %>%
    assign_cluster(k = 6) %>%
    get_documents()

mydocuments1

mydocuments2 <- presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster() %>%
    assign_cluster() %>%
    get_documents()

mydocuments2
get_dtm
Get a DocumentTermMatrix Stored in a hierarchical_cluster Object
Description
Extract the DocumentTermMatrix supplied to/produced by a hierarchical_cluster object.
Usage
get_dtm(x, ...)
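## S3 method for class 'data_store'
get_dtm(x, ...)
## S3 method for class 'hierarchical_cluster'
get_dtm(x, ...)
## S3 method for class 'kmeans_cluster'
get_dtm(x, ...)
## S3 method for class 'skmeans_cluster'
get_dtm(x, ...)
## S3 method for class 'nmf_cluster'
get_dtm(x, ...)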
Arguments
x
A hierarchical_cluster object.
...
ignored.
Value
Returns a DocumentTermMatrix.
Examples
library(dplyr)

presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster() %>%
    get_dtm()
get_removed
Get Removed Text Elements Stored in a hierarchical_cluster Object
Description
Extract the indices of the text elements that were removed from the data supplied to the hierarchical_cluster object (see data_store).
Usage
get_removed(x, ...)
Arguments
x
A hierarchical_cluster object.
...
ignored.
Value
Returns a vector of indices of the removed text elements.
Examples
library(dplyr)

presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster() %>%
    get_removed()
get_terms
Get Terms Based on Cluster Assignment in assign_cluster
Description
Get the terms, weighted (either by tf-idf or as returned from the model) and min/max scaled, associated with each of the k clusters.
Usage
get_terms(x, min.weight = 0.6, nrow = NULL, ...)
## S3 method for class 'assign_cluster_hierarchical'
get_terms(x, min.weight = 0.6, nrow = NULL, ...)
## S3 method for class 'assign_cluster_kmeans'
get_terms(x, min.weight = 0.6, nrow = NULL, ...)
## S3 method for class 'assign_cluster_skmeans'
get_terms(x, min.weight = 0.6, nrow = NULL, ...)
## S3 method for class 'assign_cluster_nmf'
get_terms(x, min.weight = 0.6, nrow = NULL, ...)
Arguments
x
An assign_cluster object.
min.weight
The lowest min/max scaled tf-idf weighting to consider as a document's salient term.
nrow
The max number of rows to display in the returned data.frames.
...
ignored.
Value
Returns a list of data.frames of top weighted terms.
Examples
library(dplyr)
library(textshape)

myterms <- presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster() %>%
    assign_cluster(k = 55) %>%
    get_terms()

myterms
textshape::bind_list(myterms[!sapply(myterms, is.null)], "Topic")

## Not run:
library(ggplot2)
library(gridExtra)
library(dplyr)
library(textshape)
library(wordcloud)

max.n <- max(textshape::bind_list(myterms)[["n"]])

myplots <- Map(function(x, y){
    x %>%
        mutate(term = factor(term, levels = rev(term))) %>%
        ggplot(aes(term, weight=n)) +
            geom_bar() +
            scale_y_continuous(expand = c(0, 0),limits=c(0, max.n)) +
            ggtitle(sprintf("Topic: %s", y)) +
            coord_flip()
}, myterms, names(myterms))

myplots[["ncol"]] <- 10

do.call(gridExtra::grid.arrange, myplots[!sapply(myplots, is.null)])

##wordclouds
par(mfrow=c(5, 11), mar=c(0, 4, 0, 0))
Map(function(x, y){
    wordcloud::wordcloud(x[[1]], x[[2]], scale=c(1,.25),min.freq=1)
    mtext(sprintf("Topic: %s", y), col = "blue", cex=.55, padj = 1.5)
}, myterms, names(myterms))
## End(Not run)
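The role of min.weight can be illustrated with a small base R sketch of min/max scaling followed by a threshold. min_max and the weights are hypothetical, and whether the package scales within each cluster or across clusters is not stated here, so the sketch assumes a single cluster's term weights.

```r
# Hypothetical helper: rescale weights to the range [0, 1]
min_max <- function(w) (w - min(w)) / (max(w) - min(w))

# Made-up tf-idf weights for the terms of one cluster
weights <- c(jobs = 0.10, taxes = 0.40, energy = 0.25, debt = 0.34)

scaled <- min_max(weights)

# Keep terms at or above the threshold, mirroring min.weight = 0.6
salient <- scaled[scaled >= 0.6]
salient
```

Lowering the threshold, as in get_terms(ca5, .4), lets more terms through for each cluster.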
get_text
Get a Text Stored in Various Objects
Description
Extract the text supplied to the hierarchical_cluster object.
Usage
get_text(x, ...)
Arguments
x
...
ignored.
Value
Returns a vector or list of text strings.
Examples
library(dplyr)

presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster() %>%
    get_text() %>%
    head()
hierarchical_cluster
Fit a Hierarchical Cluster
Description
Fit a hierarchical cluster to text data. Prior to distance measures being calculated, tf-idf weighting (see weightTfIdf) is applied to the DocumentTermMatrix. Cosine dissimilarity is used to generate the distance matrix supplied to hclust. method defaults to "ward.D2". A faster cosine dissimilarity calculation is used under the hood (see cosine_distance). Additionally, hclust from the fastcluster package is used to quickly calculate the fit. Essentially, this is a wrapper function optimized for clustering text data.
Usage
hierarchical_cluster(x, distance = "cosine", method = "ward.D2", ...)
## S3 method for class 'data_store'
hierarchical_cluster(x, distance = "cosine",
  method = "ward.D", ...)
Arguments
x
A data store object (see data_store).
distance
A distance measure ("cosine" or "jaccard").
method
The agglomeration method to be used. This must be (an unambiguous abbreviation of) one of "single", "complete", "average", "mcquitty", "ward.D",
"ward.D2", "centroid", or "median".
...
ignored.
Value
Returns an object of class "hclust".
Examples
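library(dplyr)

x <- with(
    presidential_debates_2012,
    data_store(dialogue, paste(person, time, sep = "_"))
)

hierarchical_cluster(x) %>%
    plot(k=4)

hierarchical_cluster(x) %>%
    plot(h=.7, lwd=2)

hierarchical_cluster(x) %>%
    assign_cluster(h=.7)

hierarchical_cluster(x, method="complete") %>%
    plot(k=6)

hierarchical_cluster(x) %>%
    assign_cluster(k=6)

x2 <- presidential_debates_2012 %>%
    with(data_store(dialogue))

myfit2 <- hierarchical_cluster(x2)

plot(myfit2)
plot(myfit2, 55)

assign_cluster(myfit2, k = 55)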
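The steps named in the Description (a cosine distance matrix, a "ward.D2" fit, and a cut by k or h) can be walked through with base R alone. This is a sketch under stated assumptions: the weights are made up, stats::hclust stands in for the faster hclust mentioned above, and cutree stands in for what assign_cluster does with k and h.

```r
# Made-up weights: two economy documents and two energy documents
docs <- rbind(
    econ1   = c(3, 1, 0, 0),
    econ2   = c(2, 2, 0, 0),
    energy1 = c(0, 0, 4, 1),
    energy2 = c(0, 0, 1, 3)
)

# Cosine distance between rows (1 - cosine similarity)
norms <- sqrt(rowSums(docs^2))
cosine_dist <- as.dist(1 - tcrossprod(docs) / outer(norms, norms))

# Hierarchical fit with the documented default agglomeration method
fit <- hclust(cosine_dist, method = "ward.D2")

# Cut the tree by number of clusters, or by height
by_k <- cutree(fit, k = 2)
by_h <- cutree(fit, h = 0.7)
by_k
```

Here both cuts recover the economy and energy groups, because the two within-group merges happen well below a height of 0.7 and the final merge happens above it.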
jaccard_distance
Optimized Computation of Jaccard Distance
Description
Utilizes the slam package to efficiently calculate jaccard distance on large sparse matrices.
Usage
jaccard_distance(x, ...)
Arguments
x
A data type (e.g., DocumentTermMatrix or TermDocumentMatrix).
...
ignored.
Value
Returns a jaccard distance object of class "dist".
Author(s)
user41844 of StackOverflow, Dmitriy Selivanov, and Tyler Rinker <tyler.rinker@gmail.com>.
References
http://stackoverflow.com/a/36373333/1000343
http://stats.stackexchange.com/a/89947/7482
Examples
library(gofastr)
library(dplyr)
out <- presidential_debates_2012 %>%
    with(q_dtm(dialogue)) %>%
    jaccard_distance()
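For comparison with the sparse implementation, Jaccard distance on a small dense matrix can be sketched in base R as one minus the ratio of shared terms to all terms used by either document. jaccard_distance_sketch and the toy matrix are hypothetical, and reducing counts to presence or absence with rows as documents is an assumption of this sketch.

```r
# Hypothetical dense-matrix illustration; rows are assumed to be documents.
jaccard_distance_sketch <- function(mat) {
    present <- (mat > 0) * 1                 # presence/absence of each term
    shared <- tcrossprod(present)            # size of each pairwise intersection
    sizes <- rowSums(present)                # number of distinct terms per document
    union_size <- outer(sizes, sizes, "+") - shared
    as.dist(1 - shared / union_size)
}

# Made-up counts: a and b share one of three terms, c shares none
docs <- rbind(
    a = c(1, 1, 0, 0),
    b = c(2, 0, 3, 0),
    c = c(0, 0, 0, 5)
)

d <- jaccard_distance_sketch(docs)
d
```

A document with no terms at all would produce a division by zero in this sketch, which is one reason empty elements need to be removed before distances are computed.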
kmeans_cluster
Fit a Kmeans Cluster
Description
Fit a kmeans cluster to text data. Prior to distance measures being calculated the tf-idf (see weightTfIdf)
is applied to the DocumentTermMatrix.
Usage
kmeans_cluster(x, k, ...)
## S3 method for class 'data_store'
kmeans_cluster(x, k, ...)
Arguments
x
A data store object (see data_store).
k
The number of clusters.
...
Other arguments passed to kmeans.
Value
Returns an object of class "kmeans".
Examples
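library(dplyr)

x <- with(
    presidential_debates_2012,
    data_store(dialogue, paste(person, time, sep = "_"))
)

## 6 topic model
kmeans_cluster(x, k=6)

kmeans_cluster(x, k=6) %>%
    assign_cluster()

kmeans_cluster(x, k=6) %>%
    assign_cluster() %>%
    summary()

x2 <- presidential_debates_2012 %>%
    with(data_store(dialogue))

myfit2 <- kmeans_cluster(x2, 55)

assign_cluster(myfit2)

assign_cluster(myfit2) %>%
    summary()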
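Since extra arguments are passed on to kmeans, it may help to see the underlying call on a small weighted matrix. The sketch below uses stats::kmeans directly on made-up weights; the nstart value and the seed are choices made for this illustration, not package defaults.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Made-up document-by-term weights: two economy and two energy documents
weights <- rbind(
    econ1   = c(0.9, 0.1, 0.0),
    econ2   = c(0.8, 0.2, 0.0),
    energy1 = c(0.0, 0.1, 0.9),
    energy2 = c(0.1, 0.0, 0.9)
)

# nstart reruns the algorithm from several random starts and keeps the best fit
fit <- kmeans(weights, centers = 2, nstart = 10)

# Named vector of cluster memberships, one entry per document
fit$cluster
```

The cluster numbers themselves are arbitrary; only the grouping of documents carries meaning.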
nmf_cluster
Fit a Non-Negative Matrix Factorization Cluster
Description
Fit a robust non-negative matrix factorization cluster to text data via rnmf. Prior to distance measures being calculated the tf-idf (see weightTfIdf) is applied to the DocumentTermMatrix.
Usage
nmf_cluster(x, k = k, ...)
## S3 method for class 'data_store'
nmf_cluster(x, k, ...)
Arguments
x
A data store object (see data_store).
k
The number of clusters.
...
Other arguments passed to rnmf.
Value
Returns an object of class "nmf_cluster".
Examples
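library(dplyr)

x <- with(
    presidential_debates_2012,
    data_store(dialogue, paste(person, time, sep = "_"))
)

## 6 topic model
model6 <- nmf_cluster(x, k=6)

model6 %>%
    assign_cluster()

model6 %>%
    assign_cluster() %>%
    summary()

## Not run:
x2 <- presidential_debates_2012 %>%
    with(data_store(dialogue))

myfit2 <- nmf_cluster(x2, 55)

assign_cluster(myfit2)

assign_cluster(myfit2) %>%
    summary()
## End(Not run)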
plot.hierarchical_cluster
Plots a hierarchical_cluster Object
Description
Plots a hierarchical_cluster object
Usage
## S3 method for class 'hierarchical_cluster'
plot(x, k = approx_k(get_dtm(x)), h = NULL,
  color = "red", ...)
Arguments
x
A hierarchical_cluster object.
k
The number of clusters (can supply h instead). Defaults to use approx_k of the DocumentTermMatrix produced by data_store. Boxes are drawn around the clusters.
h
The height at which to cut the dendrogram (determines number of clusters). If this argument is supplied k is ignored. A line is drawn showing the cut point on the dendrogram.
color
The color to make the cluster boxes (k) or line (h).
...
Other arguments passed to rect.hclust or abline.
presidential_debates_2012
2012 U.S. Presidential Debates
Description
A dataset containing a cleaned version of all three presidential debates for the 2012 election.
Usage
data(presidential_debates_2012)
Format
A data frame with 2912 rows and 4 variables
Details
person. The speaker

tot. Turn of talk
dialogue. The words spoken
time. Variable indicating which of the three debates the dialogue is from
print.assign_cluster
Prints an assign_cluster Object
Description
Prints an assign_cluster object
Usage
## S3 method for class 'assign_cluster'
print(x, ...)
Arguments
x
An assign_cluster object.
...
ignored.
print.as_topic
Prints an as_topic Object
Description
Prints an as_topic object
Usage
## S3 method for class 'as_topic'
print(x, ...)
Arguments
x
An as_topic object.
...
ignored.
print.compare
Prints a compare Object.
Description
Prints a compare object.
Usage
## S3 method for class 'compare'
print(x, digits = 3, ...)
Arguments
x
The compare object.
digits
Number of decimal places to print.
...
ignored.
print.data_store
Prints a data_store Object
Description
Prints a data_store object
Usage
## S3 method for class 'data_store'
print(x, ...)
Arguments
x
A data_store object.
...
ignored.
print.get_documents
Prints a get_documents Object
Description
Prints a get_documents object
Usage
## S3 method for class 'get_documents'
print(x, ...)
Arguments
x
A get_documents object.
...
ignored.
print.get_terms
Prints a get_terms Object
Description
Prints a get_terms object
Usage
## S3 method for class 'get_terms'
print(x, ...)
Arguments
x
A get_terms object.
...
ignored.
skmeans_cluster
Fit a skmeans Cluster
Description
Fit a skmeans (spherical k-means) cluster to text data. Prior to distance measures being calculated, tf-idf weighting (see weightTfIdf) is applied to the DocumentTermMatrix. Cosine dissimilarity is used to generate the distance matrix supplied to skmeans.
Usage
skmeans_cluster(x, k, ...)
## S3 method for class 'data_store'
skmeans_cluster(x, k, ...)
Arguments
x
A data store object (see data_store).
k
The number of clusters.
...
Other arguments passed to skmeans.
Value
Returns an object of class "skmeans".
Examples
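library(dplyr)

x <- with(
    presidential_debates_2012,
    data_store(dialogue, paste(person, time, sep = "_"))
)

## 6 topic model
myfit1 <- skmeans_cluster(x, k=6)

myfit1 %>%
    assign_cluster()

myfit1 %>%
    assign_cluster() %>%
    summary()

## Not run:
x2 <- presidential_debates_2012 %>%
    with(data_store(dialogue))

myfit2 <- skmeans_cluster(x2, 55)

assign_cluster(myfit2)

assign_cluster(myfit2) %>%
    summary()
## End(Not run)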
summary.assign_cluster
Summary of an assign_cluster Object
Description
Summary of an assign_cluster object
Usage
## S3 method for class 'assign_cluster'
summary(object, plot = TRUE, print = TRUE, ...)
Arguments
object
An assign_cluster object.
plot
logical. If TRUE an accompanying bar plot is produced as well.
print
logical. If TRUE data.frame counts are printed.
...
ignored.
write_cluster_text
Write/Read Cluster Text for Human Categorization
Description
Write cluster text from get_text(assign_cluster(myfit)) to an external file for categorization. After the file has been written with write_cluster_text, a human coder can assign categories to each cluster. Simply write the category after the Cluster #:. To set a cluster category equal to another, simply write an equal sign followed by the other cluster to set as the same category (e.g., Cluster 10: =5 to set cluster #10 the same as cluster #5). See readLines(system.file("additional/foo_turk.txt", package = "clustext")) for an example.
Usage
write_cluster_text(x, path, n.sample = NULL, lead = " * ", ...)
read_cluster_text(path, ...)
Arguments
x
An assign_cluster object.
path
A path to the file (.txt is recommended).
n.sample
The length to limit the sample to (default gives all text in the cluster). Setting
this to an integer uses this as the number to randomly sample from.
lead
A leading character string prefix to give the cluster text.
...
ignored.
See Also
categorize
Examples
library(dplyr)

## Assign Clusters
ca <- presidential_debates_2012 %>%
    with(data_store(dialogue)) %>%
    hierarchical_cluster() %>%
    assign_cluster(k = 7)

## Write Cluster Text for Human Categorization
write_cluster_text(ca)
write_cluster_text(ca, n.sample=10)
write_cluster_text(ca, lead=" -", n.sample=10)

## Read Human Coded Categories Back In
categories_file <- system.file("additional/foo_turk.txt", package = "clustext")
readLines(categories_file)
(categories_key <- read_cluster_text(categories_file))

## Add Categories Back to Original Data Set
categorize(
    data = presidential_debates_2012,
    assign.cluster = ca,
    cluster.key = categories_key
)
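The coding convention in the Description (a category written after Cluster #:, or =N to reuse the category of cluster N) can be illustrated by parsing a few hand-written lines in base R. The file layout below is an assumption based only on that description, and the parsing code is a hypothetical sketch, not the implementation of read_cluster_text.

```r
# Hypothetical coded file contents; the real layout may differ.
coded <- c(
    "Cluster 1: economy",
    " * we need to create jobs",
    "Cluster 2: energy",
    " * drill for more oil",
    "Cluster 3: =1"
)

# Keep only the header lines that carry a cluster number and a category
header <- grep("^Cluster [0-9]+:", coded, value = TRUE)

key <- data.frame(
    cluster = as.integer(sub("^Cluster ([0-9]+):.*$", "\\1", header)),
    category = trimws(sub("^Cluster [0-9]+:", "", header)),
    stringsAsFactors = FALSE
)

# Resolve "=N" entries to the category of cluster N (one level only)
is_ref <- grepl("^=", key$category)
target <- as.integer(sub("^=", "", key$category[is_ref]))
key$category[is_ref] <- key$category[match(target, key$cluster)]
key
```

The resulting key has one row per cluster, which is the shape that a merge back onto cluster assignments needs.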
