Text Mining in R: Unique Terms Per Document

Code snippets on how to use the R tm package to count terms once per document.

I’ve been doing a bit of text mining over the weekend using the R tm package, and I wanted to count each term only once per document, which isn’t how it works out of the box.

For example, let’s say we’re writing a bit of code to calculate the frequency of terms across some documents. We might write the following code:

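library(tm)
# Two short documents, each repeating the same words twice
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
# wordLengths = c(1, Inf) keeps short terms such as "i" and "am"
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))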

> tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2

> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    2     2     2     2     2     2

We’ve created a small corpus over a vector which contains two bits of text. The TermDocumentMatrix shows how often each term appears in each document, and rowSums totals those counts across the corpus. I had to lower the default minimum word length of three to make sure we could see ‘am’, ‘i’ and ‘is’.
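To make the counting concrete, here is a rough base R sketch of what the matrix above is tallying. It does not use tm at all: splitting on single spaces and lowercasing are simplifying assumptions that happen to match this toy input, and the names tokens, terms and counts are made up for illustration.

```r
text <- c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")

# Lowercase and split each document on single spaces (a simplification of
# what a real tokenizer does)
tokens <- strsplit(tolower(text), " ", fixed = TRUE)

# Vocabulary across both documents, in alphabetical order
terms <- sort(unique(unlist(tokens)))

# One column per document, one row per term, cells are raw occurrence counts
counts <- vapply(
  tokens,
  function(doc) as.integer(table(factor(doc, levels = terms))),
  integer(length(terms))
)
rownames(counts) <- terms
colnames(counts) <- as.character(seq_along(text))

rowSums(counts)             # every term totals 2, as in the tm output above
terms[nchar(terms) >= 3]    # "cool" "mark" "neo4j"
```

The last line shows which terms would survive a minimum length of three, which is why the short words only show up once that minimum is lowered.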

But we’ve actually got some duplicate terms in each of our documents, so we want to get rid of those and only count unique terms per document.

We can achieve that by mapping over the corpus using the tm_map function and then applying a function which returns unique terms. I wrote the following function:

# Collapse a document to its distinct space-separated words, in first-seen order
uniqueWords = function(d) {
  return(paste(unique(strsplit(d, " ")[[1]]), collapse = ' '))
}
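It can help to try the function on plain strings before wiring it into the corpus. The sketch below repeats the definition so it runs on its own, then shows one caveat: unique() is case-sensitive and the split is on single spaces, so mixed case or doubled spaces slip through. uniqueWordsNormalized is a hypothetical variant, not part of tm, that lowercases and splits on runs of whitespace; it still does nothing about punctuation.

```r
uniqueWords <- function(d) {
  paste(unique(strsplit(d, " ")[[1]]), collapse = " ")
}

uniqueWords("I am Mark I am Mark")          # "I am Mark"

# Case differences and a doubled space defeat the de-duplication
uniqueWords("Mark mark  MARK")              # "Mark mark  MARK"

# Hypothetical variant: normalise case and whitespace before de-duplicating
uniqueWordsNormalized <- function(d) {
  words <- strsplit(tolower(trimws(d)), "[[:space:]]+")[[1]]
  paste(unique(words), collapse = " ")
}

uniqueWordsNormalized("Mark mark  MARK")    # "mark"
```

Whether the extra normalisation matters depends on the input. For the two example documents in this post both versions remove the same duplicates.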

We can then apply the function like so:

corpus = tm_map(corpus, content_transformer(uniqueWords))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))

> tdm
       Docs
Terms   1 2
  am    1 0
  cool  0 1
  i     1 0
  is    0 1
  mark  1 0
  neo4j 0 1

> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    1     1     1     1     1     1

And now each term is only counted once. Success!
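If all you need is the number of documents each term appears in, another option is to leave the corpus alone and threshold the original count matrix. This is a sketch of that alternative, not what the code above does. To stay self-contained it rebuilds the first matrix from this post by hand instead of calling tm, and docFreq is a made-up name.

```r
# The original (non-deduplicated) counts from earlier in the post, typed in by hand
tdm <- matrix(
  c(2, 0,
    0, 2,
    2, 0,
    0, 2,
    2, 0,
    0, 2),
  ncol = 2, byrow = TRUE,
  dimnames = list(Terms = c("am", "cool", "i", "is", "mark", "neo4j"),
                  Docs  = c("1", "2"))
)

# tdm > 0 is TRUE wherever a term occurs at least once in a document;
# rowSums then counts how many documents contain each term
docFreq <- rowSums(tdm > 0)
docFreq   # every term gets 1
```

tm also documents a binary weighting function, weightBin, which may give the same result from inside TermDocumentMatrix. I haven't tried it here, so check the package documentation before relying on it.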

Topics:
r programming

Published at DZone with permission of
