DZone
Big Data Zone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
  • Refcardz
  • Trend Reports
  • Webinars
  • Zones
  • |
    • Agile
    • AI
    • Big Data
    • Cloud
    • Database
    • DevOps
    • Integration
    • IoT
    • Java
    • Microservices
    • Open Source
    • Performance
    • Security
    • Web Dev
DZone > Big Data Zone > Text Mining in R: Unique Terms Per Document

Text Mining in R: Unique Terms Per Document

Code snippets on how to use the R tm package to count terms once per document.

Mark Needham user avatar by
Mark Needham
·
Apr. 30, 16 · Big Data Zone · Tutorial
Like (3)
Save
Tweet
7.21K Views

Join the DZone community and get the full member experience.

Join For Free

I’ve been doing a bit of text mining over the weekend using the R tm package and I wanted to only count a term once per document, which isn’t how it works out the box.

For example, let’s say we’re writing a bit of code to calculate the frequency of terms across some documents. We might write the following code:

library(tm)
text = c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus = VCorpus(VectorSource(text))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))

> tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2

> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    2     2     2     2     2     2

We’ve created a small corpus over a vector which contains two bits of text. On the last line we output a TermDocumentMatrix, which shows how frequently each term shows up across the corpus. I had to tweak the default word length of three to make sure we could see ‘am’ and ‘cool’.

But we’ve actually got some duplicate terms in each of our documents, so we want to get rid of those and only count unique terms per document.

We can achieve that by mapping over the corpus using the tm_map function and then applying a function which returns unique terms. I wrote the following function:

uniqueWords = function(d) {
  return(paste(unique(strsplit(d, " ")[[1]]), collapse = ' '))
}

We can then apply the function like so:

corpus = tm_map(corpus, content_transformer(uniqueWords))
tdm = as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf))))

> tdm
       Docs
Terms   1 2
  am    1 0
  cool  0 1
  i     1 0
  is    0 1
  mark  1 0
  neo4j 0 1

> rowSums(tdm)
   am  cool     i    is  mark neo4j 
    1     1     1     1     1     1

And now each term is only counted once. Success!

R (programming language) Text mining Document Mining (military)

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Testing Schema Registry: Spring Boot and Apache Kafka With JSON Schema
  • Toying With Kotlin’s Context Receivers
  • What Is Cloud-Native Architecture?
  • A Guide to Understanding Vue Lifecycle Hooks

Comments

Big Data Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo