Linguistic Data Mining (text analysis) Methods

by René Pickhardt · Apr. 13, 12 · Interview

Over the weekend I met some students studying linguistics. Methods from linguistics are very important for text retrieval and data mining, which is why, in my opinion, linguistics is also a very important part of web science. I am always concerned that most people doing web science are actually computer scientists, and that much of the potential of web science is lost by not paying attention to all the disciplines that could contribute to it!

That is why I tried to teach the linguists some basic Python in order to do some basic analysis of literature. The following script, which is hacked together rather than beautiful code, can be used to analyse texts by different authors. It displays the following statistics:

  • Count how many words are in the text
  • Count how many sentences are in the text
  • Calculate the average number of words per sentence
  • Count how many different words are in the text
  • Count how many times each word appears
  • Count how many words appear only once, twice, three times, and so on…
  • Display the longest sentence in the text
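
The least obvious item in this list is probably the count of how many words appear once, twice, three times, and so on. A minimal sketch of that idea in Python 3, using `collections.Counter` rather than the plain dictionaries of the script further down (the function name `frequency_of_frequencies` and the sample sentence are made up for illustration):

```python
from collections import Counter

def frequency_of_frequencies(text):
    """Map each occurrence count to the number of distinct words with that count."""
    words = text.lower().split()
    word_counts = Counter(words)            # word -> number of occurrences
    return Counter(word_counts.values())    # occurrences -> number of words

sample = "the cat saw the dog and the dog saw a bird"
for times, n_words in sorted(frequency_of_frequencies(sample).items()):
    print(n_words, "words appear", times, "times")
```

In the sample, four words appear once, two words appear twice, and one word ("the") appears three times.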


You could probably ask even more interesting questions, analyze texts from different centuries and languages, and do a lot of interesting stuff! I am a computer scientist / mathematician, so I don’t know what questions to ask. So if you are a linguist, feel free to give me feedback and suggest some more interesting questions (-:

Some statistics I calculated

ulysses
264965 words in 27771 sentences
==> 9.54 words per sentence

30086 different words
==> every word was used 8.82 times on average

faust1
30632 words in 4178 sentences
==> 7.33 words per sentence

6337 different words
==> every word was used 4.83 times on average

faust2
44534 words in 5600 sentences
==> 7.95 words per sentence

10180 different words
==> every word was used 4.39 times on average
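
The words-per-sentence figures follow directly from the counts above, which the short Python 3 check below reproduces. The per-word averages come from a separate function in the script, which removes more punctuation (including hyphens and underscores) before splitting, so its word total can differ from the one printed here. That this explains the small differences is an assumption, not something verified against the texts. The helper name `words_per_sentence` and the sample phrase are illustrative only.

```python
# Reported counts from the statistics above: (words, sentences)
reported = {
    "ulysses": (264965, 27771),
    "faust1": (30632, 4178),
    "faust2": (44534, 5600),
}

def words_per_sentence(words, sentences):
    """Average sentence length, rounded to two decimals as in the article."""
    return round(words / sentences, 2)

for name, (words, sentences) in reported.items():
    print(name, words_per_sentence(words, sentences))

# The same phrase yields different word counts depending on whether
# hyphens are replaced by spaces before splitting.
sample = "a well-known poet"
print(len(sample.split()))                    # hyphenated word stays whole
print(len(sample.replace("-", " ").split()))  # hyphenated word is split in two
```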

Disclaimer

I know that this is not yet a tutorial and that I don’t explain the code very well. To be honest, I don’t explain the code at all. This is sad. When I tried to teach Python to the linguists, I started the way you would always start: “This is a loop and that is a list. Now let’s loop over the list and display the items…” There wasn’t much motivation left. The script below was created after I realized that coding should not be taught abstractly and that an interesting example has to be used.

If people are interested (please tell me in the comments!), I will consider creating a Python tutorial for linguists that starts right away with small scripts doing useful stuff.

By the way, you can download the texts that I used for the analysis here:

  • ulysses
  • faust 1
  • faust 2

    # This code is licensed under a Creative Commons licence as long as you
    # cite the author: Rene Pickhardt / www.rene-pickhardt.de
     
    # Pads a numeric string with leading zeros to a width of 5 so that
    # sorting the result strings alphabetically also sorts them numerically.
    def makeSortable(w):
        return w.zfill(5)
     
    # Replaces every substring listed in l with the string new in the text s.
    def removeDelimiter(s, new, l):
        for c in l:
            s = s.replace(c, new)
        return s
     
    # Prints every word with its frequency (sorted by frequency), then prints
    # how many words appear once, twice, three times, and so on.
    def analyzeWords(s):
        s = removeDelimiter(s, " ", [".", ",", ";", "_", "-", ":", "!", "?", '"', ")", "("])
        wordlist = s.split()
     
        # word -> number of occurrences
        dictionary = {}
        for word in wordlist:
            dictionary[word] = dictionary.get(word, 0) + 1
     
        l = [makeSortable(str(dictionary[k])) + " # " + k for k in dictionary]
     
        for w in sorted(l):
            print(w)
     
        # number of occurrences -> number of words with that many occurrences
        count = {}
        for k in dictionary:
            count[dictionary[k]] = count.get(dictionary[k], 0) + 1
        for k in sorted(count):
            print(str(count[k]) + " words appear " + str(k) + " times")
     
    # Prints the number of different words and the average number of times
    # each word is used; returns the number of different words.
    def differentWords(s):
        s = removeDelimiter(s, " ", [".", ",", ";", "_", "-", ":", "!", "?", '"', ")", "("])
        wordlist = s.split()
        count = len(set(wordlist))
        print(str(count) + " different words")
        print("every word was used " + str(float(len(wordlist)) / float(count)) + " times on average")
        return count
     
     
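    # Prints the word count, sentence count, average sentence length and the
    # longest sentence. Empty fragments (for example after "..." or after the
    # final full stop) are counted as sentences too.
    def analyzeSentences(s):
        s = removeDelimiter(s, ".", [";", ":", "!", "?"])
        sentenceList = s.split(".")
        wordList = s.split()
        wordCount = len(wordList)
        sentenceCount = len(sentenceList)
        print(str(wordCount) + " words in " + str(sentenceCount) + " sentences ==> " + str(float(wordCount) / float(sentenceCount)) + " words per sentence")
     
        # satz (German for "sentence") holds the longest sentence by character count
        satz = max(sentenceList, key=len)
        print(satz + " length " + str(len(satz)))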
     
    texts = ["ulysses.txt", "faust1.txt", "faust2.txt"]
    for text in texts:
        print(text)
        # datei is German for "file"; the with block closes it automatically
        with open(text, 'r') as datei:
            s = datei.read().lower()
        analyzeSentences(s)
        differentWords(s)
        analyzeWords(s)
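
The `makeSortable` helper in the script exists because the counts are sorted as strings, and strings compare character by character. A small Python 3 sketch of the problem and two ways around it (the list `counts` is made-up sample data):

```python
counts = ["9", "10", "2", "123"]

# Plain string sorting compares character by character, so "10" sorts before "2".
print(sorted(counts))

# Zero-padding to a fixed width makes string order agree with numeric order.
padded = [c.zfill(5) for c in counts]
print(sorted(padded))

# Alternatively, keep the strings as they are and sort by their integer value.
print(sorted(counts, key=int))
```

Padding has the advantage that the sorted lines can be written to a file and re-sorted later with any text tool. Sorting with `key=int` avoids the fixed width of five digits.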
    
    

If you save the full script above as getstats.py on a Linux machine, you can redirect its output into a file to work on next:

    python getstats.py > out.txt
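
One thing to keep in mind when reading the output: `analyzeSentences` splits the text on every delimiter, so an ellipsis or a trailing full stop produces empty fragments that are counted as sentences. A hedged Python 3 sketch of one possible alternative, which would not reproduce the numbers reported above (the function name `split_sentences` and the sample text are made up for illustration):

```python
import re

def split_sentences(text):
    """Split on . ; : ! ? and drop fragments that are empty or only whitespace."""
    fragments = re.split(r"[.;:!?]", text)
    return [f.strip() for f in fragments if f.strip()]

sample = "Who is there? I am... Stay; the night is long and cold."
sentences = split_sentences(sample)

print(len(re.split(r"[.;:!?]", sample)))  # fragments including empty ones
print(len(sentences))                     # non-empty sentences only
print(max(sentences, key=len))            # longest sentence by character count
```

This still treats semicolons and colons as sentence boundaries, as the script does. Whether that is appropriate is a linguistic question rather than a programming one.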

Published at DZone with permission of René Pickhardt, DZone MVB. See the original article here.
