
Simple Links Extractor Using Python



Those who have already registered for the Search Engine class by Sebastian Thrun and David Evans (both are professors) should be familiar with this title. Last week, we finished the first homework, which covered basic Python concepts while showing, in a simple way, how search engines work. We learned about string and integer manipulation, how to find certain keywords within a string, and how to extract specific text based on a pattern.
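
To make the "find a keyword, then extract the text around it" idea concrete, here is a minimal sketch using only `str.find` and slicing. The helper name `extract_between` and the sample string are made up for illustration and do not come from the class material.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None when either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    # Move past the marker itself so the slice begins at the wanted text.
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Hypothetical sample input, not taken from the lectures.
page = '<p>See <a href="http://example.com/about">About</a> for details.</p>'
print(extract_between(page, 'href="', '"'))
```

The key detail is that `find` returns -1 when nothing matches, so every result has to be checked before it is used as a slice index.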

This is a fundamental concept behind how web crawlers and search engine indexers work. I’ve been experimenting a bit further, based on the first lecture, to create a links extractor. Since the first lecture didn’t explain how to crawl links more than one level deep, the extractor I’ve finished has no depth at all. I’ll update it with more features as the lectures go deeper. I will also put the source code on GitHub in case anybody is interested in modifying the code to fit the current lectures or any other purpose.

So, without further ado, here is the code:

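# Requires Python 3.
import sys

def linksExtractor(text):
	# Print the description and URL of every absolute link found in text.
	start = 0

	while True:
		start = text.find("<a", start)
		if start == -1:
			break

		# Skip tags that merely begin with "a", such as <abbr> or <article>.
		if not text[start + 2:start + 3].isspace():
			start += 2
			continue

		tagEnd = text.find(">", start)
		end = text.find("</a>", start)
		if tagEnd == -1 or end == -1:
			# Unterminated anchor: stop instead of slicing with -1.
			break

		# Search for href only inside the opening tag, so that a later
		# anchor's href is never attributed to this one.
		hrefPos = text.find("href=\"", start, tagEnd)
		if hrefPos != -1:
			linkPos = hrefPos + len("href=\"")
			linkEndPos = text.find("\"", linkPos, tagEnd)
			if linkEndPos != -1:
				linkStr = text[linkPos:linkEndPos]
				descStr = text[tagEnd + 1:end]
				# Only report absolute links (http or https).
				if linkStr.startswith("http"):
					print(descStr)
					print(linkStr + "\n")

		start = end


try:
	# Read the whole file at once so anchors spanning several lines are found.
	with open("sample.html", "r") as fileSrc:
		text = fileSrc.read()
except IOError:
	print("File could not be opened", file=sys.stderr)
	sys.exit(1)

linksExtractor(text)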
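
The string-search approach above assumes double-quoted `href` values and simple markup. As a point of comparison, here is a hedged sketch of the same extraction built on Python 3's standard-library `html.parser.HTMLParser`. It is a different technique from the one taught in the first lecture, and the names `LinkCollector` and `extract_links` are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (description, url) pairs for absolute links."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Mirror the original filter: keep only http/https links.
            if self._href.startswith("http"):
                description = "".join(self._text).strip()
                self.links.append((description, self._href))
            self._href = None


def extract_links(html):
    """Return a list of (description, url) tuples found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(open("sample.html").read())` would return the pairs as a list rather than printing them, which makes the result easier to reuse once crawling goes more than one level deep.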


I’m really excited about the direction in which this class is moving. There are challenges ahead.

 


Published at DZone with permission of Kristiono Setyadi. See the original article here.

Opinions expressed by DZone contributors are their own.
