
Simple Links Extractor Using Python


Those who have already registered for the search engine class taught by Sebastian Thrun and David Evans (both professors) should find this title familiar. Last week we finished the first homework, which covered basic Python concepts while showing, in a simple way, how search engines work. We learned about string and integer manipulation, how to find a keyword within a string, and how to extract a specific piece of text based on a pattern.
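
As a small illustration of the find-and-extract idea described above, here is a minimal sketch. The sample string and variable names are made up for this example and are not taken from the class material.

```python
# Hypothetical sample text; any string containing a quoted URL would do.
page = 'Visit <a href="http://example.com/">Example</a> today.'

# str.find returns the index of the first match, or -1 when there is none.
marker = 'href="'
marker_pos = page.find(marker)

# The URL starts right after the marker and runs up to the next double quote.
url_start = marker_pos + len(marker)
url_end = page.find('"', url_start)
url = page[url_start:url_end]

print(url)
```

The same two steps, locating a marker with `find` and slicing between two positions, are all the extractor further down relies on.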

This is the fundamental concept behind how web crawlers and search engine indexers work. Building on the first lecture, I experimented a little further and wrote a links extractor. Since the first lecture didn’t explain how to crawl links more than one level deep, the extractor has no depth at all: it only lists the links found in a single page. I’ll add more features as the lectures go deeper. I will also put the source code on GitHub in case anybody is interested in modifying the code to fit the current lectures or for any other purpose.

So, without further ado, here is the code:

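import sys

def linksExtractor(text):
	# Print the description and URL of every absolute link found in text.
	start = 0

	while True:
		# Find the next opening anchor tag
		start = text.find("<a", start)
		if start == -1:
			break

		# Stop on an unclosed anchor instead of reading past the tag
		end = text.find("</a>", start)
		if end == -1:
			break

		# Look for href=" and its closing quote only inside this anchor
		hrefPos = text.find("href=\"", start, end)
		linkPos = hrefPos + 6
		linkEndPos = text.find("\"", linkPos, end)

		if hrefPos != -1 and linkEndPos != -1:
			linkStr = text[linkPos:linkEndPos]
			# The description sits between the opening tag's ">" and "</a>"
			descPos = text.find(">", start) + 1
			descStr = text[descPos:end]

			# Skip relative links; only report those containing "http"
			if "http" in linkStr:
				print(descStr)
				print(linkStr + "\n")

		start = end


try:
	fileSrc = open("sample.html", "r")
except IOError:
	print("File could not be opened", file=sys.stderr)
	sys.exit(1)

# Read all lines, then close the file
with fileSrc:
	lines = fileSrc.readlines()

# The extractor works line by line, so an anchor split across lines is missed
for line in lines:
	linksExtractor(line)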
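
For comparison, the same task can be sketched with the standard library's `html.parser` module instead of manual `find` calls. This is not the approach from the class, only an alternative under stated assumptions: the names `LinkCollector` and `extract_links` and the sample markup are hypothetical, and the filter keeps links whose `href` starts with `http`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (description, url) pairs for anchors whose href starts with http."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (description, url) pairs
        self._href = None    # URL of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a tracked anchor
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    # Hypothetical helper: feed the whole document at once and return the pairs
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample: an absolute link, a relative link, and an anchor split over two lines
sample = (
    '<p><a href="http://example.com/">Example</a> and '
    "<a href='/about'>About</a> and "
    '<a\n href="https://example.org/docs"><b>Docs</b> page</a></p>'
)

for description, url in extract_links(sample):
    print(description)
    print(url + "\n")
```

Because the parser sees the whole document rather than one line at a time, it also handles single-quoted attributes and anchors that span several lines. Unlike the extractor above, it keeps only the text of the description and drops any nested markup.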


I’m really excited about the direction this class is taking. There are challenges ahead.

 


Published at DZone with permission of Kristiono Setyadi. See the original article here.

Opinions expressed by DZone contributors are their own.
