Simple Links Extractor Using Python

Anyone who has registered for the Search Engine class taught by Sebastian Thrun and David Evans (both professors) should recognize this title. Last week we finished the first homework, which covered basic Python concepts while showing, in a simple way, how search engines work. We learned string and integer manipulation, how to find a keyword within a string, and how to extract a specific piece of text based on a pattern.
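
As a small illustration of those string operations, here is a minimal sketch, assuming a made-up one-line page snippet (the `page` string and the example.com URL are hypothetical, not taken from the class material). It uses `str.find` to locate a keyword and slicing to pull out the text between two quotes:

```python
# Hypothetical page snippet used only for illustration.
page = 'Visit <a href="http://example.com">Example</a> today'

# str.find returns the index of the first match, or -1 when there is none.
start = page.find('href="')
missing = page.find("<img")

# Slice out everything between the opening and the closing quote.
url_start = start + len('href="')
url_end = page.find('"', url_start)
url = page[url_start:url_end]

print(missing)  # -1, because the snippet contains no <img tag
print(url)      # http://example.com
```

Checking for `-1` before slicing matters: a slice that ends at `-1` silently drops the last character instead of raising an error.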

These are the fundamental ideas behind how web crawlers and search engine indexers work. Building on the first lecture, I experimented a little further and wrote a links extractor. Since the first lecture didn’t explain how to crawl links more than one level deep, the extractor has no depth at all: it only reports the links found in a single page. I’ll add more features as the lectures go deeper. I will also put the source code on GitHub in case anybody is interested in modifying it to fit the current lectures or for any other purpose.

So, without further ado, here is the code:

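import sys

def linksExtractor(text):
	# Print the description and URL of every absolute link found in text.
	start = 0

	while True:
		# The trailing space avoids matching tags such as <abbr> or <article>.
		start = text.find("<a ", start)
		if start == -1:
			break

		tagEnd = text.find(">", start)
		end = text.find("</a>", start)
		if tagEnd == -1 or end == -1:
			# Unterminated anchor: nothing more can be extracted safely.
			break

		# Search for href only inside the opening tag, not in later markup.
		hrefPos = text.find("href=\"", start, tagEnd)
		if hrefPos != -1:
			linkPos = hrefPos + len("href=\"")
			linkEndPos = text.find("\"", linkPos, tagEnd)
			if linkEndPos != -1:
				linkStr = text[linkPos:linkEndPos]
				descStr = text[tagEnd + 1:end]

				# Report absolute links only; relative links are skipped.
				if linkStr.startswith("http"):
					print(descStr)
					print(linkStr + "\n")

		start = end


try:
	fileSrc = open("sample.html", "r")
except IOError:
	print("File could not be opened", file=sys.stderr)
	sys.exit(1)

# Read the whole file at once so anchors that span several lines are still found.
with fileSrc:
	linksExtractor(fileSrc.read())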
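
The string-scanning approach above is easy to follow, but it depends on the exact quoting and layout of each anchor tag. As a possible alternative (not what the class taught, and not the author's method), here is a minimal sketch using Python 3's standard `html.parser` module. The names `LinkCollector` and `extract_links` and the `sample` markup are hypothetical, introduced only for this example. It returns the links as a list instead of printing them, which makes the result easier to reuse:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (description, url) pairs for absolute links."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (description, url) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Keep only absolute http(s) links, in the spirit of the script's http check.
            if href and href.startswith("http"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Collect text only while inside an anchor we decided to keep.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (description, url) tuples found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup: single quotes, a relative link, and nested tags.
sample = """<p><a href='http://example.com'>Example</a>
<a class="x" href="/relative">skipped</a>
<a href="https://example.org/docs">Docs <b>page</b></a></p>"""

for description, url in extract_links(sample):
    print(description)
    print(url + "\n")
```

Because the parser tokenizes the tags itself, single-quoted attributes, extra attributes before `href`, and markup nested inside the link text are handled without additional string arithmetic.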


I’m really excited about the direction this class is taking. There are challenges ahead.

 


Published at DZone with permission of
