Simple Links Extractor Using Python
Those who have already registered for the Search Engine class taught by Sebastian Thrun and David Evans (both professors) should be familiar with this title. Last week, we finished the first homework, which covered basic Python concepts while introducing, in a simple way, how search engines work. We learned about string and integer manipulation, how to find certain keywords within a string, and how to extract specific text based on a pattern.
This is a fundamental concept behind how web crawlers and search engine indexers work. Building on the first lecture, I’ve been experimenting a little further to create a links extractor. Since the first lecture didn’t explain how to crawl links more than one level deep, the extractor I’ve finished has no depth at all. I’ll update it with more features as the lectures go deeper. I will also put the source code on GitHub in case anybody is interested in modifying the code to fit the current lectures or for any other purpose.
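To make the string operations mentioned above concrete, here is a minimal sketch of finding a keyword and slicing out the text that follows it. The sample string and variable names are hypothetical and not taken from the course material; the only library calls used are Python's built-in `str.find` and slicing.

```python
# Hypothetical sample text; not taken from the course material.
page = 'Visit <a href="http://example.com">Example</a> today.'

# str.find returns the index of the first match, or -1 if there is none.
href_pos = page.find('href="')

# Skip past href=" to reach the first character of the URL.
url_start = href_pos + len('href="')

# The URL ends at the next double quote.
url_end = page.find('"', url_start)

url = page[url_start:url_end]
print(url)  # http://example.com
```

Because `str.find` returns -1 when nothing matches, real code should check for that value before slicing.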
So, without further ado, here is the code:
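import sys


def linksExtractor(text):
    """Print the description and URL of every http link found in text."""
    start = 0
    status = 1
    while status != -1:
        # Look for the next opening anchor tag.
        status = text.find("<a", start)
        if status != -1:
            start = status
            end = text.find("</a>", start)
            # The URL sits between the double quotes that follow href=.
            hrefPos = text.find("href=", start)
            linkPos = hrefPos + 6
            linkEndPos = text.find("\"", linkPos)
            linkStr = text[linkPos:linkEndPos]
            # The description sits between the opening tag and </a>.
            descPos = text.find(">", start) + 1
            descEndPos = end
            descStr = text[descPos:descEndPos]
            # Only report links that contain "http".
            if linkStr.find("http") != -1:
                print(descStr)
                print(linkStr + "\n")
            # Continue searching after the current anchor.
            start = end


try:
    fileSrc = open("sample.html", "r")
except IOError:
    print("File could not be opened", file=sys.stderr)
    sys.exit(1)

# Each line is scanned on its own, so an anchor must fit on a single line.
lines = fileSrc.readlines()
for line in lines:
    linksExtractor(line)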
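The extractor above relies on `str.find`, so it assumes double-quoted `href` values and anchors that fit on one line. As a hedged alternative, and not the approach taught in the class, the same idea can be sketched with `HTMLParser` from Python 3's standard library. The class name `LinkCollector` and the sample input are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (description, url) pairs for links that start with http."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (description, url) pairs
        self._href = None   # URL of the anchor currently open, if any
        self._text = []     # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Keep only absolute links, similar in spirit to the http check above.
            if href and href.startswith("http"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only record text while inside an anchor we decided to keep.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


# Hypothetical input, standing in for the contents of sample.html.
sample = '<p><a href="http://example.com">Example</a> and <a href=\'/local\'>local</a></p>'

collector = LinkCollector()
collector.feed(sample)
print(collector.links)  # [('Example', 'http://example.com')]
```

The parser handles single-quoted attributes and anchors that span several lines, at the cost of being a little longer than the `str.find` version.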
I’m really excited about the direction in which this class is moving. Challenges ahead.
Published at DZone with permission of Kristiono Setyadi. See the original article here.