Embedding JGit: A First Look
A couple of years ago I needed a spider to do some excavation of data
for an analytics project. I found a project called JSpider that seemed
great, and hey, the open source credo is 'use what's there, don't
recreate the wheel.' Well, that didn't turn out so well: the thing was a
total hassle. I downloaded the source and to my shock and horror found
that it was pre-Java 5. So I wrote my own spider. Since then, I have had
many occasions to consider the question: what does a spider actually do?
It has to extract links from pages to keep going, so it has its own
parsing needs, but what is its real job? Recently, I have been working
on a scraper, knowing that I would marry it to my spider when I was
done, because my conclusion is that in
general, spiders should focus on the discovery portion of the problem:
following threads and unearthing the underlying topological logic, and
the work of actually getting things from the page should be done
elsewhere, by either scrapers (if we have specific items we mean to
remove from the tangled web catacombs), or simple indexers if we are
going to just expose our findings to search. I will probably blog some
more about the architecture of spidering later; it's an interesting case
of designing for clear distinctions of responsibility that can be
extended.
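To make that split concrete, here is a minimal sketch of a spider that only discovers pages and hands each one to pluggable handlers (a scraper, an indexer, whatever). The names `PageHandler` and `DiscoverySpider` are hypothetical, the fetcher is just a function from URL to HTML so the example stays self-contained, and the regex link extraction is a stand-in for a real HTML parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical hook for whatever consumes a page: a scraper, an indexer, etc. */
interface PageHandler {
    void handle(String url, String html);
}

/** Hypothetical spider that only does discovery; extraction is delegated to handlers. */
class DiscoverySpider {
    // Naive link extraction, good enough for a sketch.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    private final Function<String, String> fetcher; // url -> html, or null if unavailable
    private final List<PageHandler> handlers = new ArrayList<>();

    DiscoverySpider(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    void addHandler(PageHandler handler) {
        handlers.add(handler);
    }

    /** Breadth-first crawl from startUrl; returns every URL attempted, in order. */
    Set<String> crawl(String startUrl) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!seen.add(url)) {
                continue; // already on the seen list
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // nothing to hand over
            }
            for (PageHandler handler : handlers) {
                handler.handle(url, html);
            }
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                frontier.add(m.group(1));
            }
        }
        return seen;
    }
}
```

A scraper or indexer then just implements `PageHandler` and registers itself with `addHandler`; the spider never needs to know what is being pulled off the page.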
Meantime, one day when thinking about my spider some time ago, the idea
dawned on me that perhaps a missing piece in the spider landscape is
versioning. Conceptually speaking, an interesting question is how can a
spider purport to discover things if it doesn't know what it's seen
before? In fact, in implementing spiders, a seen list is a must. This is
kind of taking it to the next diachronic dimension: there's really no
reason to reindex or rescrape the page if nothing has changed on it.
At first I was thinking about just implementing something like a
checksum, but then I thought, that's pretty stupid. I also thought that
there could be real value in maintaining diffs and history for all the
pages on the site.
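For reference, the checksum idea mentioned above might look something like the following minimal sketch. It can say that a page changed since the last visit, but not what changed, which is exactly the gap that diffs and history fill. `PageFingerprints` and `hasChanged` are hypothetical names, and SHA-256 is just one reasonable choice of digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical seen list that remembers one content digest per URL. */
class PageFingerprints {
    private final Map<String, String> lastDigestByUrl = new HashMap<>();

    /** Returns true if the page is new or its content differs from the last visit. */
    boolean hasChanged(String url, String html) {
        String digest = sha256(html);
        String previous = lastDigestByUrl.put(url, digest);
        return !digest.equals(previous);
    }

    private static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on standard Java platforms.
            throw new IllegalStateException(e);
        }
    }
}
```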
So I went and got JGit.
The acquisition part was fairly smooth, though the downloads page makes
it seem like they have a Maven repository when, in fact, they don't.
(That was a good way to start my open source adoption journey: it had
kind of the feeling of Vor dem Gesetz to it: you know the thing exists,
and it's there, but you are sent to a door that doesn't exist.) Then my
trusty friend Nexus intervened and it turned out that the JGit source
was in the JBoss public repo. The version numbers were different, and
long and ugly, but hey, the dependency inclusion went pretty quickly and
I was able to import the classes.
As is often the case with libraries like this, there is a low-level
version of the API, and then a higher-level one (named porcelain). I wanted to
write a unit test that would show that I could create a repository,
check a file in, and then get a log that shows that these events really
happened.
One of the funniest discoveries while working on this piece was the
@Rule annotation in *JUnit*. First off, it is incredibly stupidly
named, because it really does not make you think of a way to remove a
temporary file. But also, how stupid is it that you can't tell it not to
delete the temp folder, so you can get your code working, inspect what
you have, then take off the delete=false or whatever, and have the test
pass?? So I ended up having to write these tests twice: once to a folder
off the project root where I could see what it was doing, and then
again to the temporary folders. The calls are not that different, but
there are enough differences that you can't run the same code on both.
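One possible way to avoid writing everything twice, sketched here with plain `java.nio` rather than any JUnit feature: a scratch folder that only cleans up after itself when told to. `ScratchFolder` and its `deleteOnClose` flag are hypothetical, not part of JUnit or JGit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper: a temporary directory that is deleted only when asked. */
class ScratchFolder implements AutoCloseable {
    private final Path root;
    private final boolean deleteOnClose;

    ScratchFolder(boolean deleteOnClose) throws IOException {
        this.root = Files.createTempDirectory("repo-test");
        this.deleteOnClose = deleteOnClose;
    }

    Path root() {
        return root;
    }

    @Override
    public void close() throws IOException {
        if (!deleteOnClose) {
            // Leave everything in place so it can be inspected by hand.
            System.out.println("Keeping " + root + " for inspection");
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

While getting the code working you would construct it with `false`, poke around in the folder it prints, and then flip the flag to `true` once the test passes; the test body itself stays the same.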
Overall, this was pretty painless. Now that there is zero question about
the consensus winner in the repository space, using git for all kinds of
other things (oh, and btw, through a pure Java implementation of it) is
a no-brainer.
    package com.ontometrics.spider.repository;

    import static org.hamcrest.MatcherAssert.assertThat;
    import static org.hamcrest.Matchers.is;
    import static org.hamcrest.Matchers.notNullValue;

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.eclipse.jgit.api.Git;
    import org.eclipse.jgit.lib.Repository;
    import org.eclipse.jgit.revwalk.RevCommit;
    import org.eclipse.jgit.storage.file.FileRepositoryBuilder;
    import org.junit.Before;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TemporaryFolder;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class RepositoryTest {

        private static final Logger log = LoggerFactory.getLogger(RepositoryTest.class);

        @Rule
        public TemporaryFolder fileFolder = new TemporaryFolder();

        private File repositoryFolder;

        @Before
        public void setup() throws IOException {
            // Repository metadata lives in .git; the temp folder root is the work tree.
            repositoryFolder = fileFolder.newFolder(".git");
        }

        @Test
        public void canCreateNewRepository() throws Exception {
            Repository repository = new FileRepositoryBuilder().setGitDir(repositoryFolder).build();
            log.info("dir: {}", repository.getDirectory());
            // create() initializes the repository on disk; no separate Git.init() call is needed.
            repository.create();
            Git git = new Git(repository);

            // Write the page into the work tree (the parent of .git).
            File exampleHtml = new File(fileFolder.getRoot(), "examplePage.html");
            try (FileWriter out = new FileWriter(exampleHtml)) {
                out.write("<html>");
                out.write("<table>");
                out.write("</table>");
                out.write("</html>");
            }

            // Stage everything in the work tree and commit it.
            git.add().addFilepattern(".").call();
            git.commit().setMessage("Simple html file.").call();

            // Walk the history and log each commit message.
            Iterable<RevCommit> logs = git.log().call();
            for (RevCommit commit : logs) {
                log.info(commit.getFullMessage());
            }

            assertThat(repository, is(notNullValue()));
            repository.close();
        }
    }
The main thing to note is that you want the repository itself to live in a different directory (here, .git) from the folder that contains the files you want to version.
The next step will be to have the spider's page processor do a diff on each page that it has encountered before and, if there are differences, enter a new version. Then the question becomes how upstream consumers get notified. The natural response would be that they could subscribe to be notified of changes (Observer). I read an interesting, if showy and opaque, article the other day from some *Scala* cat about how *Observer* ought to be deprecated. Thought it was pretty weak, really. My Butthead reading was: 'words, words, words, "when used wrong, the results with Observer are suboptimal," words, words...'
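As a rough sketch of that next step, assuming an in-memory list of versions as a stand-in for the git-backed history: the processor records a page, a new version is stored only when the content differs from the latest one, and subscribers are told about it. `PageChangeListener` and `PageHistory` are hypothetical names; in the real thing, `record` would presumably do an add and commit through JGit instead of appending to a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback for upstream consumers (the Observer side). */
interface PageChangeListener {
    void pageChanged(String url, int version);
}

/** Hypothetical in-memory stand-in for a versioned page store. */
class PageHistory {
    private final Map<String, List<String>> versionsByUrl = new HashMap<>();
    private final List<PageChangeListener> listeners = new ArrayList<>();

    void subscribe(PageChangeListener listener) {
        listeners.add(listener);
    }

    /** Stores a new version only if the content differs from the latest; returns true if stored. */
    boolean record(String url, String html) {
        List<String> versions = versionsByUrl.computeIfAbsent(url, k -> new ArrayList<>());
        if (!versions.isEmpty() && versions.get(versions.size() - 1).equals(html)) {
            return false; // unchanged since last visit, nothing to version or announce
        }
        versions.add(html);
        for (PageChangeListener listener : listeners) {
            listener.pageChanged(url, versions.size());
        }
        return true;
    }
}
```

A scraper or indexer would call `subscribe` once and then only ever get woken up for pages that actually changed.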
From http://www.jroller.com/robwilliams/entry/embedding_jgit_a_first_look