
DocRaptor and Its Python APIs


We take a look at an online HTML-to-PDF conversion service and its Python APIs. Sound interesting? Then we invite you to read on!


What Is DocRaptor?

DocRaptor is an online service that can be used to transform HTML documents into PDFs or even Excel documents. This is a paid service, but there’s a 7-day free trial so you can have a chance to try it out first. They have 8 different plans, ranging from 125 documents per month for $15 per month to 100,000 documents for $2250, plus a level for an unlimited number of documents, for which you need to contact them to set up a price.

All of the plans allow you to go over the number of documents each month, paying an overage cost per document equal to what the plan already charges ($15 for 125 documents = 12¢ per document; $2250 for 100,000 documents = 2.25¢ per document). You can test your documents for free, but those are watermarked, and it seems they still need an API key.
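The per-document arithmetic is easy to check in a few lines of Python. This is only a sketch: the helper name is made up for illustration, and the two price points are the ones quoted above.

```python
def cents_per_document(monthly_price_dollars, documents_per_month):
    """Return the effective price of one document, in cents."""
    # Convert to whole cents first so the division works on integers.
    monthly_price_cents = monthly_price_dollars * 100
    return monthly_price_cents / documents_per_month


# The smallest and largest fixed plans quoted above.
print(cents_per_document(15, 125))       # 12.0
print(cents_per_document(2250, 100000))  # 2.25
```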

To use the service, you submit a POST request containing certain options and the document you want to convert, either as inline content or as a URL. You also need the API key supplied with your paid account.

These requests can be simply done with a command line call to curl (using their API reference documentation to know exactly how), but the organization also supplies libraries that you can use in Ruby, Python, Node, PHP, Java, and C#. Later on, we’ll be going over their Python APIs.

HTML as a PDF? Or as XLS? What?

Before we get into the APIs, though, we need to address the elephant in the room of how the heck they turn HTML into the other document types. It’s pretty cool, but getting into it is a little out of the scope of this post, so I’ll point you to DocRaptor’s documentation about such things. I’ve read over it, and it’s quite good.

To ease your hunger a little, though, I’ll tell you a few tidbits. When making an Excel document, the HTML contains only tables. When it comes to making PDFs, there are several metadata attributes that can be used to set the “paper” size and to control headings, such as auto-generating a table of contents and specifying whether a heading starts on a new page or should avoid landing at the bottom of a page. It’s really interesting, so seriously check it out.

Python API

First, I’d like to point out that, even at this stage in the game, they support both Python 2 and Python 3. The library is easily installable with pip with the name docraptor (who would have guessed?).

The API is actually quite small, especially if you don’t need to use any of the special features. In the basic example on their site, you import docraptor, configure your API key with docraptor.configuration.username = "your api key", call create_doc() on a newly created instance of docraptor.DocApi (no arguments needed for the constructor), then write the returned data into a file. This is the synchronous way of doing it, and they do provide an asynchronous way (not using Python’s async system), which we’ll get to.

So that’s pretty much just three points of contact with their API. This doesn’t cover error handling, which adds one more: their docraptor.rest.ApiException.
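Put together, the synchronous flow might look something like the sketch below. The dictionary keys (test, document_type, document_content) follow what DocRaptor's published example appears to use. The function names and the lazy import are my own choices for illustration. Depending on the library version you have installed, the API key may need to be configured differently, so treat make_client() as an assumption rather than the official way.

```python
def make_client(api_key):
    """Configure the library; return (DocApi instance, its exception class).

    Assumes the `docraptor` package is installed and that the module-level
    configuration described above is how the key is set in your version.
    """
    import docraptor  # imported lazily so html_to_pdf() can be tried without it

    docraptor.configuration.username = api_key
    return docraptor.DocApi(), docraptor.rest.ApiException


def html_to_pdf(doc_api, api_error, html, output_path):
    """Convert an HTML string to a PDF file; return True on success."""
    options = {
        "test": True,  # test documents are free but watermarked
        "document_type": "pdf",
        "document_content": html,
    }
    try:
        pdf_bytes = doc_api.create_doc(options)
    except api_error as error:
        print("DocRaptor request failed:", error)
        return False
    # The response is the raw document, so write it in binary mode.
    with open(output_path, "wb") as pdf_file:
        pdf_file.write(pdf_bytes)
    return True
```

With the real library, you would first unpack the pair returned by make_client("your api key") and then pass both values to html_to_pdf() along with your HTML and an output path. Passing the client in as an argument also makes it easy to try the function with a stand-in object before spending any of your document quota.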

The hardest part of working with the API seems to be passing in arguments to create_doc(). First, the documentation for it is pretty bad. The best thing you have is the example code. I dug into the code, and it’s a pretty difficult read because, as the documentation for the class says, “This class is auto-generated…” Whatever process they used for it wasn’t the best. Anyway, the in-Python documentation also seems to conflict with the examples and with itself.

For example, the methods claim that the main argument they need is an object of type Doc, but all 3 examples simply pass in a dictionary. Digging more deeply into the code, it actually looks like the code prefers the dictionary, but you have to dig deep to find any sign of it being used instead of just passed along. This is fine, but it makes using the code’s documentation harder. To make things worse, the Doc type’s documentation is wrong too. The constructor docs claim to need two dictionaries as arguments, but the actual parameter list is empty, other than self. Instead, you’re supposed to set all the properties individually after creation.

Luckily, people will most likely follow the example code more than trying to read the code’s documentation (unlike me), so these don’t really represent a real problem.

The next discussion is on asynchronicity. There seem to be two different kinds of asynchronous actions here: an asynchronous request and an asynchronous creation of the document.

DocRaptor gives no example of the first kind, asynchronous requests, and mentions it only in the in-code docs, but it kind of makes sense. It’s the type of asynchronicity you’d expect: you pass a callback function using the callback named argument, and that function is called when the response returns. Instead of returning the document, the call returns the thread that the request is made on.
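Since there's no official example, here is a minimal sketch of how that callback style could be used, based only on the behavior described above. The helper name and the threading.Event are my own additions, and newer releases of the library may spell the asynchronous option differently, so check the in-code docs of the version you have installed.

```python
import threading


def request_pdf_in_background(doc_api, options, output_path):
    """Start a create_doc() request without blocking.

    Returns the request thread and an Event that is set once the
    document has been written to output_path.
    """
    finished = threading.Event()

    def on_response(pdf_bytes):
        # Called once the HTTP response has arrived.
        with open(output_path, "wb") as pdf_file:
            pdf_file.write(pdf_bytes)
        finished.set()

    # With a callback, the call returns the request thread, not the document.
    request_thread = doc_api.create_doc(options, callback=on_response)
    return request_thread, finished
```

The caller can carry on with other work and later call finished.wait(timeout=...) to find out whether the file is ready.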

The other kind is something special, but at least they give you an example. To do it, you call create_async_doc() instead of create_doc(), and the response comes back quickly, returning some sort of id for your document instead of the document itself (unless you also do an asynchronous request, in which case it will return the request thread, and the callback will receive the id). With this id, you poll the service using DocApi’s get_async_doc_status() method, which returns the status. This status is either "completed", "failed", or… nothing? In the code example, it simply uses else for any other possible statuses, and the code doesn’t set the statuses, so I have no way to look it up.

When it’s not completed or failed, you wait a little bit, then try again. When it’s completed, you make one more call to the service with DocApi’s get_async_doc() method. This requires an id from the response that gave you a "completed" status. Finally, the return from that is the same as from the synchronous call, create_doc(), which can be written to a file. It should also be noted that these last two methods can also be given a callback to do the request asynchronously.
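The polling loop described above could be sketched like this. The attribute names status_id and download_id are what DocRaptor's example code appears to use, so double-check them against your installed version. The function name, the poll interval, and the timeout handling are mine. The default timeout mirrors the 10-minute limit on asynchronous documents, and the code targets Python 3.

```python
import time


def fetch_async_doc(doc_api, options, poll_seconds=1.0, timeout_seconds=600.0):
    """Create a document asynchronously and return its bytes once ready."""
    create_response = doc_api.create_async_doc(options)
    status_id = create_response.status_id  # assumed attribute name
    deadline = time.monotonic() + timeout_seconds

    while time.monotonic() < deadline:
        status_response = doc_api.get_async_doc_status(status_id)
        if status_response.status == "completed":
            # The completed status carries the id needed for the download.
            return doc_api.get_async_doc(status_response.download_id)
        if status_response.status == "failed":
            raise RuntimeError("DocRaptor could not create the document")
        # Any other status is treated as "not done yet": wait, then ask again.
        time.sleep(poll_seconds)

    raise TimeoutError("gave up waiting for the document")
```

Treating every unknown status as "keep waiting" matches the else branch in the official example, and the deadline keeps an unexpected status from looping forever.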

So, that’s a textual explanation, but you can also look at DocRaptor’s example code for both synchronous and asynchronous calls.

But why do they provide that second type of asynchronous call? The problem stems from the possibility of the document taking a long time to convert, which makes it take a long time to receive a response, potentially causing a timeout. When done synchronously, DocRaptor limits that time to 60 seconds, so if you have a document that takes longer than that, you need to do it asynchronously. But even that isn’t limitless; DocRaptor limits that time to 10 minutes. After that, it can probably be assumed that you’re being a jerk by sending a gigantic file or there’s a problem and it needs to abort.

Finally coming back to it, the most complicated thing is passing in the right arguments to create_doc() or create_async_doc(). At a bare minimum, you need to specify the type of file you’re generating (pdf, xls, or xlsx) and either a string of HTML or a URL to an HTML document.
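As a rough illustration of that bare minimum, a small helper can build the dictionary and catch the obvious mistakes before anything is sent. The helper itself is hypothetical, and the key names (document_type, document_content, document_url, test) are the ones DocRaptor's Python example appears to use, so verify them against the API documentation.

```python
SUPPORTED_TYPES = ("pdf", "xls", "xlsx")


def build_doc_options(document_type, html=None, url=None, test=True):
    """Return a minimal options dictionary for create_doc()."""
    if document_type not in SUPPORTED_TYPES:
        raise ValueError("document_type must be one of %s" % (SUPPORTED_TYPES,))
    # Exactly one source is required: inline HTML or a URL to fetch.
    if (html is None) == (url is None):
        raise ValueError("give either html or url, but not both")

    options = {"document_type": document_type, "test": test}
    if html is not None:
        options["document_content"] = html
    else:
        options["document_url"] = url
    return options


print(build_doc_options("pdf", html="<h1>Hello</h1>"))
```

Anything beyond these keys would simply be added to the returned dictionary before passing it along.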

For the rest, you can look over the API documentation linked to earlier, and you can look at the examples to get an idea of what’s typical.

Conclusion

So, I got a little harsh on them in the middle there, but overall it looks like a great service that’s pretty easy to use. The hardest part is getting good documentation from the Python code itself, but if you just stick to the examples, for the most part, you’ll be fine. The examples won’t help you do asynchronous requests, but, hopefully, this article, mixed with double-checking the code docs, will get you far enough with that if you need it.

Most of what you need to know is on their site, though (link to all the documentation pages is at the bottom of the site’s pages, down by their “contact” and “privacy” links), so I highly recommend using that.

Thanks for reading! Soon, I’ll be putting out a similar post on using the Java API, including using it in Kotlin.


Topics:
python, apis, html, integration

Published at DZone with permission of
