Why (and How) We Left App Engine After It Almost Destroyed Us
Settle in for a war story as the co-founder of Codename One details what went wrong with Google App Engine and the steps the startup took to fix it.
Last week, I saw this post, which reminded me a lot of our experience with App Engine a few years ago. I shared that in a comment, which got some attention, so I think it’s worthwhile to describe what we went through and how we got out (almost) of the sand trap that is Google App Engine.
But first, some background: our startup uses a very complex set of servers hosted almost everywhere over the years, from Azure, to AWS, Linode, Digital Ocean, and many others. The reason for this complexity relates to our core product, which requires domain-specific servers.
To simplify our initial launch and to scale properly, we chose App Engine. As we’re Java guys, this made a lot of sense. The main reason for picking PaaS over IaaS is simplicity. We saw it as a shortcut so we could focus on our mobile platform and not on managing servers.
For the first couple of years, things worked fine. We had some issues, to be sure: a bug in the Eclipse plugin (we used Eclipse because we started way back, when that was the recommended approach) caused a bad deployment. It picked up the IDE’s JDK and didn’t specify a class version. Our servers were down for hours and the logs were completely cryptic.
We still liked App Engine after that and we even did a talk at JavaOne where we discussed how helpful it had been for us to scale our business rapidly.
To keep downtime like that from recurring, we decided to pay for gold support (an extra 400USD per month). A Google rep wrote to me trying to arrange a call, but nothing really happened due to scheduling conflicts, and I never actually spoke with a Google rep in relation to the gold support. I did meet with a couple of Google reps at their local offices before upgrading to gold and had some discussions about App Engine with them, but those were mostly abstract line-of-business talks.
In March 2015, our monthly “data read ops” suddenly jumped from a $70 spend to four digits. Being a busy startup, we didn’t notice until the bill arrived, by which time we were already accruing charges on the April bill!
The thing is, we didn’t change anything, or at least didn’t notice any change as we were pretty busy during that time. To this day, I have no idea what went wrong, but I’m getting ahead of myself.
The “App Engine Datastore Read Ops” billing line item is pretty opaque; it effectively means we read from the datastore too often. Google recommends using Memcache for frequent data access, which we did, but “somewhere,” Memcache didn’t work. The problem is that this is a needle in a haystack!
If we had logged every datastore access, we would have had huge, unreadable logs to go through. This isn’t something one can debug in App Engine. On a regular database, we would have been able to place triggers on tables to at least see which table was causing the problem, but we didn’t have that level of reporting. Google had some form of monitors you could install, but the reports they provided didn’t help at all.
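To illustrate the kind of visibility we were missing, here is a minimal sketch of a read-through cache that counts backing-store reads per entity kind instead of logging each one. This is not the code we ran and it uses no App Engine API: the class name `CountingReadThroughCache`, the `kind`/`id` key scheme, and the `Function` standing in for the datastore lookup are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Hypothetical read-through cache that counts store reads per entity kind. */
class CountingReadThroughCache<V> {
    // In-memory stand-in for a cache such as Memcache.
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    // One counter per entity kind: cheap to keep, small enough to read.
    private final Map<String, LongAdder> storeReadsByKind = new ConcurrentHashMap<>();
    // Stand-in for the real (billed) datastore lookup; an assumption of this sketch.
    private final Function<String, V> store;

    CountingReadThroughCache(Function<String, V> store) {
        this.store = store;
    }

    /** Returns the cached value, falling back to the store on a miss. */
    V get(String kind, String id) {
        String key = kind + ":" + id;
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // Cache miss: this is the read that would show up on the bill.
        storeReadsByKind.computeIfAbsent(kind, k -> new LongAdder()).increment();
        V loaded = store.apply(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    /** Drops one entry, e.g. after the underlying entity was updated. */
    void invalidate(String kind, String id) {
        cache.remove(kind + ":" + id);
    }

    /** Number of store reads recorded so far for the given kind. */
    long storeReads(String kind) {
        LongAdder counter = storeReadsByKind.get(kind);
        return counter == null ? 0 : counter.sum();
    }
}

class CacheDemo {
    public static void main(String[] args) {
        CountingReadThroughCache<String> cache =
                new CountingReadThroughCache<>(key -> "value-for-" + key);
        cache.get("User", "1");        // miss, reads the store
        cache.get("User", "1");        // hit, no store read
        cache.get("Build", "7");       // miss, reads the store
        cache.invalidate("User", "1");
        cache.get("User", "1");        // miss again after invalidation
        System.out.println("User reads: " + cache.storeReads("User"));
        System.out.println("Build reads: " + cache.storeReads("Build"));
    }
}
```

Dumping a handful of per-kind counters every few minutes would likely have told us which kind of entity was bypassing the cache, without the unreadable logs and without waiting for a bill.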
Naturally, we called on Gold support, to whom we even sent our full project source code. They concluded (after reviewing their logs) that the problem was on our side…
I don’t dispute that. What I do dispute is charging for something you can’t possibly control or monitor!
This was a disaster, since we were getting a bill so large it nearly wiped out our revenue. Being bootstrapped and in the early stages of monetization, this had the potential of sending us into bankruptcy. Google suggested placing charge limits on the account, which effectively means bringing the site down. That’s an insane suggestion for a service whose whole purpose is “scale”.
The problem is that Google only updated billing once a day (or every 12 hours, I don’t recall exactly since I’m recounting details from 2015), so there was literally no way to know if a fix we made had worked until we got billing data the next day.
We just cached every possible thing, removing everything that wasn’t essential over and over. Since there was no way to debug this, it included a lot of guesswork and finger-crossing, hoping we weren’t making things worse.
The billing eventually went down, but to this day, we have no way of knowing which one of our changes fixed it.
Why Is This Google's Fault?
Clearly, something in our code broke, right?
I will take some blame, mostly for picking App Engine, but not for this issue.
The reason companies and individuals pick App Engine is to reach a “Google scale” business relatively easily. That’s the reason we picked them: we wanted to avoid a lot of the complexities that come with deploying and managing individual servers.
The main fault on Google’s side is in opaque billing. When I get a phone bill, I get itemized details on why the charges were made. That’s important: if my toddler took the phone and started dialing random numbers, it’s my fault, and the phone company will point me at the problematic numbers.
Google did no such thing. They listed an opaque item within the line items and gave no information about the actual source or a way to debug this (at least not back in 2015). Gold support didn’t help either, so even if there was a way to do it, the fact that their paid support tier couldn’t point us to it is crucial.
Migration Away — Alternatives Are WAY Better
We still have pieces of code in App Engine, as migrating the database away is really hard and this isn’t our chief priority. Having said that, migrating away from App Engine gave us HUGE unforeseen advantages:
- Simplicity: Rewriting code was slightly tedious but turned out to be WAY simpler. Getting out of App Engine’s restrictions really simplified a lot of the code. Recently, we started working with Spring Boot, which made the code even simpler.
- Price: I had my doubts when Linode introduced a 5USD server, but we’ve run a Spring Boot app with MySQL on such a server and so far, it’s worked beautifully. Most of our servers have higher specs, but the ability to use physically separate servers is huge.
- Scalability: This surprised me, but today, I think our new architecture is more scalable than with App Engine. To simplify the migration, we split up the various pieces to separate servers and took a microservice approach for the new services. This means that a single service failure doesn’t bring down the whole thing. Better still, a performance issue just slows things down a bit instead of breaking the bank. We could also leverage CDNs like CloudFlare to provide a level of availability and performance that is pretty great. This is obviously very hard to measure, but we haven’t had server-related downtime in the functions we moved away from App Engine. The only thing that did fail once is some of our code that used S3, which went down during the recent Amazon S3 outage.
- Easier reporting/debugging: I can use standard reporting tools on regular databases. I can remotely connect to the database and debug on production data to see issues users are facing. Both of these are huge.
- China: App Engine has serious issues there.
When I wrote the original comment, someone asked about AppScale. We looked into them in 2015 and ran into some issues that I don’t recall exactly. I’m sure those have been resolved by now, but back then, we couldn’t get it to work for us.
One of my early jobs was building flight simulators, and we worked with a lot of military pilots who instilled in me a “debrief” ritual. When you do something, you need to honestly and methodically examine your failures and see how you can avoid them in the future:
- Billing: With new services, we now try to avoid flexible billing. We still use AWS’s S3, but ideally we want to remove that. I’d rather pay more for a fixed price, although after reviewing some of the alternatives, S3 itself seems expensive for our use cases.
- Avoid PaaS: Yes, that might be throwing the baby out with the bathwater, but it’s really hard to get PaaS right. We had an issue with the Parse shutdown too, so if the PaaS is exposed to end users (requires a software update on their end), this is a problem.
- Microservices: Our original implementation was monolithic. It helped us launch quickly, so I don’t regret that, and don’t think it was a mistake, but moving forward, we simplified and distributed.
- Startups and smaller companies are better than big companies: I’m arguably biased here, but having worked with a lot of providers, here are some of the other experiences we had:
  - Azure: Multiple crashes and over-billing (although nothing as serious as above, just an unclear pricing policy) with no real support.
  - AWS: Some of the most confusing and obtuse billing around (reserved instances are both expensive and difficult).
  - Digital Ocean, on the other hand, was cheaper than Google and AWS. They had good service, and when we chose to migrate to Linode (who were even cheaper), they were very helpful and even refunded the remaining credit!
  - Linode also instantly refunded a billing mistake (on our side) and has been very helpful.
The reason for this is pretty obvious: a person working for Google/Amazon is a cubicle dweller in a big faceless organization. To them, we are one in the crowd. For a startup, we are “the business”. You matter and the service matches. I’m not concerned about startups disappearing. The only cases where products disappeared for us were Parse and a few features in App Engine (blob support), which got deprecated.
Our annual spend on infrastructure has kept relatively steady despite business growth. In fact, in some respects, we spend less on servers than we did when App Engine was running correctly, and scalability hasn’t been impacted.
Published at DZone with permission of Shai Almog, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.