On Internal Engineering practices at Amazon

A company that's innovating how the rest of the companies work doesn't innovate internally.

I should start with a side note about the title of this post. I initially called it "Why do internal tools at Amazon suck?" because that sounded more eye-grabbing for posting on HN. But then I realized that would be overly critical and unfair to the teams that work so hard on internal tools. This article is really a mix of internal tools, engineering practices, and culture, so it touches on more than the title suggests.

I was recently talking to a college friend who had called to wish me a happy birthday. He had recently joined Amazon. I asked him how he was finding it, and there were a couple of things he mentioned.

"Their internal tooling is sophisticated and great, but people are ruthless here."

He talked about how his first deliverable didn't go well.

"I took 2 hours to code the whole thing, but then getting it from my machine to actually deployed and running on servers took forever."

And as someone who worked at Amazon for 2 years right after college, it completely resonated with me. Yes, there is a general lack of empathy in the culture, which has been written about fairly extensively, so let's talk about what most people expect to be good inside Amazon - the tech part - their internal tools.

Most of my friends who are there describe the internal tooling as "sophisticated". I think a part of that comes from Amazon being their first and only job, so they don't have much to compare it with. And since as an Amazon SDE you are mostly working with internal tools, it's easy to lose track of developments in the outside tech ecosystem, unless you spend 4 hours a day on Hacker News. ("Kubernetes? What's that? I am sure Apollo is better.")

Also, when I hear them call it "sophisticated", I think somewhere they are saying "It's complicated, so it's gotta be good. I must be dumb to not get it."

Exhibit A: Deployments

Amazon's internal deployment tool is Apollo, and it doesn't support auto-scaling. So every time around "peak", we had a peak scaling exercise where you manually provision more hosts (you order them and then wait for approval; you can't just click a button like you would in AWS). At the end of it, you get a T-shirt saying "I survived peak", which should really say "I survived shitty engineering practices."

While the rest of the industry was embracing infrastructure as code, we used to manually recreate environments in 4 different regions and manually copy Apollo configurations. Imagine a shiny-eyed CS grad keeping two tabs open and copy-pasting things from one to the other. The whole thing was error-prone, yet it was still the standard way of doing things. This should be an interview exercise rather than the stupid rotate-the-array questions.
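The fix for the copy-paste exercise is trivially small. As a sketch (the region names, keys, and config shape here are illustrative assumptions, not Amazon's actual Apollo format), rendering one template into per-region configs could look like:

```python
from string import Template

# Hypothetical config template; keys and regions are illustrative,
# not Amazon's actual Apollo configuration format.
CONFIG_TEMPLATE = Template(
    "service=$service\n"
    "region=$region\n"
    "endpoint=https://$service.$region.example.internal\n"
)

REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-1"]

def render_configs(service: str) -> dict[str, str]:
    """Generate one config per region from a single template,
    instead of hand-copying values between browser tabs."""
    return {
        region: CONFIG_TEMPLATE.substitute(service=service, region=region)
        for region in REGIONS
    }
```

One template, four regions, zero tabs open side by side - and a diff in code review instead of a silent typo in production.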

There used to be sprint time allocated to this shit.

One time, during standup, a guy mentioned that he had spent around an hour on a deployment the previous day, and his manager asked, "Well, what were you doing while the deployment was happening? Looking at it?"

Exhibit B: Logs

Any self-respecting company running software on distributed machines should have centralized, searchable logs for its services. The memories of ssh-ing into multiple hosts in multiple terminal tabs, then firing a request to see which machine got it, still haunt me. To get past this, someone created a wrapper (called remote-command) which would run a command on multiple hosts, so you could grep across the fleet - but I have to admit I could never get it to work.
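A wrapper like that remote-command tool is small enough to sketch. Assuming the hosts are reachable over plain ssh (the ssh invocation and injectable runner are my assumptions, not the internal tool's actual interface), a parallel fan-out could look like:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_run(host: str, command: str) -> str:
    """Run a command on one host over ssh and return its stdout."""
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, timeout=30
    )
    return result.stdout

def fanout(hosts: list[str], command: str, run=ssh_run) -> dict[str, str]:
    """Run the same command on every host in parallel.
    The `run` callable is injectable so the fan-out logic is testable
    without a real fleet."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = pool.map(lambda h: run(h, command), hosts)
    return dict(zip(hosts, results))

# e.g. fanout(hosts, "grep -c ERROR /var/log/service.log")
```

That's roughly a poor man's centralized grep - still no match for actually shipping logs to one searchable place.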

This company makes CloudWatch. [1]

Exhibit C: Service Discovery

What service discovery? We used to hard-code load balancer host names in config files.
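Even without a dedicated discovery system, DNS gets you most of the way: resolve a stable service name at call time instead of baking one hostname into a config file. A minimal sketch (the service name below is made up):

```python
import socket

def resolve_endpoints(name: str, port: int) -> list[str]:
    """Look up all current addresses for a service name at call time,
    rather than hard-coding a single load balancer host in config."""
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})

# e.g. resolve_endpoints("orders.internal.example.com", 443)
```

When the load balancer moves, you update one DNS record instead of hunting through every consumer's config files.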

One time we had to do a JDK8 migration on a service that just kept bouncing from one team to another, because reorgs were the only way managers could think of to improve the efficiency of teams. Nobody chants Ownership more than Amazon managers. But ownership at Amazon is a one-way street: you saying "Yes" to whatever you are given. The number of reorgs and services bouncing from one team to another is the very opposite of Ownership.

Anyway, so we had to migrate this service, and someone asked, "Well, who's using it?" And nobody had an answer. This company had been doing SOA for the last 10 years, and nobody had a definitive way to figure out which service was talking to which. This company had posters about "Ownership", yet nobody on the team owning the service knew who was using it.

So how did we do it? We just decided to shut down the service and see if anyone complained. No one did, and the service stayed shut.

Exhibit D: Containers

While containers were taking over the world during the last 4 years, they were unheard of inside Amazon. I am sure that in some design meeting, some junior engineer mentioned containers and was shut down because the L7 manager couldn't care less. He had deliverables to take care of.

The amount of productive man-hours lost struggling with Amazon's internal systems used to be insane, yet everybody took it as a way of life. The aim should be to make the tools easier to use, not to get used to the pain.

Amazon does not experiment

Now, I am not saying a company has to jump on every shiny new thing, but as a tech leader, you should be experimenting and paving the way forward. And that's a general problem with Amazon. They. Don't. Experiment. (Unless it brings them money - in which case, sure, Alexa.)

How many open source tools and libraries have come out of Amazon?

Facebook contributed React; Google gave us TensorFlow, Kubernetes, and a lot more. Microsoft, which according to Bezos is on its Day 2, is the largest OSS contributor on GitHub. Twitter, LinkedIn, Square, Airbnb - I could go on. But Amazon doesn't give a fuck, because giving back to the community takes time, and nobody inside has time to look beyond the next deadline.

Amazon will be the first one to take an open source tool like Presto, and sell it as a hosted service, but they won't come up with Presto.

But OSS contributions aren't the only area where Amazon lags behind. Facebook thought PHP wasn't good enough, so they developed Hack. Google is a big Java/C++ shop, but they developed Go to better serve their developers. A small startup like JetBrains developed Kotlin because they wanted a modern language for the JVM.

But Amazon, which has thousands of services written in Java, never felt the need for a better language? I find that hard to believe. And I am not talking about inventing a new language - not every company has to do that. The problem is that internally, at Amazon, you can't even use a language other than Java. There's some amount of Rails, and JavaScript has to be there, but if you want to experiment with, say, Go, Kotlin, or anything else, you are going to get nothing but pushback. I thought one of the advantages of SOA was that teams could choose their stack, but that's something you don't see at Amazon. (As far as I know, Node.js was used internally in a limited capacity, but they had an alternative to npm for security reasons, and you had to get an npm package approved to use it internally.)

Take Facebook, for example: 50 percent of messenger.com's code is written in ReasonML, which is definitely not a mainstream language. And that's a product used by a billion users. That requires treading on uncomfortable technical ground, something an Amazon team will never do. (Again, unless they have to - unless it's critical to their revenue.)

There's a thin line between standardization and discouraging experimentation and Amazon falls way on the wrong side of that line.

1. During my time there, there was an internal movement called MAWS - Move to AWS - which aimed to migrate all teams to AWS. Not sure if that ever materialized.