eSpecialtyCorp was once a leading ecommerce retailer of specialty automotive parts and aftermarket accessories. Unlike commodity retail, selling automotive accessories introduces an interesting and challenging complication: nearly every product has detailed vehicle fitment information. The permutations can be dizzying. As you might imagine, this complication early on forced the engineering team to design a considerable amount of custom software as the company grew.

So, what's the problem?

As eSpecialtyCorp grew from a small ecommerce startup to a company serving millions of customers, our engineering teams started to suffer from typical growing pains:

  • a proliferation of patchwork technology
  • a massive monolith, tightly coupled to several front-end applications
  • ample amounts of "glue and duct tape" holding the pieces together

Once an environment that was easy for a small team of engineers to manage and reason about, it had ballooned into something so unwieldy that even our most experienced senior engineers struggled to work in it, iterate on it, and operate it effectively. Our database footprint alone grew from three databases to several dozen (multiplied across integration, QA, and production environments). We knew that if we wanted to continue scaling, we needed to adopt modern patterns while also leveling up our engineering practice.

Monolith to Microservices: Maturing an Engineering Organization

eSpecialtyCorp had embraced microservices. We were on our way to decomposing our monolithic applications opportunistically as we designed new features. We'd built lots of automation around provisioning infrastructure, configuration management, and build-and-deploy tooling. Lots of elements were moving more smoothly than before, except for the database.

Remember those three databases? Nearly everyone in the technology group (and several folks outside of the group) had access to production environments, including access to sometimes-sensitive and personally identifiable information (PII). We had a serious problem on our hands, one that posed too much risk and that we had to solve before our next PCI audit. The Payment Card Industry (PCI) standards mandate that merchants have authorized third parties audit their security and operating practices around how they store and manage PII.

We immediately revoked production database access for most of the team (everyone except our systems administrators), which promptly backfired. We'd hamstrung our teams: they could no longer diagnose problems with their apps as quickly, and they couldn't process customer support tickets, which routinely required query access to production databases. Our SLAs slipped, and our customer service started to suffer.

/gmdb to the rescue.

After quickly reinstating production access to a few engineers on each team, we knew we needed a better system in place long-term.

We started by establishing the following guiding principles for our approach.

  • No one should have permanently elevated or administrative privileges to the systems that support our software. To reduce risk, we should operate in a mode that provides the least privileges possible to get our jobs done.
  • Teams must have elevated access to systems when they need it.
  • Elevated access must be intentional and deliberate.
  • Elevated access must be temporary.
  • Elevated access must be transparent and auditable.

We'd start with our database environment, then move on to the remaining infrastructure components: servers, load balancers, firewalls, and so on. Ultimately, this would help our teams get the resources they needed to build and maintain good software.

In a matter of days, we'd shipped version 1.0 to our engineering teams.

The solution.

Our teams were already well-versed in collaborating asynchronously using tools like Mattermost (an open-source Slack alternative). We'd started to adopt the "ChatOps" philosophy; most of our builds reported into their respective channels, which helped create transparency around how our teams operated and helped them ship code more frequently.

We developed a series of PowerShell scripts and access control lists (fairly rudimentary, but more than sufficient to test our assumptions for v1.0!) and tied them to what Slack calls "slash commands," which let users execute commands right inside a channel.
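
To give a sense of what "fairly rudimentary" means here: an access control list at this stage could be as simple as a shared CSV that the scripts read on every request. The columns and values below are purely illustrative, not our actual schema:

User,DatabaseServer,DatabaseName,AllowedAccess
jane.doe,prod-sql-01,orders,read;write
jane.doe,prod-sql-01,customers,read
sam.lee,prod-sql-02,payments,read;pii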

Here’s how the basic interface was constructed in v1.0:

/gmdb --database-server "[database-server]" --database-name "[database-name]" --access [read | write | pii | admin]
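
For example, a request for read-only access might have looked like this (the server and database names are invented for illustration):

/gmdb --database-server "prod-sql-01" --database-name "orders" --access read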

After entering the slash command into the channel, a PowerShell script would run and perform the following actions (a simplified sketch follows the list):

  1. Validated input
  2. Validated the user was permitted to request access
  3. Provisioned temporary access for the user, automatically expiring after 30 minutes of inactivity
  4. Logged the request in the channel
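
Here's a simplified sketch of what that handler could have looked like. Everything below is illustrative rather than our production code: the ACL file path, webhook URL, domain prefix, and the gmdb_pii role are assumptions, Invoke-Sqlcmd comes from the SqlServer module, and the 30-minute inactivity expiration ran as a separate cleanup job that isn't shown.

# Hypothetical sketch of the v1.0 /gmdb handler (illustrative names throughout).
param(
    [Parameter(Mandatory)] [string] $RequestingUser,
    [Parameter(Mandatory)] [string] $DatabaseServer,
    [Parameter(Mandatory)] [string] $DatabaseName,
    [Parameter(Mandatory)] [ValidateSet('read', 'write', 'pii', 'admin')]
    [string] $Access
)

# 1. Validate input (ValidateSet already constrains $Access).
if ($RequestingUser -notmatch '^[\w\.\-]+$' -or
    $DatabaseServer -notmatch '^[\w\.\-]+$' -or
    $DatabaseName   -notmatch '^[\w\-]+$') {
    throw 'Invalid user, server, or database name.'
}

# 2. Validate the user is permitted to request this level of access.
$acl   = Import-Csv -Path '\\fileshare\gmdb\acl.csv'   # hypothetical ACL location
$entry = $acl | Where-Object {
    $_.User -eq $RequestingUser -and
    $_.DatabaseServer -eq $DatabaseServer -and
    $_.DatabaseName -eq $DatabaseName -and
    (($_.AllowedAccess -split ';') -contains $Access)
}
if (-not $entry) {
    throw "$RequestingUser is not permitted to request '$Access' on $DatabaseName."
}

# 3. Provision temporary access by mapping the request to a database role.
#    (A separate scheduled job revoked grants after 30 minutes of inactivity.)
$roleMap  = @{ read = 'db_datareader'; write = 'db_datawriter'; pii = 'gmdb_pii'; admin = 'db_owner' }
$login    = "CORP\$RequestingUser"   # hypothetical AD-backed login naming
$grantSql = "IF USER_ID('$login') IS NULL CREATE USER [$login] FOR LOGIN [$login]; " +
            "ALTER ROLE [$($roleMap[$Access])] ADD MEMBER [$login];"
Invoke-Sqlcmd -ServerInstance $DatabaseServer -Database $DatabaseName -Query $grantSql

# 4. Log the request back into the channel via an incoming webhook.
$message = "$RequestingUser granted temporary '$Access' access to $DatabaseName on $DatabaseServer at $(Get-Date -Format o)."
$body    = @{ text = $message } | ConvertTo-Json
Invoke-RestMethod -Uri 'https://mattermost.example.com/hooks/abc123' -Method Post -ContentType 'application/json' -Body $body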

By implementing this simple solution, we'd quickly achieved most of our goals. People no longer had permanent access to production environments, limiting inadvertent changes or mistakes. We now had more sophisticated logging and controls in place in support of an eventual PCI audit. And most importantly, our engineering teams could serve their customers effectively while limiting risk!

Delivering a solution like /gmdb was a critical milestone in helping our teams understand how to make progress while maturing our engineering practice. We learned that we should expect rough patches along the way, but that if we remained disciplined and took a step back to better understand the problem, we'd be more likely to succeed.