What I learned about Microsoft doing DevOps: Part 7 – How to shift quality left in a cloud and DevOps world

2018-01-09

Quality and testing are essential parts of application development. This is still true in a DevOps environment where you practice continuous delivery. But when moving from testing in a waterfall world to continuous delivery, some things have to change.

In this post I’ll discuss what testing looks like in the modern world. I’ll add some examples from how Microsoft has made these changes while building Visual Studio Team Services and share some best practices that I encountered while working with customers.

This is part 7 of a series on DevOps at Microsoft.

The Old Way

I see that a lot of organizations still have distinct developer and test roles. This is very normal when developing software with a waterfall process, where development and test are separate activities that take place in distinct phases.

If you look at the 1990s, Microsoft also had separate roles for dev and test. They used three different roles:

Developer
Software Design Engineer in Test (SDET): developed test automation and test infrastructure
Software Test Engineer (STE): ran automated and manual tests

For that point in time this was a successful model. Microsoft shipped huge products like Office and Windows using these testing techniques.

However, this separation of roles did lead to problems. Testing became a bottleneck in a process where code was just thrown over the wall from Developer to SDET to STE.

Circa 2000 Microsoft removed the STE role. The SDET became responsible for creating and running all tests. This helped but testing was still a bottleneck.

Then came the cloud. Everything had to go faster. There were no more beta and release candidate releases where quality could be verified. Running a cloud service also meant that no downtime was allowed, so quality became even more important. Combine that with the shift to microservices that are deployed independently and have to work against different and ever-changing versions of other microservices, and you understand that doing things the old way wasn’t feasible anymore.

You probably recognize these issues and deal with some of them on a day-to-day basis. What did Microsoft change to adapt to the new cloud world and what can we learn from it?

Quality in the cloud

I find the following three principles are important when moving to the cloud with a continuous delivery cadence:

Own Quality

Having separate development and test roles will always lead to problems when it comes to owning quality. Is it the responsibility of the developer to write bug-free code or of the tester to find the bugs? Of course, a culture of finding blame is not what you want, but often that’s what you see when there is no clear owner of quality. I always suggest merging the development and test roles into one. Microsoft merged these roles into a role called engineer. This role is responsible end to end for the quality of the product. This means that testers will have to learn development skills and developers will have to learn test skills. You can’t throw any code over the wall since there is no longer a wall. This makes all team members feel responsible for the quality of the product and will lead to a better product and a faster cycle time.

Master is always shippable

Complex branching strategies with multiple branching levels containing service packs, hotfixes and past releases are a thing of the past. You can’t keep this up if you move to the cloud and continuous delivery. Every merge costs time and is a possible source of bugs. Simplifying your branching strategy and guaranteeing the quality of your branches and especially your master branch is a huge step.

Start with using Git as a source control system combined with pull requests and continuous integration builds. That way you automate a lot of checks before code gets into master. The biggest change most companies have to go through is to create a set of very fast executing unit tests that give them the confidence that they can release code to production. The following principles help you get started:

Tests should be written at the lowest level possible
Write once, run anywhere, including the production system
Product is designed for testability
Test code is product code, only reliable tests survive
Testing infrastructure is a shared service
Test ownership follows product ownership

The first bullet, writing tests at the lowest level possible, is the one where I see most problems. Most companies don’t have real unit tests. Instead they have a set of integration tests that take a long time to run, break easily and are flaky. Consider adopting the following levels for your tests as defined by the Microsoft VSTS team:

L0/L1 - Unit tests
- L0 – Broad class of fast in-memory unit tests
- L1 – Unit tests with more complex requirements e.g. SQL
L2/L3 - Functional tests
- L2 – Functional tests run against “testable” service deployment
- L3 – Restricted class integration tests that run against production

You should have a lot of level 0 and 1 tests, some level 2 tests and as few level 3 tests as possible. Level 0 and 1 are the tests that you run in your pull request flow. Level 2 runs in your continuous integration build and level 3 runs in your release pipeline.

There is no place like production

Creating a test environment that fully resembles production is hard if not impossible. This not only has to do with the environment but also with actual customer traffic. This means that to really up your quality, you have to be able to run tests in production, monitor the results and quickly respond.

In previous posts I already spoke about practices like ring deployments and feature flags. These practices allow you to slowly expose more and more of your customers to changes and quickly roll back when something goes wrong.
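As a rough illustration of how gradual exposure and a quick rollback can fit together, here is a small Python sketch. It assumes a percentage-based rollout per ring, which is only one possible way to define rings. The ring names, the percentages and the functions `bucket` and `is_enabled` are hypothetical and not taken from Microsoft's implementation.

```python
import hashlib

# Hypothetical rings: (name, percentage of users exposed to the change).
RINGS = [
    ("ring0_internal", 1),
    ("ring1_early_adopters", 10),
    ("ring2_everyone", 100),
]


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, current_ring: str, kill_switch: bool = False) -> bool:
    """Decide whether a user sees the new feature.

    kill_switch models the quick rollback: flipping it hides the
    feature for everyone without a new deployment.
    """
    if kill_switch:
        return False
    percent = dict(RINGS)[current_ring]
    return bucket(user_id) < percent


if __name__ == "__main__":
    for ring, _ in RINGS:
        exposed = sum(is_enabled(f"user-{i}", ring) for i in range(1000))
        print(ring, exposed, "of 1000 users exposed")
```

Because the bucket is derived from the user id, a user who sees the feature in an early ring keeps seeing it as the rollout widens, which keeps the experience stable while you watch your monitoring.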

Another important part of production testing is chaos testing. By deliberately injecting faults into your production environment, you can test how your system responds and whether it handles failures gracefully. For example, in the architecture post I looked at circuit breakers. A circuit breaker will help you build resilient applications but only if the circuit breaker is tuned correctly. And that’s something you can only test in production.
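To show what there is to tune, here is a minimal circuit breaker sketch in Python together with a dependency that has a switchable injected fault. This is a simplified illustration and not the implementation from the architecture post. The class `CircuitBreaker`, its parameters `failure_threshold` and `reset_timeout`, and the helper `make_dependency` are assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fails fast instead of calling the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open, failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def make_dependency(fault):
    """Return a fake dependency that fails while fault['on'] is True."""
    def dependency():
        if fault["on"]:
            raise ConnectionError("injected fault")
        return "ok"
    return dependency
```

The two values that need tuning are `failure_threshold` and `reset_timeout`. If they are too strict, the breaker opens on normal hiccups. If they are too lenient, it keeps hammering a failing dependency. Real traffic patterns decide which values are right, which is why injecting faults in production tells you more than a lab test.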

Conclusion

Testing in the modern world has changed a lot. Merging the test role into engineering, automating more and more tests and testing in production are key. To summarize, try applying the following principles:

Combined Engineering drives better accountability

E2E ownership drives right behavior. Reduced handoffs improve agility.

Focus on building a fast and reliable quality signal

Shift-left is not just a slogan. It’s possible to invert the test pyramid. Write tests at the right levels and automate as much as possible.

Release to production is only half the job

Ensuring quality at scale in production with real workloads is very important. There is no place like production for that.

That’s it for this post. Of course there is a lot more to say about testing but applying these principles will give you a head start to improve quality and agility in a cloud and DevOps world.