What I learned about Microsoft doing DevOps: Part 3 - A day in the life of an Engineer

2017-11-14

How often do you get the chance to work on a product that’s used by millions of users? And as an extra benefit, you even understand what you’re building and use the product from day to day. That’s what the engineers on the VSTS team experience. As you’ve seen in the previous parts (part 1, part 2), the move to the cloud and the increased deployment cadence changed a lot when it comes to planning and actually running the teams. In this part I want to focus on some of the changes the VSTS team made from the perspective of the engineers.

This is part 3 of a series on DevOps at Microsoft.

Dogfooding

Luckily, eating your own dog food isn’t meant literally. Instead, it refers to using your own product to develop and promote that product. For the VSTS team this means using VSTS as much as possible to build VSTS. In the past, the teams relied on a bunch of tools that weren’t part of VSTS, such as a custom Test Explorer and Tfs Runner. Having such internal tools means that the teams’ experience isn’t the same as that of customers. Moving the functionality that Microsoft required into VSTS fixes this: VSTS becomes a better product and the teams really do eat their own dog food.

Figure 1 A custom Test Explorer used within the teams

Under the name 1ES (1 Engineering System), the VSTS team is responsible for building all the tools that Microsoft needs to build software. That includes huge teams like the ones building Windows and Office. Sometimes this means that extra functionality has to be built to support those teams. A great example of this is the Git Virtual File System (GVFS). GVFS was built by the VSTS team to make sure that the enormous repository that the Windows team uses works well with Git. This work was then added to VSTS, and the required Windows driver changes were added to a feature update of Windows. And as we’ve come to expect, Microsoft completely open sourced GVFS so others can use it and make suggestions. The whole dogfooding effort and the 1ES efforts make VSTS an incredibly attractive product.

The biggest change in the life of a VSTS developer (and tester) is the move to a combined engineering role (see part 1 for more details). Where certain internal tools were previously used only by testers, these tools are now used by the whole engineering team. This brought issues to the surface, such as who owned these tools and who paid for maintaining them. Folding these tools into the product helped fix these issues.

Everyone on Master

One of the biggest issues that the team faced when it came to quality and engineering was branching. Branching is hard. The (now retired) branching guidance by the ALM Rangers shows a multitude of branching models that range from a single branch to complex schemas for multiple versions, service packs and hotfixes. The following figure shows the branching model used during Dev12. The team had a full-time engineer who was responsible for merging changes between branches. This was something of a black art and this person was held in high regard. Having long-running branches made integration hard. It also made testing difficult since you never knew against which changes your new code would run. Sometimes it took weeks to figure out why a certain feature didn’t work anymore after merging it into a release branch. In a cloud cadence world where you want to optimize flow and release as often as possible, this branching model is impossible to maintain.

Figure 2 The branching model used with Dev12

The VSTS team switched from this complex branching structure to a very simple one: Master Only. All the source is in a single repository. This repository has a master branch and all work is done by creating a short-lived feature branch on top of master. The engineers make small changes on their feature branch and are then responsible for merging their changes into master. Combine this with Git and the team now has a workflow where an engineer creates a branch, pushes their changes and then opens a pull request to master. The pull request is automatically validated by a set of builds and requires reviewers before the merge is allowed to succeed. Build policies run to catch common mistakes like checking in secrets. All this automation helps the team to merge to master as often as possible.

For an example of how to implement this yourself see Review code with pull requests and Improve code quality with branch policies.
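To make the idea of a policy that catches checked-in secrets more concrete, here is a minimal sketch. It is not how VSTS implements its policies: the `ChangedFile` shape, the rule names and the regular expressions are all assumptions made up for illustration, and a real scanner would need far more careful rules.

```typescript
// Hypothetical pull request policy: flag added lines that look like secrets.
// The types, rule names and patterns are illustrative assumptions,
// not the checks that VSTS actually runs.

interface ChangedFile {
  path: string;
  addedLines: string[]; // lines added by the pull request
}

interface Finding {
  file: string;
  line: number; // 1-based position within addedLines
  rule: string;
}

const secretRules: { name: string; pattern: RegExp }[] = [
  { name: "private-key-header", pattern: /-----BEGIN (RSA |EC )?PRIVATE KEY-----/ },
  { name: "connection-string-password", pattern: /password\s*=\s*[^;\s]+/i },
  { name: "hard-coded-token", pattern: /(api[_-]?key|token)\s*[:=]\s*["'][A-Za-z0-9+\/=]{16,}["']/i },
];

// Checks every added line against every rule and collects the matches.
function scanForSecrets(files: ChangedFile[]): Finding[] {
  const findings: Finding[] = [];
  for (const file of files) {
    file.addedLines.forEach((text, index) => {
      for (const rule of secretRules) {
        if (rule.pattern.test(text)) {
          findings.push({ file: file.path, line: index + 1, rule: rule.name });
        }
      }
    });
  }
  return findings;
}

// The policy passes only when the scan reports no findings.
function policyPasses(files: ChangedFile[]): boolean {
  return scanForSecrets(files).length === 0;
}
```

A build that validates the pull request could call `policyPasses` with the changed files and fail when it returns false, which blocks the merge until the secret is removed.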

An important aspect of this is automated testing. Although the VSTS code base is huge, developers can work with the code in Visual Studio. The product code and test code are always collocated. This allows the engineer to have a fast feedback loop where they can make a change and immediately run their tests.

To help with this the team has split tests into a couple of categories:

L0/L1 - Unit tests
- L0 – Broad class of fast in-memory unit tests
- L1 – Unit tests with more complex requirements e.g. SQL
L2/L3 - Functional tests
- L2 – Functional tests run against “testable” service deployment
- L3 – Restricted class integration tests that run against production

Before the team started this journey, most tests were L3 tests.

Figure 3 The team moved from a lot of L3 tests to L0 and L1 tests with some L2 and L3 tests

In a deliberate effort, most tests were rewritten until there were a lot of L0 tests, some L1 tests and even fewer L2 and L3 tests. All these tests can be run by an engineer on their local system. Pull requests run a subset of these test suites by default and an engineer can choose to run more test suites if required. When merging the changes to master, all tests run.
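The way test levels map to pull requests and merges can be sketched as follows. The level names come from the list above, but everything else is an assumption: the article only says that pull requests run "a subset", so the choice of L0 and L1 as the default, and the names `Trigger`, `defaultLevels` and `selectSuites`, are hypothetical.

```typescript
// Sketch only: the level names come from the article. The default mapping
// is an assumption, since the article only says pull requests run "a subset".
type TestLevel = "L0" | "L1" | "L2" | "L3";
type Trigger = "pullRequest" | "mergeToMaster";

interface TestSuite {
  name: string;
  level: TestLevel;
}

const defaultLevels: Record<Trigger, TestLevel[]> = {
  pullRequest: ["L0", "L1"],               // assumed default subset
  mergeToMaster: ["L0", "L1", "L2", "L3"], // all tests run on merge
};

// Returns the suites to run for a trigger, plus any levels the engineer opts into.
function selectSuites(
  suites: TestSuite[],
  trigger: Trigger,
  extraLevels: TestLevel[] = []
): TestSuite[] {
  const levels = defaultLevels[trigger].concat(extraLevels);
  return suites.filter((suite) => levels.indexOf(suite.level) !== -1);
}
```

Calling `selectSuites(suites, "pullRequest", ["L2"])` models an engineer who opts into the functional tests for a risky change, while `"mergeToMaster"` always selects every level.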

To help even further, the deployment code for a local deployment on an engineer’s system is the same as the scripts that run during a production deployment. This helps eliminate deployment and setup issues. By making it easier for engineers to run their code locally the quality of the pull requests improves.

Figure 4 A pull request showing the policies and test builds

But my code isn’t ready

I hear you think: ‘I can’t merge to master as long as my feature isn’t ready!’. And that sounds reasonable. Having half-baked features in master that are released into production sounds like a nightmare. Except if you start using feature toggles. The idea behind feature toggles is very simple: you hide your changes behind a toggle that’s either on or off. VSTS uses a custom-built feature toggle system. Not because they really wanted to build their own, but because at that time there was no feature toggle framework that suited their needs.

Take the following example. Imagine you have to build new revert functionality for a pull request. This is a simple button that’s added to the UI that exposes the revert functionality.

Figure 5 A new revert command is added to the pull request UI

To hide this button behind a feature toggle, an engineer first needs to define the toggle in a configuration file:


<?xml version="1.0" encoding="utf-8"?>
<!--
  In this group we should register TFS specific features and set their states.
-->

<ServicingStepGroup name="TfsFeatureAvailability">
  <Steps>
    <!-- Feature Availability -->
    <ServicingStep name="Register features" stepPerformer="FeatureAvailability">
      <StepData>
        <!-- specifying owner to allow implicit removal of features -->
        <Features owner="TFS">
          <!-- Begin TFVC/Git -->
          <Feature name="SourceControl.Revert" description="Source control revert features" />
          <!-- remaining features omitted from this excerpt -->
        </Features>
      </StepData>
    </ServicingStep>
  </Steps>
</ServicingStepGroup>

The following code will then check if the toggle is enabled and decide if the button should be visible:


// Adds the revert button only when the SourceControl.Revert toggle is on.
private addRevertButton(): void {
    if (FeatureAvailability.isFeatureEnabled(Flags.SourceControlRevert)) {
        this._calloutButtons.unshift(
            <button onClick={() => Dialogs.revertPullRequest(
                this.props.repositoryContext,
                this.props.pullRequest.pullRequestContract(),
                this.props.pullRequest.branchStatusContract().sourceBranchStatus,
                this.props.pullRequest.branchStatusContract().targetBranchStatus)}
            >
                {VCResources.PullRequest_Revert_Button}
            </button>
        );
    }
}

And that’s it. Now the button is only shown when the feature toggle is on. Turning the toggle off doesn’t require a redeploy or any code changes. By using a smart feature toggle system you can even enable toggles for certain users and allow a gradual rollout of your new feature. Feature toggles are the magic sauce that makes everyone working on master possible. Code that isn’t finished yet can be merged to master and deployed to production without any problems. Of course ‘not finished’ still means that the code compiles and all the L0-L3 tests run successfully.
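To illustrate how per-user toggles and a gradual rollout could work, here is a minimal in-memory sketch. It is not the VSTS `FeatureAvailability` system: the `FeatureToggles` class, its method names and the simple hash in `bucketOf` are all hypothetical and chosen only to show the idea.

```typescript
// Hypothetical in-memory toggle store. This is not the VSTS
// FeatureAvailability API, only a sketch of the technique.
interface ToggleState {
  enabled: boolean;          // global on/off switch
  enabledForUsers: string[]; // users who always see the feature
  rolloutPercentage: number; // 0..100, share of other users who see it
}

// Deterministic bucket in 0..99, so a user gets the same answer every time.
function bucketOf(name: string, userId: string): number {
  const key = name + ":" + userId;
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) % 1000003;
  }
  return hash % 100;
}

class FeatureToggles {
  private states: { [name: string]: ToggleState } = Object.create(null);

  // New features start switched off so unfinished code stays hidden.
  register(name: string): void {
    this.states[name] = { enabled: false, enabledForUsers: [], rolloutPercentage: 0 };
  }

  setEnabled(name: string, enabled: boolean): void {
    this.stateOf(name).enabled = enabled;
  }

  enableForUser(name: string, userId: string): void {
    this.stateOf(name).enabledForUsers.push(userId);
  }

  setRolloutPercentage(name: string, percentage: number): void {
    this.stateOf(name).rolloutPercentage = Math.max(0, Math.min(100, percentage));
  }

  isEnabled(name: string, userId: string): boolean {
    const state = this.states[name];
    if (!state) return false; // unknown toggles are treated as off
    if (state.enabled) return true;
    if (state.enabledForUsers.indexOf(userId) !== -1) return true;
    return bucketOf(name, userId) < state.rolloutPercentage;
  }

  private stateOf(name: string): ToggleState {
    const state = this.states[name];
    if (!state) throw new Error("Unknown feature toggle: " + name);
    return state;
  }
}
```

Because the state lives outside the compiled code, flipping a toggle or raising the rollout percentage changes behavior at runtime without a redeploy. A real system would persist the state and share it across servers instead of keeping it in memory.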

If you’re interested in using feature flags then have a look at my previous blog post Playing with Feature Flags and LaunchDarkly.

What if something goes wrong

You’re happily working on your feature. You’ve created a feature toggle and a short-lived branch, you’re running your L0-L3 tests and you are in the flow. But you’re also a DevOps team. This means that as an engineer you’re not only responsible for building new features. You’re also responsible for keeping production up and running. And that could mean that your role suddenly changes when something happens. These production problems create new work that needs to be done. You not only have to fix the issue, you also need to make sure it doesn’t happen again, and so you add work to the backlog of your team.

This process creates a couple of issues. First, context switching costs time. Being interrupted by a production issue disrupts your flow. Second, the items that are added to the backlog for production issues aren’t as interesting as the new work that has to be done. Because of this, these items often sit on the backlog without being picked up by the team.

The VSTS team encountered all these issues. To fix this, they created a role named Live Site Engineer. A team self-organizes into two sub teams. The Feature Team focuses on building new features. The Live Site Team deals with all live site issues and interruptions. When there are no interruptions, they work on the live site issues that sit on the backlog. The size of the Live Site Team may vary based on live site debt and demand, and its membership rotates each sprint.

Figure 6 The F-Team and L-Team sub teams.

There are many ways to structure a team to deal with these kinds of situations. On http://web.devopstopologies.com/ you can find examples of what different organizations have tried.

Conclusion

Being an engineer on the VSTS team is definitely exciting! A lot has changed since VSTS began its journey to the cloud. All engineers now work directly on master. Small feature branches are merged through pull requests that run all kinds of automated tests and checks. Feature toggles are used to hide work that’s not yet done while still allowing it to be merged to master and deployed to production. And finally, live site sub teams are used to avoid too much disruption and make sure that teams can work both on new features and live site incidents.

The use of feature toggles and a master-only strategy is essential for a successful DevOps implementation. If you are still using multiple branches then looking into these techniques will absolutely improve your process.

In the next part, we’ll look into how measuring everything that you can think of helps you in running a huge product like VSTS.