Our code branching journey: From one to many to one

Tableau developers are sharing their code branching journey—from one to many to one.

When facing tough engineering decisions within the Tableau Dev Team, we’ve always appreciated reading about the choices, trade-offs, and thinking of our colleagues across the software industry. In that spirit, I’d like to share one of our engineering stories—a branch and build management decision process that spans several years. If you geek out about scaling coding teams and effective branching strategies, then here’s hoping what we learned from our sometimes-painful experiences can benefit you.

A scaling story

This story begins during the endgame of a major release many years back. We had all been committing to one main branch for several years, growing from nothing to almost 100 commits per day. Processes and systems tend to break down fairly regularly when your team is growing by 50-80 percent annually, and as the endgame bug fixes flew in fast and furious, we reached a tipping point in build stability.

Why at that point? Primarily because running the build and tests took quite a while, commonly on the order of 1-2 hours. After re-syncing to head, you’d rarely get to commit your changes before someone else checked in ahead of you. We’d also reached the point (somewhere around 100,000 tests for us, if my memory serves) where intermittent test failures were fairly common. The build+test+commit sequence was not automated; it was the manual responsibility of the developer. Take those conditions, pile on a bunch of pressure to get fixes in fast, and we had a recipe for cascading failures that piled up faster than we could untangle them.

When that happened, we would lock the branch, identify the offending (or colliding, but that was still rare) commits, back out some changes, get a green test run, and re-open the branch. However, all the developers who had been waiting with their commits would then try to jump in at once, creating a thundering herd effect that would immediately destabilize the branch again, and around another loop the cycle would go. We leaned on being extra careful, plus some heroic firefighting, to squeeze through that release.

The big switch

That rough patch taught us a few things. We had too many unreliable tests. When an intermittent test failed for a reason unrelated to the pending commit, sometimes the developer would decide to ignore the error and commit anyway. Usually they were right, but being wrong even 5 percent of the time was too much. We knew that our full build and test cycle was too long. We also knew that our overall architecture was monolithic and that our lack of modularity was a systemic cause of complexity and regressions. Lastly, we knew that these systemic factors were going to take quite a while to change, and that we needed relief as soon as possible. To make a long story short, we reasoned that if we couldn’t keep one big branch stable, then we should try lots of smaller ones! By reducing the commit flow in any particular branch, the theory went, the citizens of each branch would have an easier time keeping things green. I haven’t heard an industry term for this, so I’ll call it “load-branching”.

"Load-branching: Creating more branches and load-balancing your developers/commits across those branches, with the hope of improving stability and productivity."

Guided by the code and feature affinity we were aware of, we split into about a dozen fairly fine-grained branches: server, query, native, mac, and so on. Each of these branches was a child/spoke off of Main, which was the parent/hub through which all code flowed. (We did experiment with a “group” branch sitting between several development branches and Main, but quickly found that it was bad ROI.) The need to do bulk code integrations was what crystallized the position of “branch owner” as a branch management role.

"Branch Owner: The courageous (or unlucky) individual responsible for integrations to and from a branch, and more generally for ensuring branch health and code/build stability."

Overall, this seemed to work! Commit contention was greatly lowered. Branches were consistently open. Branch Owner was generally a part-time role for one person. We had bought ourselves time to work on reliability and modularity—though looking back, we wish that we had invested even more, and sooner.

Different challenges

The next direct pain to rear its head was code latency. Some branches stayed “close” to Main (as measured by their frequency of integrations), but others drifted quite out of date as the months went on and being Branch Owner lost its novelty. Code flowed down to the child branches fairly consistently, but dealing with unreliable tests and other quality issues made integrating back up to Main a high-effort exercise.

That code decoherence due to integration latency caused painful work blockages, as impacted individuals couldn’t do much other than wait for the relevant code to eventually come together. This led us back down a path of consolidation. We picked off a few small branches to get down to eight, and then it took a more coordinated rebalancing to jump down to five. Now the branch names were more general (experience, core, data, etc.), and each had a corresponding stabilization branch for payload prepping. We implemented a strict two-week integration cycle, with each branch holding the commit-to-main mutex for two days out of ten as its integration window. Would a two-week maximum code latency be fast enough? Well…yes and no.

Coping with two weeks of latency was more manageable, but despite our best efforts, branches hit their window only 70-80 percent of the time. By this point Branch Ownership had become a full-time job, or even more, as we split the integrator/merge role out from the automation-wrangling role and spun up rotations to fairly distribute the tax of losing someone full-time for a sprint. A couple of these five branches were now as large as our original branch had been, and while things were generally not on fire, they weren’t close to comfortable either.
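
For concreteness, the window rotation can be pictured as a simple round-robin over business days. The sketch below is purely illustrative: the branch names, the anchor date, and the calendar handling are hypothetical stand-ins, and our real scheduling (holidays, slipped windows, swaps) was handled by people rather than a script.

```python
from datetime import date, timedelta
from typing import Optional

# Illustrative only: hypothetical branch names and anchor date standing in for
# our real five branches and their negotiated calendar.
BRANCHES = ["experience", "core", "data", "branch-four", "branch-five"]
CYCLE_ANCHOR = date(2016, 1, 4)  # an arbitrary Monday starting a ten-business-day cycle


def integration_owner(day: date) -> Optional[str]:
    """Return which branch holds the commit-to-main mutex on the given day."""
    if day.weekday() >= 5:  # no integration windows on weekends
        return None
    business_days = 0
    d = CYCLE_ANCHOR
    while d < day:
        if d.weekday() < 5:
            business_days += 1
        d += timedelta(days=1)
    # Ten business days per cycle, two consecutive days per branch.
    return BRANCHES[(business_days % 10) // 2]


print(integration_owner(date(2016, 1, 13)))  # -> "branch-four"
```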

As the team grew and the number of changes in each branch kept growing, integration payload size increased and the merges kept getting more difficult and risky. Over time, test breakages gradually became more tolerated and ignored, as “this came from another branch” or “it’s fixed in another branch” became unsatisfying but accepted excuses. The norm for committing became “something is broken, but the failures in my test results aren’t any worse than what’s already in the branch right now,” which was a demoralizing and ineffective way to try to maintain branch health. Shades of red had come to dominate the traffic-light signal.

Pushing the pendulum back

[Graphic: our branching journey from one to many to one]
After camping at five branches for a while, it was time for the endgame: moving back towards one branch and eliminating code latency. We first consolidated down to 2.5 branches, where the “.5” represents a variation on load-branching that we tried here. A large portion of development was in platform C++ code, while another large portion was in client code that depended on the C++. We connected these branches (“Cpp” and “Near”) directly so they could be integrated with each other more quickly, and to keep those integrations quick and easy, we locked a large amount of the code in the “Cpp” branch. This was both better and worse: code latency was far lower on average, but a single code change that spanned platform and client code had to be broken up into two commits with an integration in between.

We added a ratchet to our reliability by introducing a gated commit system (for a given set of tests, only let a code commit through if everything passes). Turning it on exposed the total size of the remaining long tail of reliability issues (i.e., there was a lot of blockage and pain), and re-motivated another multi-month push on those issues before we turned the gate back on for good. In the meantime, we were still spending about 12 people full-time on branch ownership: merge effort increases super-linearly with payload size, and we found we had reached the break-even point in the trade-off between the number of integrations and their size.
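
The exact mechanics depend on your CI system, but the core of a commit gate is small. Here is a rough Python sketch of the idea; the `./run_tests.sh` runner, the suite names, and the branch names are hypothetical, and the real thing ran inside our internal CI rather than as a local script.

```python
import subprocess

# Hypothetical gating suites: only tests that have proven reliable get to vote.
RELIABLE_SUITES = ["unit", "integration-smoke"]


def gate(candidate_branch: str, target: str = "main") -> bool:
    """Merge the candidate into a throwaway branch and run the gating suites."""
    subprocess.run(["git", "checkout", "-B", "gate-workspace", target], check=True)
    subprocess.run(["git", "merge", "--no-ff", candidate_branch], check=True)
    for suite in RELIABLE_SUITES:
        result = subprocess.run(["./run_tests.sh", suite])  # hypothetical test runner
        if result.returncode != 0:
            print(f"Gate rejected {candidate_branch}: {suite} failed")
            return False
    return True


# Only a green gate is allowed to advance the target branch.
if gate("dev/my-change"):
    subprocess.run(["git", "push", "origin", "gate-workspace:main"], check=True)
```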

As we continued cranking our systemic improvement ratchets, the set of proven-reliable tests in gated commit kept growing, and our modularity efforts also started to have a noticeable effect as we broke bits of the monolith off into a polyrepo ecosystem. We retired the “Far” branch next, so that big bulk payloads were no longer traversing Main. The 1.5 remaining branches operated on a lower validation bar which, no surprise, allowed us to integrate on a daily basis with less heroic effort. (Side note: this can be independently useful as a “modularization ratchet” strategy to drive the creation of a proper API boundary between components.)

One branch

The last consolidation to one branch relied on one more key technical process downstream of the commit gate: we started muting new post-commit test failures, lowering the visible red-to-green recovery time by an order of magnitude. Optimizing the red/green signal to reflect “do I need to do anything?” for the largest number of people was a clear win. That doesn’t mean problems can be hidden or ignored, though: we generate (automatically, at this point) a high-priority defect of the special type “branch problem” and monitor the overall number of broken windows in the process of recovery. In dealing with these, immediate and aggressive back-out is the best and quickest path to recovery whenever possible: no commit is more important than a green branch state.
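
To make the mechanism concrete, here is a rough sketch of the muting flow; the result and defect shapes are hypothetical stand-ins for our internal systems. A newly failing test gets muted so the branch signal turns green again quickly, while an automatically filed “branch problem” defect keeps the failure visible until it is fixed or the offending commit is backed out.

```python
from dataclasses import dataclass, field


@dataclass
class TestResult:
    name: str
    passed: bool
    failure_output: str = ""


@dataclass
class BranchHealth:
    muted_tests: set = field(default_factory=set)
    open_defects: list = field(default_factory=list)

    def handle_post_commit_results(self, results):
        for test in results:
            if test.passed or test.name in self.muted_tests:
                continue
            self.muted_tests.add(test.name)      # silence the red signal...
            self.open_defects.append({           # ...but never hide the problem
                "type": "branch problem",
                "priority": "high",
                "title": f"Muted failing test: {test.name}",
                "details": test.failure_output,
            })


# The branch stays green for everyone else, while the open-defect count shows
# how many broken windows are still in the process of recovery.
health = BranchHealth()
health.handle_post_commit_results([TestResult("query_smoke", passed=False)])
print(len(health.open_defects), "branch problem(s) open")
```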

In conclusion

Our transition to one branch was almost two years ago now. We have several hundred developers committing to a single trunk, without long-lived feature branches. We’ve generally found that, all else being equal, more branches do not result in lower long-term engineering costs. Load-branching can move costs around or concentrate them, but it doesn’t actually fix any of the underlying issues.

Code latency is the primary counterbalancing cost of load-branching. Modularity is important, and load-branching increases the costs of poor modularity via integration conflicts. Reliability is important, and while load-branching can reduce the experience of instability, it increases the organization’s total surface area of exposure to instability even more. The costs of instability can be concentrated and contained through the combination of load-branching plus branch owners—but while this combination lowers direct cost to the majority of developers, it does so by focusing the cost on the branch owners, and hurts the clear signals underlying the essential social norms around responsibility and rapid recovery.

Practically, my parting tips are as follows:

  • Use modern automated CI pipelines that gate commits (whether or not you’re using pull requests) to guard against human mistakes and accidental breakages.
  • Pay attention to test reliability before it starts to become an acute issue.
  • Invest regularly and substantively in disciplined refactoring and evolutionary design.
  • Use trunk-based development as consistently as possible.
  • Retrospect regularly.

Good luck! Want to share your story or discuss the above? Swing by this forum thread.