Evolving a Codebase at Google Scale

I’ve spent over a decade working on developer infrastructure at Google, contributing to areas like the build system, the IDE, refactoring tools, and linters. Helping evolve one of the world’s largest codebases was a challenging task: what might seem like a tiny change can ripple across thousands of files, affecting teams you’ve never met. In my role, I worked on many migrations, and I had to make these changes “transparent” — improving the codebase’s health and keeping things running smoothly, without disrupting the rest of the engineers.

In this post, I’ll share some of my experiences working on developer tools and large-scale migrations. Keep in mind, this isn’t a guide on what you should do. Every codebase is different. For example, a multi-repo approach can work just as well for some companies. This also isn’t an official Google guide (for that, check out Software Engineering at Google). These are lessons from my personal experience, and I hope they can offer insights for other people working on large codebases.

A Consistent Codebase

Google’s monorepo is massive—billions of lines of code, shared by tens of thousands of engineers. When I first joined, one thing impressed me: how everything felt the same. Sure, there are exceptions like Chrome and Android, but for the most part, the development environment is incredibly consistent.

Let me highlight a few of the rules that guide development:

  • Only a handful of programming languages are allowed.
  • Language teams control the rules: they pick the compiler, define the standard libraries, and set coding restrictions.
  • The code review requirements are applied globally.
  • Everyone uses the same tools: a unified build system, a single code review tool, global presubmit checks, etc.

This consistency has its perks. It becomes much easier to share and reuse code across different projects. Engineers can depend on libraries and components developed by other teams without worrying about compatibility issues, since everything is built and tested together. And when a tool is used by many teams, it’s worth investing in it. Google can afford to have full-time engineers developing and maintaining tools like linters and formatters, knowing that the investment pays off across every project.

However, this homogeneity has its challenges, particularly when it comes to evolving the code and the tools themselves. Since everyone relies on the same infrastructure, any change—no matter how small—can potentially affect a large number of projects. To manage this, we rely heavily on automated testing and continuous integration systems that run checks on every change submitted. Even then, surprises happen, and that’s where the fun begins.

In this article, I’ll discuss how we rolled out changes in the codebase, whether that meant updating a library, changing a tool, or enforcing new conventions. Let’s start with the simplest, most local changes and expand the scale from there.

Code Visibility and Dependencies

The monorepo approach makes it easy for anyone to add dependencies on other libraries. Engineers love reusing code, but dependency management is probably one of the biggest challenges the Google codebase faces. When too many dependencies are added, the code becomes harder to maintain and refactor, and compilation times increase.

Private Libraries and Code Ownership

One of my favorite additions to the Bazel build system came around 2010: the ability to set a library’s “visibility”. Just as you can define public or private methods in a class, you can restrict the visibility of a library to just your project. This lets you iterate quickly on your code without worrying about breaking other teams’ projects. A code-ownership mechanism prevents other engineers from modifying files in your team’s directory, so they genuinely need your approval before they can start using your library. This helps maintain clear ownership and responsibility.

By keeping certain libraries private, you minimize unnecessary dependencies. This setup also simplifies the process of updating or refactoring code, as the impact of changes is confined to a smaller, more manageable scope. So by default, libraries are private until the owner decides to expose them more widely.
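For readers who haven’t used Bazel, here is a minimal sketch of the mechanism using the open-source version (the package and target names are made up): the visibility attribute on a target decides who is allowed to depend on it.

    # Hypothetical package, open-source Bazel syntax.
    mkdir -p my/team
    cat > my/team/BUILD <<'EOF'
    cc_library(
        name = "internal_lib",
        srcs = ["internal_lib.cc"],
        hdrs = ["internal_lib.h"],
        # Private: only targets inside //my/team may depend on this library.
        visibility = ["//visibility:private"],
    )
    EOF

Any target outside //my/team that tries to depend on it fails at build time with a visibility error, instead of silently creating a new dependency.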

If another team wants to reuse a simple function of yours, they always have the option of duplicating it. As the Go proverb says, “a little copying is better than a little dependency”. But of course, some code has to be shared.

Code Shared with Partner Teams

In cases where a library or tool needs to be shared across a few teams, visibility becomes a balance between collaboration and control. It can be set so that only specific projects may use the code, which enables collaboration while keeping dependencies explicit and manageable.

When you need to update a shared library, you can easily run the partner team’s tests, as the test infrastructure is very standardized. If the other team's code needs to be updated, it should be easy to contact them or send them a pull request.
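Here is a hedged sketch of both steps with open-source Bazel, again with made-up package names: widen the library’s visibility to the partner’s package, then run both teams’ tests before sending the change out for review.

    # In my/team/BUILD, the shared library now lists the partner explicitly:
    #   visibility = ["//my/team:__subpackages__", "//partner/team:__subpackages__"]
    # Because the test infrastructure is standardized, validating the partner's code
    # is just another test invocation:
    bazel test //my/team/... //partner/team/...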

Public Libraries and Tools

Some libraries and tools are so widely useful that they become public within the monorepo, available to every engineer. But once a library goes public, it can quickly transform into a dependency for hundreds, even thousands, of projects. This widespread use makes updates and refactoring much trickier. As an extreme example, compilers and the build system can be difficult to update, since literally everything depends on them.

If you decide to modify a widely used public library, you have to ensure that your changes don’t break any existing users. The testing infrastructure includes a distributed test system that allows you to run a large number of tests across all known users of a library with a single command. If your changes cause any tests to fail, you either need to rework your change or coordinate with the affected teams to update their code.
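I can’t show the internal system here, but the idea can be approximated with open-source Bazel commands (the library label is hypothetical, and at monorepo scale this exact query would be far too slow to run locally, which is precisely why the real system is distributed):

    # Find every test that transitively depends on the library, then run them all.
    bazel query 'tests(rdeps(//..., //my/team:utils))' |
        xargs bazel test --keep_going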

This responsibility is captured by the so-called "Beyoncé Rule" — "If you liked it, then you shoulda put a test on it." In other words, if you make a change, you’re responsible for ensuring it doesn’t break anything. But if some untested code fails, the burden falls on its owner. This approach encourages teams to cover their code with tests. Tested code will keep working. Untested code is bound to break as the codebase evolves.

Handling Large-Scale Changes

Automating the code changes

Changing a few files is easy. Changing twenty? Annoying. Changing hundreds? Let’s automate that. Generally, automation involves two steps: detecting patterns in the codebase and applying transformations. Depending on the task, each step can be simple or complex.

I’ve always been a fan of Unix tools and shell scripting. While it’s not possible to run “grep” on the entire codebase, the codebase is indexed and there’s a dedicated tool for running regexps on every file. After narrowing down the list of files to edit, “sed” can be a great option. Of course, more specialized tools also exist, depending on the situation.
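As a toy example of that workflow, with a hypothetical function rename (internally, the first step goes through the code index rather than a local grep):

    # 1. Narrow down the files that mention the old API within a subtree.
    grep -rl 'OldName(' some/project/ > files_to_edit.txt
    # 2. Rewrite them in place; always spot-check a few diffs before sending anything out.
    xargs sed -i 's/OldName(/NewName(/g' < files_to_edit.txt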

One of the first projects I initiated back in 2013 was Buildozer. This tool allows users to query Bazel build files and apply transformations. Buildozer became a key component for a variety of migrations, playing a crucial role when migrating the codebase to C++ modules by automating updates across thousands of build files.
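Buildozer is open source today, as part of bazelbuild/buildtools. Here is a small, hedged example of the kind of edit it automates, with made-up labels:

    # Point every rule in these packages at the new library instead of the old one,
    # then print the resulting dependencies to spot-check the edit.
    buildozer 'replace deps //old/lib:utils //new/lib:utils' //project/a:all //project/b:all
    buildozer 'print name deps' //project/a:all //project/b:all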

The effectiveness of automation largely comes from the high standardization of the code. Code formatters, style guides, and linting tools enforce a consistent style throughout the codebase. Before we introduced Buildifier, a code formatter, many migrations proved impossible. Editing files while maintaining the original formatting was a nightmare, and code owners often voiced their frustrations during reviews.
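Buildifier is also open source now; a single command rewrites every build file under a directory into the canonical format, which is what makes automated edits reviewable in the first place:

    # Reformat all BUILD and .bzl files under the current directory, in place.
    buildifier -r .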

When the code follows strict rules, it’s often easier to analyze it and transform it, as there are fewer edge cases to consider. Standardization reduces the variability that tools need to handle, making large-scale changes more feasible.

Automating the pull requests

How many files can you update in one atomic commit? There’s always a threshold where a change becomes “too big”. Source control systems may reject it, performance may lag, or you might fail tests or presubmit checks. Even the code review interface might struggle to keep up, or a human reviewer might ask you to split the change.

It can also happen that you spend time getting the change ready, reviewed, and tested, only to find a conflict right before submission. You then need to pull the new changes, rerun the tests, and hope nobody touches the affected files in the meantime.

The solution at Google is called Rosie. Rosie is an internal bot that breaks massive changes into smaller, manageable chunks, which are then tested and submitted independently.
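Rosie itself is internal, but the underlying idea is easy to sketch with standard tools: shard the change so that each piece can be tested, reviewed, and rolled back on its own (file names are hypothetical).

    # Shard the list of edited files into chunks of 50; each chunk becomes its own
    # change, with its own tests, reviewers, and, if necessary, rollback.
    split -l 50 edited_files.txt chunk_
    for chunk in chunk_*; do
      echo "$chunk: $(wc -l < "$chunk") files -> one independent change"
    done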

Unfortunately, making the large change splittable is not always straightforward. For instance, changing a function prototype while still supporting both old and new call sites can be tricky. Often, it’s better to introduce a new function, migrate the call sites, and then remove the old function.

Automating the approvals

As mentioned earlier, there are designated owners for specific directories. If the change is big and affects files in 100 directories, you might need to get approval from 100 people. Rosie can identify the required approvers, ping them and add more reviewers if someone is unavailable.

Remember my goal of making “transparent changes” that don’t disrupt engineers? Asking hundreds of people for a review is certainly disruptive, especially when the background for the change is not relevant to them. That’s where “global approvers” come into play.

For ten years, I was a global approver, a coveted status that allowed me to approve changes almost anywhere in the codebase (with a few exceptions for critical directories). I earned this responsibility as a junior engineer, and over the years I approved more than 40,000 changes, frequently with the help of automation. Global approval is the magic tool that expedites migrations deemed safe.

Handling Long-Term Migrations

With the strategies discussed above, it might seem like you can tackle migrations of any scale. However, you must consider the time factor. A migration can stretch over a week, a month, a year, or even longer. Take, for instance, the migration to Python 3; that journey took nearly a decade.

Time introduces layers of complexity. As mentioned earlier, you need to keep everything functioning throughout the migration. This often means supporting both the old and new methods simultaneously, which can increase maintenance costs. Long migrations are expensive and can confuse the users. Moreover, migrating a living system—one that’s constantly changing—presents its own challenges. It feels a bit like trying to change the wheels on a moving train.

But that’s part of the job. As the Software Engineering at Google book says:

“Software engineering” encompasses not just the act of writing code, but all of the tools and processes an organization uses to build and maintain that code over time. [...] Software engineering can be thought of as 'programming integrated over time'.

Restrict Visibility

One effective first step in a long-term migration is to restrict the visibility of deprecated elements. This strategy works particularly well when you aim to phase out a library or tool. By limiting access, you can prevent new code from relying on outdated libraries. This helps to reduce the migration's scope and ensures that efforts to remove or replace old code are not undermined by new usages.
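One way to do this mechanically, sketched with the open-source Buildozer and made-up labels, is to replace the deprecated library’s broad visibility with an allowlist of the packages that already use it, so the dependency graph can only shrink:

    # Freeze the set of packages allowed to depend on the deprecated library; anything
    # not on this allowlist now fails at build time instead of becoming a new user.
    buildozer 'set visibility //legacy/user/a:__pkg__ //legacy/user/b:__pkg__' //old/lib:legacy

Bazel also has a deprecation attribute that makes dependents print a warning at build time, which leads nicely into the next section.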

Using Warnings

To further discourage the use of deprecated elements, you can issue warnings. However, command-line warnings rarely work: at some point, an engineer will see a warning for something they didn’t introduce and choose to ignore it. Over time, warnings will accumulate, reducing the chances of someone noticing the problems they just introduced. From my experience, there’s a better way to surface warnings:

  • Show a warning in the IDE at the corresponding line.
  • Show a warning in the code review interface, at the corresponding line.

There are nuances to consider. Engineers may feel frustrated if they see warnings for code they didn’t write, so the warning is sometimes shown only on modified lines. But a warning can also be introduced by a change to a different line: removing the last use of a variable, for example, triggers an unused-variable warning on the declaration, even though that line wasn’t touched.

Google uses Tricorder, a program analysis ecosystem, to collect and surface these warnings. In general, warnings work better when they:

  • Are easy to understand, with a link to the documentation.
  • Are easy to fix, preferably with a fix generated that can be applied in a single click.
  • Minimize false positives.

But warnings need to be very carefully designed, or they may cause more harm than good.

Using Presubmit Checks

Since developers can ignore warnings, a warning can be promoted to a presubmit check. This reduces the number of regressions during the migration, but you have to be even more careful here: if the message isn’t well-documented, actionable, and easy to address, it can be highly disruptive for engineers.

Manual Work

Some migrations require a lot of manual work. In such cases, it’s common to share the workload within the team or even seek volunteers across the company. This kind of migration often requires extensive coordination and documentation.

There are also cases where a migration needs the cooperation of other teams; for example, the users of a library may have to migrate to the new solution manually. This is highly disruptive, as it can affect those teams’ priorities and schedules.

Generally, it’s expected that the team spearheading the migration handles most of the work, delegating tasks only when absolutely necessary.

Handling Common Pain Points

One frustrating issue you can run into during migrations is flaky tests: tests that fail intermittently even though nothing is actually wrong, because of a timeout, a race condition, or some other source of non-determinism. When the failure rate is low, say around 1%, the owners might not notice the issue, but you’re almost guaranteed to get burned when running 10,000 tests. To combat this, we rely heavily on systems that rerun tests and try to separate real breakages from the flakes. Note that Bazel has a flag (--runs_per_test) that helps detect flakiness.
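With the flag mentioned above, checking a suspicious test looks roughly like this (the target name is hypothetical):

    # Run the same test 100 times; a mix of passes and failures points to flakiness
    # rather than a real breakage introduced by the migration.
    bazel test --runs_per_test=100 --runs_per_test_detects_flakes //some/team:suspicious_test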

And it’s not just flaky tests that can slow down a migration. Teams often have their own specific presubmit checks in place—rules that are tailored to their code and workflows. While these checks are crucial for catching errors in the day-to-day, they can become a real headache when you’re trying to satisfy the checks in a large number of directories. A solution is to disable most presubmit checks when submitting these safe cleanup changes.

But even when the technical side is going smoothly, there’s another layer of complexity: communication and user support. With longer migrations, you’ll often end up with two solutions in the codebase: one that’s deprecated and one that’s the future. This can create confusion, especially for engineers who aren’t directly involved in the migration. Good documentation is key here: explain what’s happening, what the current state is, and how to proceed. It’s sometimes worth sending announcements, but only for the most user-visible migrations (sending an email to a large user base is expensive).

Dashboards can be useful to monitor the migration’s progress. They can give information about where you’re at, what’s been migrated, and where things might be breaking. Most importantly, they help spot regressions quickly.

Migrations can easily go wrong and significantly impact engineering productivity. To avoid the common pitfalls, Google has set up a Large-Scale Changes Committee: a group of experienced engineers who give guidance, share best practices, and spot problems in your migration process. They can also tell you when a migration simply isn’t worth doing.

Migration Anecdotes

Let me share a couple of experiences from my time at Google that illustrate these concepts.

After introducing the Starlark language to Google, we decided to migrate the build system from Python to Starlark. Everything was designed to allow a smooth, incremental migration, with both languages coexisting. But because the two languages differ, there was a huge number of corner cases to handle. The differences were intentional: we wanted to move away from the Python features we considered detrimental.

During the two-year migration, we migrated 200,000 build files to the new language. I aimed to make this transition as transparent as possible—I certainly didn’t want engineers to feel the need to learn a new language. One day, a long-time engineer approached me, surprised. “Wait, build files aren’t written in Python?” I replied, “Not anymore. I migrated your files a couple of years ago, and I’m glad you didn’t notice.”

On a different note, one of my earliest migrations involved fixing the glob function in our build system. Just two weeks after the migration, I found myself cc’d on a post-mortem document. My migration script hadn’t handled an edge case, and the glob function failed to match a crucial file. That file was left out of the tool’s release, and there was no test coverage for this scenario. In the end, the code reached production and broke Google Groups: the anti-DDoS protection failed because of the missing file and prevented any new messages from being posted. Did anyone blame me? No. The post-mortem identified various action items to prevent similar incidents in the future. This is how we learn.

Another painful experience involved a change to fix file visibility in Bazel. The problem was that regressions were being introduced every day by engineers who weren’t aware of the issue. We couldn’t implement a warning, because it would have required checking every build file that depended on the modified file, which would have been far too slow. In the end, we did many cleanup iterations and managed to submit the change, only to discover that we had broken numerous scripts in the repository that lacked testing. It was a painful lesson, but it reinforced the need for rigorous testing and communication.

Conclusion

Hyrum’s Law reminds us that "with a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody." This means that changes, no matter how trivial, can have ripple effects in unexpected places. Warnings and documentation are helpful guides, but they’re not enough on their own. True enforcement is essential: restricting visibility, adding presubmit checks, and requiring consistent formatting are the guardrails that keep things moving in the right direction.

Engineers naturally want to move fast, and that’s a good thing: it’s how innovation happens. But in a system of this scale, speed without structure can be dangerous. That’s where the infrastructure team comes in. They’re the ones laying down and maintaining the tracks, making sure engineers can move fast without derailing the whole system. Whether it's enforcing code standards or guiding large-scale migrations, their job is to balance speed with stability.

Yet, there are limits to this consistency. For a long time, there was significant fragmentation in the IDEs used across Google. This lack of uniformity created its own challenges and limited innovation. It’s something I’ll discuss in a future post. But for now, the takeaway is this: developer infrastructure isn’t just about keeping things running—it’s about empowering engineers to move quickly while staying on track. And that delicate balance is at the heart of everything they do.

Comments are closed, but I appreciate the feedback. Feel free to reach out.