As humans, we sometimes make mistakes. One of those is committing sensitive data to our Git repositories. The canonical term for fixing those mistakes is Git scrubbing, which is just a fancy phrase for removing passwords, API tokens, license keys, etc. from a Git repository. Keep in mind that preventing those mistakes is the ideal solution, but, if you find yourself in the position where this already happened, I want to help. Let’s take a look at a couple of situations where you may need to scrub your repo and how you would go about doing so.
I Committed Some Sensitive Info
Hey, it happens. Do you have the power to change the value of the sensitive data?
Whew, I Can Change the Password/Token/Key
Great! Let’s scrub, or remove, the file containing the sensitive data from our repository by running the following commands in order:
git rm --cached <path-to-file>
git commit --amend -CHEAD
These commands remove the file containing your password from the index and rewrite your most recent commit without it, so they only help when the sensitive file was added in that last commit. If you did not push your commit containing sensitive data to a Git repo hosting service like GitHub, you can keep using the existing password/token/key after running those commands since you only worked with it locally. If you did push the sensitive data up, however, you need to git push -f the amended commit you just made, and you need to change your password/token/key.
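Here is a minimal sketch of that flow in a throwaway repo, so you can try it without touching real work. The repo location, file names, identity, and token value are all placeholders I made up for illustration, and it assumes the secret landed in the most recent commit.

```shell
# Sketch only: every name and value below is a hypothetical placeholder.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "hello" > README.md
git add README.md
git commit -q -m "docs: add readme"

# The mistake: a commit that includes a file holding a secret.
echo "API_TOKEN=not-a-real-token" > .env
echo "console.log('hi')" > app.js
git add .env app.js
git commit -q -m "feat: init js"

# The fix: drop the file from the index, then rewrite the last commit,
# reusing its message (-C HEAD is the spaced form of -CHEAD).
git rm -q --cached .env
git commit -q --amend -C HEAD

# List what the amended commit tracks; .env should no longer appear.
tracked=$(git ls-tree -r --name-only HEAD)
echo "$tracked"
```

The file itself is still sitting in the working directory afterward, which is usually what you want for a local .env. Only the commit has changed.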
PSA: Any time you send passwords or other sensitive data to your remote repo, you should consider it compromised and update the password.
Let’s talk about what to do when your sensitive data is compromised and you can’t change the password, token, or key.
Oh No! I Can’t Change the Sensitive Info
Not ideal, but we can still fix it. An example of unchangeable sensitive data might be a license key that you only get one of (like old school Photoshop or a CMS) and can’t change without communicating with customer service. Another example might be if your private package artifacts are committed publicly. In this case, we need to rewrite our Git history, both on local and remote, to remove any trace of our sensitive data.
There are a few tools that can help you do this (git filter-branch, BFG Repo Cleaner, and git filter-repo to name a few), and all of the documentation around them seems intended to frighten the reader, who I imagine is an already frantic developer who’s just realized they’ve compromised their application. Why are these documents written with such strong warnings? Because rewriting the entirety of a project’s Git history is a serious action, and it’s important to understand why you’re doing it and how it’s going to work. It’s a pretty powerful move, but it’s no more dangerous than rebasing, which we do a lot at Sparkbox in order to maintain a linear history. The key is to know when to do it and why. So with intentionality and understanding, let’s move forward.
Git Filter-branch in Action
First, I’m going to
cd into my Git scrubbing example repo. If you clone this repo down and run
git log --oneline, you’ll notice that I have a suspicious commit:
de6515f docs: link to git scrubbing article in readme
d13467e feat: hello world
6ec9e03 feat: init env
a7c3ac2 feat: init js
8d46088 feat: init html
fc670d3 Initial commit
Usually, we don’t check
.env files into our repos because they hold a lot of sensitive information. Before we fix that, let’s check the state of our tags by running git tag, which lists tag1 and tag2.
Now that we know the state of our repo, we can prepare to remove the
.env file with the following
git filter-branch command, which I found in GitHub’s docs, but the command could also be pieced together using the git filter-branch documentation:
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty --tag-name-filter cat
That’s a lot of flags. Let’s dig into what all this means before we run it:
--force: You can try running without this, but you might get this error:
Cannot create a new backup. A previous backup already exists in refs/original/. Force overwriting the backup with -f.
That’s because filter-branch won’t start if there’s an existing refs/original/ directory, so we need to force it to overwrite the existing backup.
--index-filter: The filter is what tells Git how to rewrite the history. There are other filter options, but here’s a pro tip: this is faster than --tree-filter because it doesn’t check out the tree.
git rm --cached: This removes and unstages paths from the index and only the index. So your working files won’t be affected.
--ignore-unmatch: If no files match the path you’ve supplied to git rm, this tells it to exit with a zero status anyway. That matters because the filter runs on every commit, including the ones that never contained the file.
--prune-empty: Sometimes after running filter-branch you’re left with empty commits. This flag removes them, which is nice for keeping the commit history clean. You wouldn’t want to see empty commits in your history.
--tag-name-filter cat: This is our filter for rewriting our repo’s tag names. For every tag that points to a rewritten commit, the filter receives the old tag name and says “change the tag name to XYZ” depending on what it outputs. Here, we’ve passed cat, which just accepts the updated reference SHA without changing the tag name.
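To make the --ignore-unmatch piece a little more concrete, here is a small sketch comparing the exit status of git rm --cached with and without it. The throwaway repo and identity are placeholders of mine; the only thing it tries to show is why the flag keeps the filter from failing on commits where .env doesn’t exist.

```shell
# Sketch only: throwaway repo, hypothetical identity.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"

# No .env is tracked here, so a plain `git rm --cached` reports an error.
if git rm -q --cached .env 2>/dev/null; then
  strict_status=0
else
  strict_status=$?
fi

# With --ignore-unmatch the same command exits with a zero status.
if git rm -q --cached --ignore-unmatch .env; then
  lenient_status=0
else
  lenient_status=$?
fi

echo "without --ignore-unmatch: exit $strict_status"
echo "with --ignore-unmatch: exit $lenient_status"
```

Since filter-branch aborts the whole operation when a filter command returns a non-zero status, the lenient form is the one you want inside --index-filter.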
Now that we understand all the pieces of our command, we can run it. We get the following output, which tells us that our commit and tag SHAs are being rewritten and our
.env file removed.
Rewrite 6ec9e03d1ab89c8374f624015853705c6147786a (3/6) (1 seconds passed, remaining 1 predicted) rm '.env'
Rewrite d13467efe14dd0bb5bce2578fb0e48d5a36f35c7 (3/6) (1 seconds passed, remaining 1 predicted) rm '.env'
Rewrite de6515f8df775d0871cd2cc4400ea9352ce635cb (3/6) (1 seconds passed, remaining 1 predicted) rm '.env'
Ref 'refs/heads/master' was rewritten
tag1 -> tag1 (de6515f8df775d0871cd2cc4400ea9352ce635cb -> 1904bea774c1060d793dc615dbd438f52650139d)
tag2 -> tag2 (de6515f8df775d0871cd2cc4400ea9352ce635cb -> 1904bea774c1060d793dc615dbd438f52650139d)
Let’s see what our
git log --oneline looks like now:
1904bea docs: link to git scrubbing article in readme
4b3afd8 feat: hello world
c9b8844 feat: init js
e9dc984 feat: init html
07e01dc Initial commit
All our commit SHAs were rewritten, and our feat: init env commit was completely removed because of our
--prune-empty flag. If we run
git tag again, we’ll see that the tag names haven’t changed because of our
--tag-name-filter cat flag. Both tag1 and tag2 are still there, now pointing at the rewritten commit 1904bea.
We’ve successfully rewritten our commit history, so now let’s push it up with
git push origin --force --all and
git push origin --force --tags to get our remote repo up to date.
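If you want some reassurance that the rewrite did its job, ideally before you force push, one option is to ask Git whether any commit reachable from your branches and tags still touches the file. The sketch below rebuilds a tiny stand-in for the example repo (the file contents, identity, and secret value are made up) and runs the same filter-branch command. Setting FILTER_BRANCH_SQUELCH_WARNING=1 only silences the notice that newer Git versions print before running.

```shell
# Sketch only: a throwaway stand-in repo with hypothetical contents.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"

echo "SECRET=not-a-real-secret" > .env
git add .env
git commit -q -m "feat: init env"

echo "console.log('hello world')" > app.js
git add app.js
git commit -q -m "feat: hello world"
git tag tag1

# Same command as in the article, with progress output discarded.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty --tag-name-filter cat >/dev/null

# Commits on branches and tags that touch .env: expected to be none.
rewritten=$(git log --oneline --branches --tags -- .env)

# --all also walks the backup refs under refs/original/.
backup=$(git log --oneline --all -- .env)

echo "on branches and tags: ${rewritten:-none}"
echo "including backups: ${backup:-none}"
```

The second check is a reminder that filter-branch keeps a backup under refs/original/, so the old commits stay in your local clone until that backup is removed.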
At this point, you might be good to go. However, if you have any pull requests, open or closed, that include the sensitive data, you’ll need to contact your Git hosting provider and ask them to remove them.
For any work that existed prior to the
git filter-branch operation, the team should rebase off the repo’s default branch. Sparkbox prefers rebasing commits on top of the default branch instead of merging because rebasing gives us a clean Git history without any merge commits. In this case, if we did create a merge commit, we would risk reintroducing all of the old history that we just removed with
git filter-branch. That’s why work needs to be rebased at this point.
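One way to do that rebase is git rebase --onto, which replays only the branch’s own commits on top of the rewritten history. This is a sketch under assumptions, not a prescribed workflow: the feature branch, the old-base tag marking the pre-rewrite commit the branch started from, and the file contents are all hypothetical. In a real repo you would substitute the old commit SHA your branch was based on.

```shell
# Sketch only: branch, tag, and file names are hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"
default_branch=$(git symbolic-ref --short HEAD)

echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"

echo "SECRET=not-a-real-secret" > .env
git add .env
git commit -q -m "feat: init env"

# In-flight work that branched off before the scrub.
git tag old-base
git checkout -q -b feature
echo "wip" > feature.txt
git add feature.txt
git commit -q -m "feat: wip"
git checkout -q "$default_branch"

# Scrub the default branch only; feature still has the old history.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty -- "$default_branch" >/dev/null

# Replay only the commits after old-base onto the rewritten branch.
git rebase -q --onto "$default_branch" old-base feature

leaked=$(git log --oneline feature -- .env)
subjects=$(git log --format=%s feature)
echo "$subjects"
```

Naming the old base explicitly is the design choice here: it keeps the commit that added .env out of the set of commits being replayed.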
Remember how I said humans make mistakes? You know what else humans do? Learn from their mistakes. Removing sensitive data from a repo can be a lengthy process. Ideally, we would never commit sensitive data. Here are some ways you can prevent committing sensitive data to your repositories.
Add a .gitignore file at the beginning of your project as a first line of defense
Look at the changes you’re going to stage before you stage them. You can do this using
git add --interactive, using your code editor, or using other third-party tools like Kaleidoscope
Be intentional and thoughtful about the files you’re committing by staging files individually instead of using commands like
git add .
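As a rough illustration of that first line of defense, the sketch below sets up a .gitignore in a throwaway repo and checks that the file is really ignored. The .env pattern and file contents are just examples I picked; your project’s list of sensitive files will differ.

```shell
# Sketch only: throwaway repo with example file names.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Ignore env files before anything gets committed.
printf '%s\n' '.env' > .gitignore
echo "API_TOKEN=not-a-real-token" > .env
echo "hello" > README.md

# check-ignore exits with a zero status when the path is ignored.
git check-ignore -q .env

# Even a broad `git add .` skips the ignored file.
git add .
staged=$(git ls-files)
echo "$staged"
```

It is still worth reviewing what you stage, since .gitignore only protects the paths you thought to list.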
If you commit sensitive data, don’t panic. Remember that you can remove the sensitive files if the data is changeable, and you can use a tool to rewrite repo history to remove sensitive info if it’s not. Choose to learn from your mistakes and implement ways to prevent future issues. And now that you know how to prevent committing sensitive data, and how to fix it if it does get committed, go forth and fearlessly write great code!