Fast Filewise Git Blame
Saturday, May 18, 2024.
When was each file in a git repository last changed, and who changed it? Here's a short shell script, really just a single pipeline, that produces a fast filewise git blame report:
#!/bin/sh
# One git log pass: each commit prints a "date<TAB>email" header followed by
# --name-status lines such as "M<TAB>path" or "R100<TAB>old<TAB>new".
TZ=UTC git log --name-status --date=iso-strict-local --pretty="%ad%x09%ae" "$@" |
perl -F'/\t/' -lane '
  if (/^[ACDMRTUXB]/) {
    # Status line: renames and copies carry two paths; take the new one.
    $path = @F>2 ? $F[2] : $F[1];
    # Only report paths that still exist in the working tree.
    print "$date\t$email\t$path" if -e "$path";
  } elsif (@F) {
    # Commit header line: remember its date and author email.
    ($date, $email) = @F;
  }
' |
sort -k3,3 -k1,1r |  # group by path, newest date first within each path
uniq -f2             # keep the first (newest) line per path
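
The Perl filter relies on the shape of that git log output: each commit contributes one header line (date, a tab, the author's email) followed by one --name-status line per changed file. Roughly, with made-up commits and the date formatting approximated, the filter's input looks like this:

2024-04-06T22:13:23	x@y.com

M	src/cat.c
R100	src/old-name.c	src/new-name.c

2024-01-01T13:22:42	a@b.com

M	src/basename.c
M	src/chcon.c

File lines start with a status letter (M, A, D, R, and so on), and renames and copies list two paths, which is why the filter takes the last field. Header lines start with a digit, so they fall through to the elsif branch, and blank lines split into an empty @F and are skipped.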
The report looks roughly like this:
~/src/coreutils $ git-filewise-blame src | head
2024-01-01T13:22:42 a@b.com basename.c
2024-03-19T15:55:18 a@b.com basenc.c
2023-10-27T15:56:39 x@y.com blake2/b2sum.c
2021-11-01T05:30:38 x@y.com blake2/b2sum.h
2021-12-18T17:34:31 x@y.com blake2/blake2b-ref.c
2021-12-18T17:34:31 x@y.com blake2/blake2.h
2022-09-15T05:30:31 x@y.com blake2/blake2-impl.h
2016-10-31T13:29:34 a@b.com blake2/.gitignore
2024-04-06T22:13:23 x@y.com cat.c
2024-01-01T13:22:42 a@b.com chcon.c
...
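
The deduplication happens in the last two pipeline stages: sort orders lines by path (field 3) and, within each path, newest date first; uniq -f2 skips the first two fields (date and email) when comparing, so it effectively compares paths and keeps only the first, i.e. newest, line for each file. A tiny standalone demonstration with hypothetical data:

$ printf '2023-06-01T00:00:00\tb@c.com\tfoo.c\n2024-01-01T00:00:00\ta@b.com\tfoo.c\n' | sort -k3,3 -k1,1r | uniq -f2
2024-01-01T00:00:00	a@b.com	foo.c

Only the newer line survives, because both lines compare equal once the leading date and email fields are skipped.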
The key word here is fast. Every other approach I've found executes git log once for every file in your checkout. Here's one example of the slow approach:
git ls-files | while read -r file; do
  git log -n 1 --pretty="Filename: $file, commit: %h, date: %ad" -- "$file"
done
If your repository has many files and a deep history, an exec-for-every-file approach is horrifically slow – each git log invocation has to walk the commit history until it finds a commit touching that file, so the total cost grows roughly with the number of files times the depth of the history, and can run to minutes or even hours. By contrast, the git-filewise-blame approach consumes the output of a single git log command. On my laptop, it takes 1 minute to filewise blame the entire WebKit git repository, which has 405k files and 275k commits (!).