History Simplification with git2-rs

2023-01-01

Markus Brückner

Writing a Git GUI like Corylus inevitably includes coming up with a decent layout for the commit history and its ancestry relationships. This is easy enough for the full history, as Git gives you a Directed Acyclic Graph (DAG) of commits. Partial histories (e.g. for a single file) are a different story. The graph is still directed and acyclic, but it's by default no longer connected, i.e. you cannot jump from one commit to its parents during layout because parents may be missing (e.g. because the file you're looking at didn't change there). The CLI has a neat little trick up its sleeve where it will rewrite the parent references to reconnect the graph. This article describes how this may be done (and, by extension, how Corylus does it).

Laying out the commit graph

At first glance the git commit graph is easy to handle: each commit has an arbitrary number of parents referenced by their unique ID. Laying that out in a vertical fashion is straightforward (Note: There are multiple ways of doing it with slightly different properties, but this is the way Corylus does it):

Order the commits by date and ancestry relation. libgit2 and by extension git2-rs already offer this functionality via the Revwalk struct with the Sort::TOPOLOGICAL | Sort::TIME mode. This ensures that you will only ever see a commit after all its children, which is kind of important for the layout algorithm.
Lay out the commits line by line in a table-like structure (easy in Corylus, as the UI is HTML/CSS anyway), so for each line we have to calculate 3 things:
- The position of the node, i.e. the dot representing where the commit is located.
- Branch and merge lines. If the current commit has more than one parent and/or child, we need to display some kind of branch/merge line starting/ending at the commit node.
- Any parallel ancestry lines just passing through because they represent a pallel branch to the current commit.
Draw the layout line by line in SVG.

The layout step is done by tracking an array of rails (so called because the graph looks a bit like the tracks in a train station in the end). Each rail either contains the ID of a commit expected there some time in the future or is empty (indicated by a None value in Rust or undefined/null in Typescript resepctively). Calculating these rails follows a few simple rules:

Commits are always placed on the left-most rail expectimg them as a parent or on an empty rail (for commits representing the HEAD of a parallel branch). This makes the whole graph gravitate to the left/line start over time.
Any occurences of the placed commit are removed and replaced with a None value.
The first parent of a commit is placed on the left-most rail formerly occupied by the commit. Any additional parents are placed on the first available rail to the right. (Note: This makes the graph expand the the right more than necessary, but makes the layout slightly easier to implement and, arguably, to read)
If any rail to the right expects them as a parent as well, we have a branch parent and need to draw the connections to that target rail. (Note: visually the terms "source" and "target" make little sense here because the graph is layed out in reverse chronological/topological order. Reading top to bottom merge lines are actually "outgoing" and branch lines are incoming. That is why I use the terms "branch" and "merge" here to make it clear what we're talking about".)
If a commit has more than one parent, we have a merge commit and need to draw the connections from the source rails as well.
Any independently occupied rails just carry through to the next line.

Drawing a single commit in the UI then can be done by simply drawing a circle for the commit, a few colored lines for the occupied rails and a few connections for branch and merge lines. Coloring is done based on the index in the rails array. Being able to draw each commit independently is crucially important because only a few commits are visible in the UI at any time, but drawing the whole graph quickly becomes impossible, even for small-ish repositories with a few hundred commits.

A partial commit graph

Here's where things get tricky: removing commits from the graph breaks ancestry relationships. If, e.g., you were to only request a graph for a particular file, git2-rs doesn't have much to offer. There's the Revwalk struct representing an iterator over the history. While this offers some ability to filter (see all the hide* methods for details), these are of no use here because they don't allow for filtering individual commits, but rather filter the commit and all its parents, which is too much. Luckily rust's standard Iterator's capabilities are enough to filter the history. What it doesn't do, however, is rewriting the parent relationships, breaking the commit graph in an interesting fashion:

The history graph has a whole lot of pseudo-parallel branches because the layout algorithm doesn't know that these are actually ancestors of each other, as intermediate commits are filtered

Since basically every new commit is unexpected for the layout algorithm (as it never got to see the intermediate commits that were filtered), it places them on separate rails. On the other hand rails seem to start in the middle of nowhere, because the expected parents never come (for the same reason). Clearly this is NOT what we want.

The git CLI has this clever trick where it rewrites the parents of a commit in order to ensure a correct graph layout. Sadly neither libgit2, nor git2-rs can do the rewrite. I'll have to do that myself then, I guess.

Rewriting parents

A first simple attempt at rewriting parents simply copies the ideas of the rail algorithm from the layout: simply replace a commit with its parents when appropriate. In this case "when appropriate" is whenever a commit is filtered from the history. So for every commit from the the algorithm does the following.

If the commit changed the file, push it to the output history:

  if has_changes(commit) {                                                                                                                                                                                          
    internal_history.push(output);                                                                                                                                                                                                             
  }

with has_changes roughly being (vaguely Rust'ish pseudo code):

  if commit.parents.count() == 0 {                                                                                                                                                                                                                                  
    diff_tree_to_tree(None, commit.tree).contains(path)
  } else {                             
    commit.parents.any(|p| diff_tree_to_tree(p.tree, commit.tree)
  }

This basically tries to figure out whether the commit either contains the file if it's first commit in the history (the parents.count() == 0 branch) or whether the path is contained in the diff to any of the parents. Since Git stores basically full snapshots of the working directory with each commit (albeit with clever de-duplication), we need to calculate the diff on the fly and match it accordingly. Luckily git2-rs contains the code for the diffing and matching already, so we just need to do the glue code.

Rewriting the history is simply replacing all references to the filtered commit with references to its parents:

  if has_changes(commit) {                                                                                                                                                                                          
    internal_history.push(output);                                                                                                                                                                                                             
  } else {
    for entry in internal_history {
      entry.parents.remove(commit);
      entry.parents.push_all(commit.parents);
    }
  }

For the full code see the load_history and has_changes in Corylus's implementation.

After sucessful parent rewriting the layout algorithm produces things like this:

The history graph has one continuous line showing the changes to a file over time

Pruning redundant edges

The algorithm currently may create spurious empty branch lines in certain circumstances:

The history graph has an empty parallel branch with no commits on it

This happens whenever there's a merge commit with the file not being changed on one of the parallel branches. Since we cannot know whether a file will be changed on any of the side branches, we have to travel down all parallel lines and filter accordingly. These redundant paths obviously do not provide valuable information and should be removed from the layout.

Luckily there's an algorithm for that: transitive reduction. It basically behaves like a hiker enjoying the view: whenever it finds a long way around to a neighboring node in the graph, it throws out the direct edge. This can be implemented via a simple depth-first search (or any other path finding algorithm for that matter):

for commit in internal_history {
  for neighbor in commit.parents {
    if has_indirect_path(commit, neighbor) {
      commit.parents.remove(neighbor);
    }
  }
}

Again: for details see the actual implementation in Corylus.

While this does require a second pass over the history, it only does so for the remaining commits after filtering. Even for large repositories the subgraphs per file tend to be small, making this implementation efficient enough.

With all the filtering, rewriting and reduction in place, Corylus produces the correct ancestry graph for individual files (see also the correct layout of parallel branches with changes on both branches):

The history graph with no empty parallel branches

Conclusion

Using libgit2 and git2-rs to layout partial history graphs of a Git repository has a few challenges. While it is easy enough to filter unnecessary commits from the history as returned by a Revwalk, the libraries do not provide any help for simplifying the history and removing any filtered commits from the list of parents in commits retained. By simply replacing filtered commits with their respective parents throughout the retained history and some additional transitive reduction to remove any spurious empty side branches, Corylus is able to display a correct ancestry path for changes to single files.

Acknowledgements

libgit2 andgit2-rs do the heavy-lifting of accessing the Git repository data.

The contributers of this libgit2 issue for inspiration on how to filter the history data.

This Stack Overflow discussion for inspiration on pruning the graph of empty side branches.