From graphs to Git
Table of Contents
Introduction
This is an introduction to Git from a graph theory point of view. In my view, most introductions to Git focus on the actual commands or on Git internals. In my day-to-day work, I realized that I consistently rely on an internal model of the repository as a directed acyclic graph. I also tend to use this point of view when explaining my workflow to other people, with some success. This is definitely not original, many people have said the same thing, to the point that it is a running joke. However, I have not seen a comprehensive introduction to Git from this point of view, without clutter from the Git command line itself.
How to actually use the command line is not the topic of this article,
you can refer to the man pages or the excellent Pro Git bookSee
“Further reading” below.
. I will reference the relevant Git commands
as margin notes.
Concepts: understanding the graph
Repository
The basic object in Git is the commit. It is constituted of three
things: a set of parent commits (at least one, except for the initial
commit), a diff representing changes (some lines are removed, some are
added), and a commit message. It also has a nameActually, each commit gets a SHA-1 hash that identifies it
uniquely. The hash is computed from the parents, the messages, and the
diff.
, so that we
can refer to it if needed.
A repository is fundamentally just a directed acyclic graph
(DAG)You can visualize the graph of a repo, or just a subset
of it, using git log
.
, where nodes are commits and links are parent-child
relationships. A DAG means that two essential properties are verified
at all time by the graph:
- it is oriented, and the direction always go from parent to child,
- it is acyclic, otherwise a commit could end up being an ancestor of itself.
As you can see, these make perfect sense in the context of a version-tracking system.
Here is an example of a repo:
In this representation, each commit points to its
childrenIn the actual implementation, the edges are the
other way around: each commit points to its parents. But I feel like
it is clearer to visualize the graph ordered with time.
, and they were organized from left to right
as in a timeline. The initial commit is the first one, the root of
the graph, on the far left.
Note that a commit can have multiple children, and multiple parents (we’ll come back to these specific commits later).
The entirety of Git operations can be understood in terms of manipulations of the graph. In the following sections, we’ll list the different actions we can take to modify the graph.
Naming things: branches and tags
Some commits can be annotated: they can have a named label attached to them, that reference a specific commit.
For instance, HEAD
references the current commit: your current
position in the graphMove around the graph (i.e. move the HEAD
pointer), using git checkout
. You can give it commit hashes, branch
names, tag names, or relative positions like HEAD~3
for the
great-grandparent of the current commit.
. This is just a convenient name for
the current commit.Much like how .
is a shorthand for the
current directory when you’re navigating the filesystem.
Branches are other labels like this. Each of them has a name and acts a simple pointer to a commit. Once again, this is simply an alias, in order to have meaningful names when navigating the graph.
In this example, we have three branches: master
, feature
, and
bugfix
Do not name your real branches like this! Find a
meaningful name describing what changes you are making.
. Note that
there is nothing special about the names: we can use any name we want,
and the master
branch is not special in any way.
TagsCreate branches and tags with the appropriately
named git branch
and git tag
.
are another kind of label, once again pointing to a particular
commit. The main difference with branches is that branches may move
(you can change the commit they point to if you want), whereas tags
are fixed forever.
Making changes: creating new commits
When you make some changes in your files, you will then record them in
the repo by committing themTo the surprise of absolutely no one, this is done
with git commit
.
. The action creates a new
commit, whose parent will be the current commit. For instance, in the
previous case where you were on master
, the new repo after
committing will be (the new commit is in green):
Two significant things happened here:
- Your position on the graph changed:
HEAD
points to the new commit you just created. - More importantly:
master
moved as well. This is the main property of branches: instead of being “dumb” labels pointing to commits, they will automatically move when you add new commits on top of them. (Note that this won’t be the case with tags, which always point to the same commit no matter what.)
If you can add commits, you can also remove them (if they don’t have
any children, obviously). However, it is often better to add a commit
that will revertCreate a revert commit with git revert
, and remove a
commit with git reset
(destructive!).
the changes of another commit
(i.e. apply the opposite changes). This way, you keep track of what’s
been done to the repository structure, and you do not lose the
reverted changes (should you need to re-apply them in the future).
Merging
There is a special type of commits: merge commits, which have more
than one parent (for example, the fifth commit from the left in the
graph above).As can be expected, the command is git merge
.
At this point, we need to talk about conflicts.See Pro Git’s chapter on merging and basic conflict resolution for the details on managing conflicts in practice.
Until now, every action was simple: we can move around, add names, and
add some changes. But now we are trying to reconcile two different
versions into a single one. These two versions can be incompatible,
and in this case the merge commit will have to choose which lines of
each version to keep. If however, there is no conflict, the merge
commit will be empty: it will have two parents, but will not contain
any changes itself.
Moving commits: rebasing and squashing
Until now, all the actions we’ve seen were append-only. We were only adding stuff, and it would be easy to just remove a node from the graph, and to move the various labels accordingly, to return to the previous state.
Sometimes we want to do more complex manipulation of the graph: moving
a commit and all its descendants to another location in the
graph. This is called a rebase.That you can perform
with git rebase
(destructive!).
In this case, we moved the branch feature
from its old position (in
red) to a new one on top of master
(in green).
When I say “move the branch feature
”, I actually mean something
slightly different than before. Here, we don’t just move the label
feature
, but also the entire chain of commits starting from the one
pointed by feature
up to the common ancestor of feature
and its
base branch (here master
).
In practice, what we have done is deleted three commits, and added three brand new commits. Git actually helps us here by creating commits with the same changes. Sometimes, it is not possible to apply the same changes exactly because the original version is not the same. For instance, if one of the commits changed a line that no longer exist in the new base, there will be a conflict. When rebasing, you may have to manually resolve these conflicts, similarly to a merge.
It is often interesting to rebase before merging, because then we can
avoid merge commits entirely. Since feature
has been rebased on top
of master
, when merging feature
onto master
, we can just
fast-forward master
, in effect just moving the master
label
where feature
is:You can control whether or not git merge
does a
fast-forward with the --ff-only
and --no-ff
flags.
Another manipulation that we can do on the graph is squashing,
i.e. lumping several commits together in a single one.{-}
Use git squash
(destructive!).
Here, the three commits of the feature
branch have been condensed
into a single one. No conflict can happen, but we lose the history of
the changes. Squashing may be useful to clean up a complex history.
Squashing and rebasing, taken together, can be extremely powerful tools to entirely rewrite the history of a repo. With them, you can reorder commits, squash them together, moving them elsewhere, and so on. However, these commands are also extremely dangerous: since you overwrite the history, there is a lot of potential for conflicts and general mistakes. By contrast, merges are completely safe: even if there are conflicts and you have messed them up, you can always remove the merge commit and go back to the previous state. But when you rebase a set of commits and mess up the conflict resolution, there is no going back: the history has been lost forever, and you generally cannot recover the original state of the repository.
Remotes: sharing your work with others
You can use Git as a simple version tracking system for your own projects, on your own computer. But most of the time, Git is used to collaborate with other people. For this reason, Git has an elaborate system for sharing changes with others. The good news is: everything is still represented in the graph! There is nothing fundamentally different to understand.
When two different people work on the same project, each will have a version of the repository locally. Let’s say that Alice and Bob are both working on our project.
Alice has made a significant improvement to the project, and has
created several commits, that are tracked in the feature
branch she
has created locally. The graph above (after rebasing) represents
Alice’s repository. Bob, meanwhile, has the same repository but
without the feature
branch. How can they share their work? Alice can
send the commits from feature
to the common ancestor of master
and
feature
to Bob. Bob will see this branch as part of a remote
graph, that will be superimposed on his graph: You can add, remove, rename, and generally manage
remotes with git remote
. To transfer data between you and a remote,
use git fetch
, git pull
(which fetches and merges in your local
branch automatically), and git push
.
The branch name he just got from Alice is prefixed by the name of the
remote, in this case alice
. These are just ordinary commits, and an
ordinary branch (i.e. just a label on a specific commit).
Now Bob can see Alice’s work, and has some idea to improve on it. So
he wants to make a new commit on top of Alice’s changes. But the
alice/feature
branch is here to track the state of Alice’s
repository, so he just creates a new branch just for him named
feature
, where he adds a commit:
Similarly, Alice can now retrieve Bob’s work, and will have a new
branch bob/feature
with the additional commit. If she wants, she can
now incorporate the new commit to her own branch feature
, making her
branches feature
and bob/feature
identical:
As you can see, sharing work in Git is just a matter of having additional branches that represent the graph of other people. Some branches are shared among different people, and in this case you will have several branches, each prefixed with the name of the remote. Everything is still represented simply in a single graph.
Additional concepts
Unfortunately, some things are not captured in the graph directly. Most notably, the staging area used for selecting changes for committing, stashing, and submodules greatly extend the capabilities of Git beyond simple graph manipulations. You can read about all of these in Pro Git.
Internals
Note: This section is not needed to use Git every day, or even to understand the concepts behind it. However, it can quickly show you that the explanations above are not pure abstractions, and are actually represented directly this way.
Let’s dive a little bit into Git’s internal representations to better
understand the concepts. The entire Git repository is contained in a
.git
folder.
Inside the .git
folder, you will find a simple text file called
HEAD
, which contains a reference to a location in the graph. For
instance, it could contain ref: refs/heads/master
. As you can see,
HEAD
really is just a pointer, to somewhere called
refs/heads/master
. Let’s look into the refs
directory to
investigate:
$ cat refs/heads/master
f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
This is just a pointer to a specific commit! You can also see that all
the other branches are represented the same way.You must have
noticed that our graphs above were slightly misleading: HEAD
does
not point directly to a commit, but to a branch, which itself points
to a commit. If you make HEAD
point to a commit directly, this is
called a “detached HEAD” state.
Remotes and tags are similar: they are in refs/remotes
and
refs/tags
.
Commits are stored in the objects
directory, in subfolders named
after the first two characters of their hashes. So the commit above is
located at objects/f1/9bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
. They
are usually in a binary format (for efficiency reasons) called
packfiles. But if you inspect it (with git show
), you will see the
entire contents (parents, message, diff).
Further reading
To know more about Git, specifically how to use it in practice, I recommend going through the excellent Pro Git book, which covers everything there is to know about the various Git commands and workflows.
The Git man pages (also available via man
on your system) have a
reputation of being hard to read, but once you have understood the
concepts behind repos, commits, branches, and remotes, they provide an
invaluable resource to exploit all the power of the command line
interface and the various commands and options.Of course,
you could also use the awesome Magit in Emacs, which will greatly
facilitate your interactions with Git with the additional benefit of
helping you discover Git’s capabilities.
Finally, if you are interested in the implementation details of Git, you can follow Write yourself a Git and implement Git yourself! (This is surprisingly quite straightforward, and you will end up with a much better understanding of what’s going on.) The chapter on Git in Brown and Wilson (2012) is also excellent.