Introduction
In the previous post, I explored the intricacies of Git objects, explaining their hierarchy and how Git handles them. Building on that foundation, this post takes a hands-on approach by implementing a miniature version of Git, aptly named "TinyGit", in Python. Note that this is a heavily simplified version of the actual Git, developed purely out of curiosity as a fun weekend project.
Project Structure
In this write-up I explain a simple implementation of Git's init, status, add and commit commands, using the command pattern.
To keep the write-up focused, I have omitted the implementation of certain utility functions. You can find them on my GitHub, or have your favourite AI model write them for you.
tinygit/
│
├── commands/
│   ├── tinygit_cmd.py    # base command class
│   └── ...               # tinygit commands
│
├── models/
│   └── ...               # tinygit data models
│
├── config.py             # tinygit config
├── tinygit.py            # command class controller
└── main.py               # entry point for the cli
Config
To store the history of the user's project, TinyGit needs to know exactly where to store its objects. The Config class holds just that: the requisite directory names, the index file path, and the objects path that our commands use to perform their designated actions.
import os

class Config:
    git_dir = ".tinygit"       # directory for storing git metadata
    index_file = "index"       # file for storing the staging area
    head_file = "HEAD"         # file for storing the current branch
    refs_dir = "refs"          # directory for storing branch references
    objects_dir = "objects"    # directory for storing git objects
    default_branch = "main"    # default branch name

    def __init__(self):
        # All paths are resolved against the directory the command is run from
        self.repo_path = os.path.join(os.getcwd(), self.git_dir)
        self.index_path = os.path.join(self.repo_path, self.index_file)
        self.head_path = os.path.join(self.repo_path, self.head_file)
        # The instance attributes below shadow the class-level directory names
        self.refs_dir = os.path.join(self.repo_path, self.refs_dir)
        self.objects_dir = os.path.join(self.repo_path, self.objects_dir)
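To make the path resolution concrete, here is a minimal sketch that mirrors the Config class and prints where each piece of metadata would live. Two things are assumptions added for illustration and are not part of TinyGit: the base directory is passed in as a parameter instead of being read from os.getcwd(), and the describe_layout helper is hypothetical.

```python
import os
import tempfile


class Config:
    """Trimmed copy of the Config class above, for illustration only."""
    git_dir = ".tinygit"
    index_file = "index"
    head_file = "HEAD"
    refs_dir = "refs"
    objects_dir = "objects"

    def __init__(self, base_dir: str):
        # Assumption: take the base directory as a parameter instead of
        # os.getcwd(), so the sketch does not depend on where it is run.
        self.repo_path = os.path.join(base_dir, self.git_dir)
        self.index_path = os.path.join(self.repo_path, self.index_file)
        self.head_path = os.path.join(self.repo_path, self.head_file)
        # These instance attributes shadow the class-level directory names.
        self.refs_dir = os.path.join(self.repo_path, self.refs_dir)
        self.objects_dir = os.path.join(self.repo_path, self.objects_dir)


def describe_layout(config: Config) -> dict[str, str]:
    """Hypothetical helper: map each setting to its path relative to the project."""
    base = os.path.dirname(config.repo_path)
    names = ("repo_path", "index_path", "head_path", "refs_dir", "objects_dir")
    return {name: os.path.relpath(getattr(config, name), base) for name in names}


with tempfile.TemporaryDirectory() as project_dir:
    layout = describe_layout(Config(project_dir))
    for name, path in layout.items():
        print(f"{name:12} -> {path}")
```

Every path hangs off repo_path, so moving the metadata directory only requires changing git_dir.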
The Index
The index, or staging area, is a temporary area that tracks the set of files to be committed to the database in the next commit. For this, I use a simple YAML file that holds the list of files to be committed.
import yaml  # PyYAML

class Index:
    """
    The data structure to store the state of the Index (Staging area)
    (format)
        files:
          - file1_path
          - file2_path
          - ...
    """
    def __init__(self, index_path: str):
        self.index_path = index_path
        self.files = []

    def load(self):
        with open(self.index_path, "r") as f:
            data = yaml.safe_load(f)
        # An empty index file parses to None
        self.files = data["files"] if data else []

    def add_file(self, path: str):  # add a new file to index
        if path not in self.files:
            self.files.append(path)

    def save(self):
        with open(self.index_path, "w") as f:
            yaml.dump({"files": self.files}, f)

    def clear(self):  # clear all index files (usually done after a commit)
        self.files = []
Running TinyGit
With the command pattern, we need an orchestrator that routes the user's request to the appropriate command class along with its inputs. The implementation below lets us set a command and then run it.
# tinygit.py
class TinyGit:
    def __init__(self):
        self.config = Config()
        self.command: TinyGitCmd | None = None

    def set_command(self, command: TinyGitCmd):
        self.command = command

    def run(self) -> str:
        if self.command is None:
            raise ValueError("No command set")
        return self.command.execute(self.config)

# main.py
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("command", type=str)
    parser.add_argument("-m", "--message", type=str, required=False)
    parser.add_argument("-f", "--file", type=str, required=False)
    args = parser.parse_args()

    tinygit = TinyGit()
    command = args.command
    # Map the command name to its command object
    if command == "init":
        tinygit.set_command(InitCmd())
    elif command == "status":
        tinygit.set_command(StatusCmd())
    elif command == "commit":
        if not args.message:
            raise ValueError("Message is required for commit command")
        tinygit.set_command(CommitCmd(message=args.message))
    elif command == "add":
        if not args.file:
            raise ValueError("File is required for add command")
        tinygit.set_command(AddCmd(path=args.file))
    # An unknown command leaves nothing set, so run() raises ValueError
    print(tinygit.run())

if __name__ == "__main__":
    main()
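The base command class is one of the pieces not shown in this write-up. A minimal sketch of what it might look like follows. The names TinyGitCmd, is_repository_initialized, _get_current_branch and UninitializedRepositoryError appear in the commands later in the post, but their bodies here are guesses. StubConfig and BranchCmd are hypothetical and exist only to exercise the sketch.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass


class UninitializedRepositoryError(Exception):
    """Raised when a command runs outside an initialized repository."""


@dataclass
class StubConfig:
    """Hypothetical stand-in for Config with only the fields used here."""
    repo_path: str
    head_path: str


class TinyGitCmd(ABC):
    """Assumed shape of the base command: one execute() per command."""

    @abstractmethod
    def execute(self, config) -> str:
        """Run the command and return the text to print."""

    def is_repository_initialized(self, config) -> bool:
        # Assumption: a repository exists once its metadata directory does.
        return os.path.isdir(config.repo_path)

    def _get_current_branch(self, config) -> str:
        # HEAD holds the bare branch name, as written by the init command.
        with open(config.head_path) as f:
            return f.read().strip()


class BranchCmd(TinyGitCmd):
    """Hypothetical command, only here to exercise the base class."""

    def execute(self, config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError
        return self._get_current_branch(config)


with tempfile.TemporaryDirectory() as project_dir:
    repo = os.path.join(project_dir, ".tinygit")
    config = StubConfig(repo_path=repo, head_path=os.path.join(repo, "HEAD"))
    os.makedirs(repo)
    with open(config.head_path, "w") as f:
        f.write("main")
    current_branch = BranchCmd().execute(config)
    print(current_branch)
```

Keeping the shared checks in the base class means each concrete command only has to implement execute().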
The Objects
For the commit command to store the changes to the files, it needs a set of objects to represent them. Below are the data models (namely tree, blob and commit) that TinyGit uses to represent the project's history.
Blob object
import os

class BlobObject(TinyGitObject):
    type = b'blob'

    def __init__(self, path: str):
        self.name = os.path.basename(path)
        with open(path, 'rb') as f:
            self.content = f.read()

    def get_data(self) -> str:
        # str() on bytes yields its repr (e.g. "b'hello'"), so the data stays text
        return str(self.content)
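The TinyGitObject base class, which provides save() and get_hash(), is among the utilities left out of this write-up. The sketch below is one plausible shape for it, not the actual implementation. It assumes the object ID is the SHA-1 of the type tag plus the data, and that each object is written as a flat file named by its hash. TextObject is a hypothetical subclass used only for the demonstration. Real Git differs: it hashes a header containing the type and size, compresses the contents with zlib, and shards objects into subdirectories.

```python
import hashlib
import os
import tempfile


class TinyGitObject:
    """Assumed base class: hashes its own data and stores it under that hash."""
    type = b"object"

    def get_data(self) -> str:
        raise NotImplementedError

    def get_hash(self) -> str:
        # Assumption: the object ID is the SHA-1 of the type tag plus the data.
        payload = self.type + b" " + self.get_data().encode()
        return hashlib.sha1(payload).hexdigest()

    def save(self, objects_dir: str) -> str:
        # Flat layout: one file per object, named by its hash.
        object_hash = self.get_hash()
        with open(os.path.join(objects_dir, object_hash), "w") as f:
            f.write(self.get_data())
        return object_hash


class TextObject(TinyGitObject):
    """Hypothetical subclass used only to exercise the base class."""
    type = b"blob"

    def __init__(self, text: str):
        self.text = text

    def get_data(self) -> str:
        return self.text


with tempfile.TemporaryDirectory() as objects_dir:
    first = TextObject("hello world\n").save(objects_dir)
    second = TextObject("hello world\n").save(objects_dir)
    stored = os.listdir(objects_dir)
    print(first, stored)
```

Because the file name is derived from the content, saving identical content twice produces a single stored object.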
Tree object
class TreeObject(TinyGitObject):
    type = b'tree'

    def __init__(self, graph: dict[str, list[str]], root: str = "", objects_dir: str = ""):
        self.name = root + "/"
        # Per-instance list: a class-level list would be shared by every tree,
        # so sub-trees would leak their entries into their parents.
        self.children: list[tuple[bytes, str, str]] = []
        for file_or_dir in graph[root]:
            if file_or_dir not in graph:  # only directories appear as keys
                blob = BlobObject(root + "/" + file_or_dir)
                blob.save(objects_dir)
                self.children.append((blob.type, blob.get_hash(), blob.name))
            else:
                tree = TreeObject(graph, file_or_dir, objects_dir)
                tree.save(objects_dir)
                self.children.append((tree.type, tree.get_hash(), tree.name))

    def get_data(self) -> str:
        # One "<type> <hash>\t<name>" line per child
        return "\n".join([
            f"{obj[0]} {obj[1]}\t{obj[2]}"
            for obj in self.children
        ])
Now, there is some coupling between CommitCmd and TreeObject, which is fine. Don't bother getting into the weeds of the __init__() method for now.
At a high level, just know that it takes a graph of the directory structure and recursively (using DFS) forms a tree object that looks something like this.
For a directory structure like this
/
├── file
└── foo/
    └── bar
The graph variable would look like
graph = {
    "/": ["file", "foo/"],
    "foo/": ["bar"]
}
And the tree object would look like
      ( / )
      /   \
     /     \
 (file)   (foo/)
              \
             (bar)
This tree object is now ready to be directly attached to a commit object.
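The recursive walk itself can be sketched without any object storage. The render_tree function below is hypothetical and not part of TinyGit. It follows the same rule as TreeObject (an entry that is also a key of the graph is a directory, anything else is a file), and it uses the key convention that _create_tree produces, with "." as the root and bare directory names.

```python
def render_tree(graph: dict[str, list[str]], root: str, depth: int = 0) -> list[str]:
    """Depth-first walk: directories are keys of the graph, files are not."""
    lines = ["  " * depth + root + "/"]
    for child in graph[root]:
        if child in graph:
            # A directory: recurse before moving on to the next sibling.
            lines.extend(render_tree(graph, child, depth + 1))
        else:
            # A file: this is where TreeObject would create and save a blob.
            lines.append("  " * (depth + 1) + child)
    return lines


graph = {
    ".": ["file", "foo"],
    "foo": ["bar"],
}
rendered = render_tree(graph, ".")
print("\n".join(rendered))
```

Each level of indentation in the printed result corresponds to one nested tree object.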
Commit object
class CommitObject(TinyGitObject):
    type = b'commit'

    def __init__(self, tree_hash: str, message: str, parent_hash: str | None = None):
        self.tree_hash = tree_hash
        self.message = message
        self.parent_hash = parent_hash

    def get_data(self) -> str:
        return f"tree {self.tree_hash}\n" + \
            (f"parent {self.parent_hash}\n" if self.parent_hash else "") + \
            "\n" + \
            self.message
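To see what a commit's payload looks like, here is a small standalone sketch that mirrors get_data(). The commit_payload function is hypothetical, and the hashes are placeholders rather than real object IDs.

```python
from typing import Optional


def commit_payload(tree_hash: str, message: str, parent_hash: Optional[str] = None) -> str:
    """Mirror of CommitObject.get_data(): headers, a blank line, then the message."""
    headers = [f"tree {tree_hash}"]
    if parent_hash:
        headers.append(f"parent {parent_hash}")
    return "\n".join(headers) + "\n\n" + message


# Placeholder hashes, not real object IDs.
root_commit = commit_payload("a" * 40, "initial commit")
child_commit = commit_payload("b" * 40, "second commit", parent_hash="c" * 40)
print(root_commit)
print("---")
print(child_commit)
```

The first commit in a repository has no parent line. Every later commit points back at its predecessor, and that chain of parent hashes forms the history.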
TinyGit commands
The Init Command
The init command is used to initialize a new git repository.
import os

class InitCmd(TinyGitCmd):
    def execute(self, config: Config) -> str:
        # exist_ok=True makes init safe to re-run on an existing repository
        os.makedirs(config.repo_path, exist_ok=True)
        os.makedirs(config.refs_dir, exist_ok=True)
        os.makedirs(config.objects_dir, exist_ok=True)
        if not os.path.exists(config.head_path):
            with open(config.head_path, "w") as f:
                f.write(config.default_branch)
        if not os.path.exists(config.index_path):
            with open(config.index_path, "w") as f:
                f.write("")
        return f"Initialized empty TinyGit repository in {config.repo_path}"
Running the command
$ mkdir myrepo && cd myrepo
$ python ../main.py init
# output
Initialized empty TinyGit repository in .tinygit
$ tree .tinygit/
# output
.tinygit/
├── HEAD
├── index
├── objects
└── refs
3 directories, 2 files
The init command creates a new directory called .tinygit in the current working directory. This directory contains the following entries:
- HEAD: A file that contains the current branch name.
- index: A file that stores the staging area.
- refs: A directory that contains the references to the commits.
- objects: A directory that contains the objects of the repository.
The Status Command
The status command is used to show the current state of the repository.
This implementation shows:
- the current branch
- the files in the staging area
- the files in the working directory that are not tracked by TinyGit
class StatusCmd(TinyGitCmd):
    def execute(self, config: Config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError

        current_branch = self._get_current_branch(config)
        index = Index(config.index_path)
        index.load()

        files_staged_for_commit = index.files
        files_in_working_directory = self._get_files_in_working_directory(config)
        # Anything in the working directory that is not in the index
        files_not_staged_for_commit = [
            file for file in files_in_working_directory if file not in files_staged_for_commit
        ]

        status = f"On branch {current_branch}\n"
        status += "Changes to be committed:\n"
        status += f" {', '.join(files_staged_for_commit)}\n"
        status += "Changes not staged for commit:\n"
        status += f" {', '.join(files_not_staged_for_commit)}\n"
        return status
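The _get_files_in_working_directory helper is not shown in this write-up. One way it could work is sketched below as a standalone function. The name list_working_files is hypothetical, and the sketch assumes the helper should return paths relative to the project root and skip TinyGit's own metadata directory.

```python
import os
import tempfile


def list_working_files(project_dir: str, git_dir: str = ".tinygit") -> list[str]:
    """Return project files relative to project_dir, skipping TinyGit metadata."""
    found = []
    for root, dirs, files in os.walk(project_dir):
        # Pruning dirs in place stops os.walk from descending into metadata.
        dirs[:] = [d for d in dirs if d != git_dir]
        for name in files:
            full_path = os.path.join(root, name)
            # Use forward slashes so paths match the "/" split done at commit time.
            found.append(os.path.relpath(full_path, project_dir).replace(os.sep, "/"))
    return sorted(found)


with tempfile.TemporaryDirectory() as project_dir:
    os.makedirs(os.path.join(project_dir, ".tinygit", "objects"))
    os.makedirs(os.path.join(project_dir, "foo"))
    for relative in ("readme.md", "foo/bar", ".tinygit/HEAD"):
        with open(os.path.join(project_dir, relative), "w") as f:
            f.write("x")
    working_files = list_working_files(project_dir)
    staged = ["readme.md"]
    not_staged = [path for path in working_files if path not in staged]
    print(working_files, not_staged)
```

The final list comprehension is the same filter the status command applies to separate staged files from the rest.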
Running the command
$ echo "hello world" > readme.md
$ python ../main.py status
# output
On branch main
Changes to be committed:
Changes not staged for commit:
readme.md
The Add Command
The add command is used to add a set of changes to the Index (Staging area). When you are confident with your changes in your working directory, you can move them to the staging area, telling git that you want it to include them in your next commit.
This implementation:
- adds files to the index
- updates the index file with the new changes
import os

class AddCmd(TinyGitCmd):
    def __init__(self, path: str):
        self.path = path

    def execute(self, config: Config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError
        if not os.path.exists(self.path):
            raise FileNotFoundError(f"File {self.path} not found")

        index = Index(config.index_path)
        index.load()
        if os.path.isdir(self.path):
            # Stage every file under the directory, recursively
            for root, _, files in os.walk(self.path):
                for file in files:
                    index.add_file(os.path.join(root, file))
        else:
            index.add_file(self.path)
        index.save()
        return f"Added {self.path} to the staging area"
Running the command
$ python ../main.py add -f readme.md
# output
Added readme.md to the staging area
$ python ../main.py status
# output
On branch main
Changes to be committed:
readme.md
Changes not staged for commit:
$ cat .tinygit/index
# output
files:
- readme.md
The Commit Command
The commit command tells git to take the list of changes currently in the staging area, and store them to git's database along with a terse description of the changes provided by the user.
This implementation:
- creates the tree object
- creates the commit object
- updates the head to point to the new commit
- clears the staging area
import os

class CommitCmd(TinyGitCmd):
    def __init__(self, message: str):
        self.message = message

    def execute(self, config: Config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError

        index = Index(config.index_path)
        index.load()
        if not index.files:
            raise NoChangesToCommitError

        num_files = len(index.files)
        tree_hash = self._create_tree(index, config)
        commit_hash = self._create_commit(tree_hash, self.message, config)
        self._update_head(commit_hash, config)
        # The staged files are now committed, so empty the staging area
        index.clear()
        index.save()
        return f"[{commit_hash}] {self.message}" + \
            f"\n{num_files} files changed"

    def _create_commit(self, tree_hash: str, message: str, config: Config) -> str:
        # The current commit (if any) becomes the parent of the new one
        parent_commit_hash = self._get_current_commit_hash(config)
        commit = CommitObject(tree_hash, message, parent_commit_hash)
        return commit.save(config.objects_dir)

    def _update_head(self, commit_hash: str, config: Config) -> None:
        current_branch = self._get_current_branch(config)
        # Point the branch reference at the new commit
        with open(os.path.join(config.refs_dir, current_branch), "w") as f:
            f.write(commit_hash)
        with open(config.head_path, "w") as f:
            f.write(current_branch)
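Creating the tree object.
from collections import defaultdict

class CommitCmd(TinyGitCmd):
    # ... methods shown above ...

    def _create_tree(self, index: Index, config: Config) -> str:
        """
        Builds a graph that maps each directory to its direct children,
        e.g. for the staged files "file" and "foo/bar":
        graph = {
            ".": ["file", "foo"],
            "foo": ["bar"]
        }
        and converts it into a tree object.
        """
        files_to_commit = [(".", file) for file in index.files]
        graph = defaultdict(list)
        while files_to_commit:
            parent, file = files_to_commit.pop()
            parts = file.split("/", 1)
            if len(parts) > 1:
                # Key sub-directories by their full path so that deeply nested
                # files resolve correctly when TreeObject opens them.
                child = parts[0] if parent == "." else parent + "/" + parts[0]
                files_to_commit.append((child, parts[1]))
                if child not in graph[parent]:  # list each directory only once
                    graph[parent].append(child)
            else:
                graph[parent].append(parts[0])
        tree = TreeObject(graph, root=".", objects_dir=config.objects_dir)
        return tree.save(config.objects_dir)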
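The graph-building step is easier to follow in isolation. The standalone sketch below uses a hypothetical build_graph function, which is a variant of the loop in _create_tree rather than a copy of it. It walks each path from left to right, keys nested directories by their full relative path, and lists each directory only once.

```python
from collections import defaultdict


def build_graph(paths: list[str]) -> dict[str, list[str]]:
    """Map each directory to its direct children; "." is the project root."""
    graph: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        parent = "."
        *directories, filename = path.split("/")
        for directory in directories:
            # Key nested directories by their full relative path.
            child = directory if parent == "." else f"{parent}/{directory}"
            if child not in graph[parent]:
                graph[parent].append(child)
            parent = child
        graph[parent].append(filename)
    return dict(graph)


graph = build_graph(["readme.md", "foo/bar", "foo/baz/qux"])
for directory, children in graph.items():
    print(directory, "->", children)
```

Every key is a directory and every value that is not also a key is a file, which is exactly the distinction TreeObject relies on when it decides between creating a blob and recursing.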
Running the command
$ python ../main.py commit -m "initial commit"
# output
[7d88a64fa79aa10a4231070b6d28485d5dd2306d] initial commit
1 files changed
$ tree .tinygit/
# output
.tinygit/
├── HEAD
├── index
├── objects
│ ├── 7d88a64fa79aa10a4231070b6d28485d5dd2306d
│ ├── aecb3ab7682122917c875d74bde7335654918294
│ └── b3a1c367baa5bfaef9286b27eb59d969821a9373
└── refs
└── main
3 directories, 6 files
The commit command created 3 new objects in the database:
- A tree object
- A commit object
- A blob object
$ cat .tinygit/HEAD
# output
main
$ cat .tinygit/refs/main
# output
7d88a64fa79aa10a4231070b6d28485d5dd2306d
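The two files above form a small chain of indirection: HEAD names a branch, and the branch file under refs names a commit. A minimal sketch of following that chain is shown below. The resolve_head function is hypothetical, though the omitted _get_current_commit_hash helper presumably does something similar.

```python
import os
import tempfile
from typing import Optional


def resolve_head(repo_path: str) -> Optional[str]:
    """Follow HEAD -> refs/<branch> and return the commit hash, if any."""
    with open(os.path.join(repo_path, "HEAD")) as f:
        branch = f.read().strip()
    ref_path = os.path.join(repo_path, "refs", branch)
    if not os.path.exists(ref_path):
        # A fresh repository has a branch name in HEAD but no commits yet.
        return None
    with open(ref_path) as f:
        return f.read().strip()


with tempfile.TemporaryDirectory() as project_dir:
    repo_path = os.path.join(project_dir, ".tinygit")
    os.makedirs(os.path.join(repo_path, "refs"))
    with open(os.path.join(repo_path, "HEAD"), "w") as f:
        f.write("main")
    before_commit = resolve_head(repo_path)

    # Simulate what the commit command does when it updates the branch ref.
    with open(os.path.join(repo_path, "refs", "main"), "w") as f:
        f.write("7d88a64fa79aa10a4231070b6d28485d5dd2306d")
    after_commit = resolve_head(repo_path)
    print(before_commit, after_commit)
```

Returning None for a branch without a ref file is what lets the very first commit be created without a parent.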
Conclusion
That was it for this post. I hope you enjoyed reading it as much as I enjoyed building it.
The code for this blog can be found here.