vgnshiyer's blog

Introduction

In the previous post, I explored the intricacies of Git objects, explaining their hierarchy and how Git handles them. Building on that foundation, this post takes a hands-on approach by implementing a miniature version of Git in Python, aptly named "TinyGit". Note that this is a heavily simplified version of the real Git, built purely out of curiosity as a fun weekend project.

Project Structure

In this write-up I explain a simple implementation of the init, status, add and commit commands of git, using the command pattern.

For the sake of brevity, I have omitted the implementation of certain utility functions from this write-up to keep it focused. You can find them on my GitHub, or have your favourite AI model write them for you.

tinygit/
│
├── commands/
│   ├── tinygit_cmd.py  # base command class
│   ├── ...             # tinygit commands
│
├── models/
│   ├── ...             # tinygit data models
│
├── config.py           # tinygit config
├── tinygit.py          # command class controller
└── main.py             # entry point for the cli

Config

To store the history of the user's project, TinyGit needs to know exactly where to store its objects. The Config class holds just that: the requisite directory names, the index file path, and the objects path that the commands use to perform their designated actions.

import os


class Config:
    git_dir = ".tinygit"  # directory for storing git metadata
    index_file = "index"  # file for storing the staging area
    head_file = "HEAD"  # file for storing the current branch
    refs_dir = "refs"  # directory for storing branch references
    objects_dir = "objects"  # directory for storing git objects
    default_branch = "main"  # default branch name

    def __init__(self):
        self.repo_path = os.path.join(os.getcwd(), self.git_dir)
        self.index_path = os.path.join(self.repo_path, self.index_file)
        self.head_path = os.path.join(self.repo_path, self.head_file)
        self.refs_dir = os.path.join(self.repo_path, self.refs_dir)
        self.objects_dir = os.path.join(self.repo_path, self.objects_dir)

The Index

The index, or staging area, is a temporary area that tracks the set of files to be written to the database in the next commit. For this, I use a simple YAML file that holds the list of files to be committed.

import yaml  # provided by the PyYAML package


class Index:
    """
    The data structure to store the state of the Index (Staging area)

    (format)
    files:
      - file1_path
      - file2_path
      - ...
    """

    def __init__(self, index_path: str):
        self.index_path = index_path
        self.files = []

    def load(self):
        with open(self.index_path, "r") as f:
            data = yaml.safe_load(f)
            self.files = data["files"] if data else []

    def add_file(self, path: str):  # add a new file to index
        if path not in self.files:
            self.files.append(path)

    def save(self):
        with open(self.index_path, "w") as f:
            yaml.dump({"files": self.files}, f)

    def clear(self):  # clear all index files (Usually done after a commit)
        self.files = []

Running TinyGit

With the command pattern, we need an orchestrator that routes the user's request to the appropriate command class along with the appropriate inputs. The implementation below lets us set a command and then run it.

class TinyGit:
    def __init__(self):
        self.config = Config()
        self.command: TinyGitCmd | None = None

    def set_command(self, command: TinyGitCmd):
        self.command = command

    def run(self) -> str:
        if self.command is None:
            raise ValueError("No command set")
        return self.command.execute(self.config)

import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("command", type=str)
    parser.add_argument("-m", "--message", type=str, required=False)
    parser.add_argument("-f", "--file", type=str, required=False)
    args = parser.parse_args()

    tinygit = TinyGit()
    command = args.command
    if command == "init":
        tinygit.set_command(InitCmd())
    elif command == "status":
        tinygit.set_command(StatusCmd())
    elif command == "commit":
        if not args.message:
            raise ValueError("Message is required for commit command")
        tinygit.set_command(CommitCmd(message=args.message))
    elif command == "add":
        if not args.file:
            raise ValueError("File is required for add command")
        tinygit.set_command(AddCmd(path=args.file))
    else:
        # Fail early with a usage message instead of "No command set"
        parser.error(f"Unknown command: {command}")
    print(tinygit.run())


if __name__ == "__main__":
    main()
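The base class TinyGitCmd is not shown in this post. Below is a minimal sketch of what it might look like, assuming it only needs an abstract execute() method and a repository check. The EchoCmd class and the SimpleNamespace stand-in for Config are hypothetical and exist only to demonstrate the dispatch; the real helper may check more than the presence of the metadata directory.

```python
import os
from abc import ABC, abstractmethod
from types import SimpleNamespace


class UninitializedRepositoryError(Exception):
    """Raised when a command runs outside an initialized repository."""


class TinyGitCmd(ABC):
    """Hypothetical base class: every command implements execute()."""

    @abstractmethod
    def execute(self, config) -> str:
        ...

    def is_repository_initialized(self, config) -> bool:
        # Assumption: a repository exists once its metadata directory does.
        return os.path.isdir(config.repo_path)


class EchoCmd(TinyGitCmd):
    """Toy command used only to demonstrate the dispatch."""

    def execute(self, config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError
        return f"repository at {config.repo_path}"


# Stand-in for Config, pointing at a directory that is guaranteed to exist.
config = SimpleNamespace(repo_path=os.getcwd())
print(EchoCmd().execute(config))
```

Because execute() is abstract, a subclass that forgets to implement it cannot be instantiated, which catches a half-written command before it is ever routed to.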

The Objects

For the commit command to store the changes to the files, it needs a set of objects to represent them. Below are the data models (namely blob, tree and commit) that TinyGit uses to represent the project's history.
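All three models inherit from TinyGitObject, which is one of the pieces omitted from this write-up. Here is a rough sketch of such a base class, assuming it hashes a Git-style "type size\0data" payload with SHA-1 and writes each object to a flat file named after its hash. The framing, the lack of compression and the TextObject subclass are assumptions made for illustration, not necessarily what the real code does.

```python
import hashlib
import os
import tempfile


class TinyGitObject:
    """Hypothetical base class for blob, tree and commit objects."""

    type = b"object"

    def get_data(self) -> str:
        raise NotImplementedError

    def _payload(self) -> bytes:
        # Git-style framing: "<type> <size>\0<data>".
        data = self.get_data().encode("utf-8")
        return self.type + b" " + str(len(data)).encode() + b"\x00" + data

    def get_hash(self) -> str:
        return hashlib.sha1(self._payload()).hexdigest()

    def save(self, objects_dir: str) -> str:
        # Flat layout: one file per object, named after its hash.
        object_hash = self.get_hash()
        with open(os.path.join(objects_dir, object_hash), "wb") as f:
            f.write(self._payload())
        return object_hash


class TextObject(TinyGitObject):
    """Toy subclass used only for this demonstration."""

    type = b"blob"

    def __init__(self, text: str):
        self.text = text

    def get_data(self) -> str:
        return self.text


with tempfile.TemporaryDirectory() as objects_dir:
    saved_hash = TextObject("hello world\n").save(objects_dir)
    stored_files = os.listdir(objects_dir)
    print(saved_hash, stored_files)
```

Since the file name is derived from the content, saving the same content twice simply overwrites the same file, which is what makes the object store content-addressable.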

Blob object

class BlobObject(TinyGitObject):
    type = b'blob'

    def __init__(self, path: str):
        self.name = os.path.basename(path)
        with open(path, 'rb') as f:
            self.content = f.read()

    def get_data(self) -> str:
        # str() on bytes yields the repr (the b'...' form), not the decoded text
        return str(self.content)

Tree object

class TreeObject(TinyGitObject):
    type = b'tree'

    def __init__(self, graph: dict[str, list[str]], root: str = "", objects_dir: str = ""):
        self.name = root + "/"
        # Per-instance list: a class-level list would be shared by every tree
        self.children: list[tuple[bytes, str, str]] = []
        for file_or_dir in graph[root]:
            if file_or_dir not in graph:  # keys are only dir
                blob = BlobObject(root + "/" + file_or_dir)
                blob.save(objects_dir)
                self.children.append((blob.type, blob.get_hash(), blob.name))
            else:
                tree = TreeObject(graph, file_or_dir, objects_dir)
                tree.save(objects_dir)
                self.children.append((tree.type, tree.get_hash(), tree.name))

    def get_data(self) -> str:
        # One "<type> <hash>\t<name>" line per child; decode the bytes type
        # so the line reads "tree ..." rather than "b'tree' ..."
        return "\n".join([
            f"{obj[0].decode()} {obj[1]}\t{obj[2]}"
            for obj in self.children
        ])

There is some coupling between CommitCmd and TreeObject here, which is fine for a project of this size. Don't worry about the details of the __init__() method for now.

At a high level, it takes a graph of the directory structure and recursively (using depth-first search) builds a tree object that looks something like this.

For a directory structure like this

/
├── file
├── foo/
    ├── bar

The graph variable would look like

graph = {
  "/": ["file", "foo/"],
  "foo/": ["bar"]
}

And the tree object would look like

        ( / )
        /   \
       /     \
  (file)    (foo/)
               \
              (bar)

This tree object is now ready to be directly attached to a commit object.
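To make the traversal concrete, here is a small standalone sketch of the same depth-first walk over the example graph. The render_tree function is a hypothetical helper written for this illustration: it only produces an indented listing and does not create any objects.

```python
def render_tree(graph: dict[str, list[str]], root: str, depth: int = 0) -> list[str]:
    """Depth-first walk: a directory is any entry that is also a key of graph."""
    lines = ["  " * depth + root]
    for entry in graph[root]:
        if entry in graph:
            # Directory: recurse before moving on to the next sibling.
            lines.extend(render_tree(graph, entry, depth + 1))
        else:
            # File: a leaf of the tree.
            lines.append("  " * (depth + 1) + entry)
    return lines


graph = {
    "/": ["file", "foo/"],
    "foo/": ["bar"],
}
tree_lines = render_tree(graph, "/")
print("\n".join(tree_lines))
```

TreeObject follows the same shape, except that each leaf becomes a blob and each recursive call becomes a saved subtree.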

Commit object

class CommitObject(TinyGitObject):
    type = b'commit'

    def __init__(self, tree_hash: str, message: str, parent_hash: str | None = None):
        self.tree_hash = tree_hash
        self.message = message
        self.parent_hash = parent_hash

    def get_data(self) -> str:
        return f"tree {self.tree_hash}\n" + \
            (f"parent {self.parent_hash}\n" if self.parent_hash else "") + \
            "\n" + \
            self.message
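As a quick illustration of the payload this produces, the sketch below mirrors get_data() as a free function and builds a first commit and a follow-up commit. The commit_payload name and the short hashes are placeholders chosen for readability, not real object hashes.

```python
from typing import Optional


def commit_payload(tree_hash: str, message: str, parent_hash: Optional[str] = None) -> str:
    """Mirror of CommitObject.get_data() as a free function."""
    header = f"tree {tree_hash}\n"
    if parent_hash:
        # Only commits after the first one carry a parent line.
        header += f"parent {parent_hash}\n"
    return header + "\n" + message


first = commit_payload("aaaa", "initial commit")
second = commit_payload("bbbb", "second commit", parent_hash="1111")
print(first)
print("---")
print(second)
```

The parent line is what chains commits together into a history: following it repeatedly from the latest commit walks back to the first one.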

TinyGit commands

The Init Command

The init command is used to initialize a new git repository.

class InitCmd(TinyGitCmd):
    def execute(self, config: Config) -> str:
        os.makedirs(config.repo_path, exist_ok=True)
        os.makedirs(config.refs_dir, exist_ok=True)
        os.makedirs(config.objects_dir, exist_ok=True)

        if not os.path.exists(config.head_path):
            with open(config.head_path, "w") as f:
                f.write(config.default_branch)

        if not os.path.exists(config.index_path):
            with open(config.index_path, "w") as f:
                f.write("")

        return "Initialized empty TinyGit repository in {}".format(config.repo_path)

Running the command

$ mkdir myrepo && cd myrepo
$ python ../main.py init

# output
Initialized empty TinyGit repository in /path/to/myrepo/.tinygit

$ tree .tinygit/

# output
.tinygit/
├── HEAD
├── index
├── objects
└── refs

3 directories, 2 files

The init command creates a new directory called .tinygit in the current working directory. This directory contains the following entries:

HEAD: A file that contains the current branch name.
index: A file that stores the staging area (empty at this point).
refs: A directory that contains the references to the commits.
objects: A directory that contains the objects of the repository.

The Status Command

The status command is used to show the current state of the repository.

This implementation shows:

the current branch
the files in the staging area
the files in the working directory that are not tracked by TinyGit

class StatusCmd(TinyGitCmd):
    def execute(self, config: Config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError

        current_branch = self._get_current_branch(config)

        index = Index(config.index_path)
        index.load()

        files_staged_for_commit = index.files
        files_in_working_directory = self._get_files_in_working_directory(config)
        files_not_staged_for_commit = [
            file for file in files_in_working_directory if file not in files_staged_for_commit
        ]

        status = f"On branch {current_branch}\n"

        status += "Changes to be committed:\n"
        status += f"        {', '.join(files_staged_for_commit)}\n"

        status += "Changes not staged for commit:\n"
        status += f"        {', '.join(files_not_staged_for_commit)}\n"

        return status

Running the command

$ echo "hello world" > readme.md
$ python ../main.py status

# output
On branch main
Changes to be committed:
    
Changes not staged for commit:
        readme.md
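The _get_files_in_working_directory helper is one of the omitted utilities. A minimal sketch of how it could work is shown below, assuming it walks the working directory and skips the .tinygit metadata directory. The list_working_files name, its parameters and the sorted return value are my own choices for this illustration.

```python
import os
import tempfile


def list_working_files(work_dir: str, git_dir: str = ".tinygit") -> list[str]:
    """Return paths relative to work_dir, skipping the metadata directory."""
    found = []
    for root, dirs, files in os.walk(work_dir):
        # Pruning dirs in place stops os.walk from descending into git_dir.
        dirs[:] = [d for d in dirs if d != git_dir]
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), work_dir))
    return sorted(found)


with tempfile.TemporaryDirectory() as work_dir:
    os.makedirs(os.path.join(work_dir, ".tinygit"))
    os.makedirs(os.path.join(work_dir, "foo"))
    for rel in ("readme.md", os.path.join("foo", "bar"), os.path.join(".tinygit", "index")):
        with open(os.path.join(work_dir, rel), "w") as f:
            f.write("")
    working_files = list_working_files(work_dir)
    print(working_files)
```

Returning paths relative to the working directory matters here, because the status command compares them directly against the paths stored in the index.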

The Add Command

The add command is used to add a set of changes to the index (staging area). When you are confident in the changes in your working directory, you can move them to the staging area, telling Git that you want them included in your next commit.

This implementation:

adds files to the index
updates the index file with the new changes

class AddCmd(TinyGitCmd):
    def __init__(self, path: str):
        self.path = path

    def execute(self, config: Config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError

        if not os.path.exists(self.path):
            raise FileNotFoundError(f"File {self.path} not found")

        index = Index(config.index_path)
        index.load()

        if os.path.isdir(self.path):
            for root, _, files in os.walk(self.path):
                for file in files:
                    index.add_file(os.path.join(root, file))
        else:
            index.add_file(self.path)
        index.save()

        return f"Added {self.path} to the staging area"

Running the command

$ python ../main.py add -f readme.md

# output
Added readme.md to the staging area

$ python ../main.py status

# output
On branch main
Changes to be committed:
        readme.md
Changes not staged for commit:

$ cat .tinygit/index

# output
files:
  - readme.md

The Commit Command

The commit command tells Git to take the list of changes currently in the staging area and store them in Git's database, along with a terse description of the changes provided by the user.

This implementation:

creates the tree object
creates the commit object
updates the head to point to the new commit
clears the staging area

class CommitCmd(TinyGitCmd):
    def __init__(self, message: str):
        self.message = message

    def execute(self, config: Config) -> str:
        if not self.is_repository_initialized(config):
            raise UninitializedRepositoryError

        index = Index(config.index_path)
        index.load()

        if not index.files:
            raise NoChangesToCommitError

        num_files = len(index.files)
        tree_hash = self._create_tree(index, config)
        commit_hash = self._create_commit(tree_hash, self.message, config)

        self._update_head(commit_hash, config)

        index.clear()
        index.save()
        return f"[{commit_hash}] {self.message}" + \
            f"\n{num_files} files changed"

    def _create_commit(self, tree_hash: str, message: str, config: Config) -> str:
        parent_commit_hash = self._get_current_commit_hash(config)
        commit = CommitObject(tree_hash, message, parent_commit_hash)
        return commit.save(config.objects_dir)

    def _update_head(self, commit_hash: str, config: Config) -> None:
        current_branch = self._get_current_branch(config)
        with open(config.refs_dir + "/" + current_branch, "w") as f:
            f.write(commit_hash)

        with open(config.head_path, "w") as f:
            f.write(current_branch)

Creating the tree object.

def _create_tree(self, index: Index, config: Config) -> str:
    """
    Creates the graph, keyed by directory path
    e.g. for the staged files "file" and "foo/bar"
    graph = {
        ".": ["./foo", "file"],
        "./foo": ["bar"]
    }

    Converts into a tree object
    """
    files_to_commit = [(".", file) for file in index.files]

    graph = defaultdict(list)
    while files_to_commit:
        parent, file = files_to_commit.pop()
        parts = file.split("/", 1)
        if len(parts) > 1:
            # Key each directory by its full path so that nested directories
            # resolve to the right location on disk
            child = parent + "/" + parts[0]
            files_to_commit.append((child, parts[1]))
            if child not in graph[parent]:  # avoid duplicate entries
                graph[parent].append(child)
        else:
            graph[parent].append(parts[0])

    tree = TreeObject(graph, root=".", objects_dir=config.objects_dir)
    return tree.save(config.objects_dir)
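To see what the graph looks like for a handful of staged paths, here is a standalone sketch of the graph-building loop. It keys every directory by its full path, which is an assumption I make here so that two nested directories with the same name cannot collide. The build_graph name and the sample paths are made up for the example.

```python
from collections import defaultdict


def build_graph(paths: list[str], root: str = ".") -> dict[str, list[str]]:
    """Map every directory (by full path) to its direct children."""
    graph = defaultdict(list)
    pending = [(root, path) for path in paths]
    while pending:
        parent, rest = pending.pop()
        head, _, tail = rest.partition("/")
        if tail:
            child = parent + "/" + head  # full path keeps nested names unique
            if child not in graph[parent]:
                graph[parent].append(child)
            pending.append((child, tail))
        else:
            # No separator left: this is a file inside parent.
            graph[parent].append(head)
    return dict(graph)


graph = build_graph(["readme.md", "src/a.py", "src/lib/b.py"])
print(graph)
```

Every key of the result is a directory and every value that is not itself a key is a file, which is exactly the distinction TreeObject relies on when it decides between a blob and a subtree.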

Running the command

$ python ../main.py commit -m "initial commit"

# output
[7d88a64fa79aa10a4231070b6d28485d5dd2306d] initial commit
1 files changed

$ tree .tinygit/

# output
.tinygit/
├── HEAD
├── index
├── objects
│   ├── 7d88a64fa79aa10a4231070b6d28485d5dd2306d
│   ├── aecb3ab7682122917c875d74bde7335654918294
│   └── b3a1c367baa5bfaef9286b27eb59d969821a9373
└── refs
    └── main

3 directories, 6 files

The commit command created 3 new objects in the database:

A tree object
A commit object
A blob object

$ cat .tinygit/HEAD

# output
main

$ cat .tinygit/refs/main

# output
7d88a64fa79aa10a4231070b6d28485d5dd2306d
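The two files above are all that is needed to find the latest commit: HEAD names the branch, and refs/<branch> holds the hash. A sketch of that lookup is shown below. The current_commit_hash function is a hypothetical stand-in for the omitted _get_current_commit_hash helper, and returning None before the first commit is my assumption.

```python
import os
import tempfile
from typing import Optional


def current_commit_hash(repo_path: str) -> Optional[str]:
    """Follow HEAD -> refs/<branch> and return the commit hash, if any."""
    with open(os.path.join(repo_path, "HEAD")) as f:
        branch = f.read().strip()
    ref_path = os.path.join(repo_path, "refs", branch)
    if not os.path.exists(ref_path):
        return None  # no commit on this branch yet
    with open(ref_path) as f:
        return f.read().strip() or None


with tempfile.TemporaryDirectory() as repo_path:
    os.makedirs(os.path.join(repo_path, "refs"))
    with open(os.path.join(repo_path, "HEAD"), "w") as f:
        f.write("main")
    before = current_commit_hash(repo_path)
    with open(os.path.join(repo_path, "refs", "main"), "w") as f:
        f.write("7d88a64fa79aa10a4231070b6d28485d5dd2306d")
    after = current_commit_hash(repo_path)
    print(before, after)
```

The value returned here is what the next commit would record as its parent, so the None case corresponds to the very first commit in a repository.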

Conclusion

That was it for this post. I hope you enjoyed reading it as much as I enjoyed building it.

The code for this blog can be found here.

Exploring the internals of Git. (Part 2)
