Exploring the .git folder of TensorFlow


Reading time: 40 minutes

We will explore the .git folder of TensorFlow along with the .github folder and the .gitignore file. The first step is to get the TensorFlow source code from GitHub. We will clone it using the following command in the command line:

git clone https://github.com/tensorflow/tensorflow

TensorFlow will be cloned and the output of the above command will be like:

Cloning into 'tensorflow'...
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 684345 (delta 13), reused 21 (delta 5), pack-reused 684298
Receiving objects: 100% (684345/684345), 389.45 MiB | 3.09 MiB/s, done.
Resolving deltas: 100% (555319/555319), done.
Checking out files: 100% (20219/20219), done.

Move into the tensorflow directory:

cd tensorflow

We need to check the contents of the tensorflow directory to decide which folders to explore in order to understand Git. ls is the command that lists the contents of a directory.

ls

The output will be like:

ACKNOWLEDGMENTS     CODEOWNERS       ISSUE_TEMPLATE.md  tensorflow
ADOPTERS.md         configure        LICENSE            third_party
arm_compiler.BUILD  configure.cmd    models.BUILD       tools
AUTHORS             configure.py     README.md          WORKSPACE
BUILD               CONTRIBUTING.md  RELEASE.md
CODE_OF_CONDUCT.md  ISSUES.md        SECURITY.md

The problem is that the ls command does not list hidden files and folders. To list all contents, including hidden ones, we need to use:

ls -a

The output will be:

.                   BUILD               .git               README.md
..                  CODE_OF_CONDUCT.md  .github            RELEASE.md
ACKNOWLEDGMENTS     CODEOWNERS          .gitignore         SECURITY.md
ADOPTERS.md         configure           ISSUES.md          tensorflow
arm_compiler.BUILD  configure.cmd       ISSUE_TEMPLATE.md  third_party
AUTHORS             configure.py        LICENSE            tools
.bazelrc            CONTRIBUTING.md     models.BUILD       WORKSPACE

We can see several entries of interest:

  • .git (a folder)
  • .gitignore (a file)
  • .github (for GitHub) (We will explore it later)

Before we start exploring, let us check out a branch and a tag so that Git records some activity.

Let us check out the r1.14 branch and then go back to the master branch as follows:

git checkout r1.14
git checkout master

Let us check out a tag from TensorFlow as:

git checkout v1.14.0

The output of the command will be:

Checking out files: 100% (9746/9746), done.
Note: checking out 'v1.14.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 87989f6959 Add Sergii Khomenko to contributor list

Finally, return to the master branch:

git checkout master

With this, we have created some log entries and checked out a branch and a tag. Let us get started now.

.git folder

Let us go into the .git folder:

cd .git

We will list out all contents using ls -a:

.   branches  description  hooks  info  objects      refs
..  config    HEAD         index  logs  packed-refs

Folders: branches, hooks, info, objects, refs, logs

Files: description, config, HEAD, index, packed-refs
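The split into folders and files can be checked with a small loop instead of reading the listing by eye. The following is a minimal sketch; list_git_entries is a helper name chosen for this illustration, not a Git command.

```shell
# list_git_entries DIR
# Prints every visible entry of DIR, labelled as a folder or a file.
# (list_git_entries is a hypothetical helper name used for illustration.)
list_git_entries() (
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue        # skip the literal pattern if DIR is empty
        if [ -d "$entry" ]; then
            printf 'folder: %s\n' "${entry##*/}"
        else
            printf 'file:   %s\n' "${entry##*/}"
        fi
    done
)
```

From inside the .git folder, it would be called as list_git_entries . and should label logs as a folder.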

All of these folders and files are of interest, and we will go through them one by one.

branches folder

If we go into the branches folder and list its contents, we will see that it is empty. This is because the branches folder is deprecated. Early on (before 2009), it was used to store shorthands for the URLs used by operations like git fetch, but Git switched to a different approach.

For this reason, some repositories may not have this folder. Recently (in 2017), Git brought back the branches folder, but it is not being used for any feature.

Hence, this folder will remain empty.

hooks folder

The hooks folder holds scripts that Git runs before or after specific operations such as commit, push or receive. This is especially useful when the project is large and involves a large number of components or checks. For example, before every git push, one can check whether the build of TensorFlow is passing and take action accordingly.

The contents of the hooks folder are as follows:

.                          post-update.sample         pre-rebase.sample
..                         pre-applypatch.sample      pre-receive.sample
applypatch-msg.sample      pre-commit.sample          update.sample
commit-msg.sample          prepare-commit-msg.sample
fsmonitor-watchman.sample  pre-push.sample

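As we see, there are 11 sample hooks. They are not specific to TensorFlow: hooks are not transferred by git clone, and these are the samples that Git places in a new repository by default. Git ignores a hook file as long as it carries the .sample suffix. For example, the file pre-push.sample is a shell script that would run before every git push once it is renamed to pre-push and made executable. Let us check this file by opening it: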

#!/bin/sh
remote="$1"
url="$2"
z40=0000000000000000000000000000000000000000
while read local_ref local_sha remote_ref remote_sha
do
        if [ "$local_sha" = $z40 ]
        then
                # Handle delete
                :
        else
                if [ "$remote_sha" = $z40 ]
                then
                        # New branch, examine all commits
                        range="$local_sha"
                else
                        # Update to existing branch, examine new commits
                        range="$remote_sha..$local_sha"
                fi
                # Check for WIP commit
                commit=`git rev-list -n 1 --grep '^WIP' "$range"`
                if [ -n "$commit" ]
                then
                        echo >&2 "Found WIP commit in $local_ref, not pushing"
                        exit 1
                fi
        fi
done
exit 0

As we see, it prevents pushing commits whose log message has a line starting with "WIP", which denotes that the commit is work in progress and shall be pushed later.

These sample hooks are provided by Git by default and have not been customised by TensorFlow. We can add new hooks or adapt a sample to suit our own workflow.

The other files are super interesting as well. Do check them out and enjoy with a drink.
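Because Git skips any hook file that still ends in .sample, a sample has to be copied or renamed before it takes effect. The following is a minimal sketch of that step; enable_hook is a helper name chosen for this illustration, not a Git command.

```shell
# enable_hook HOOKS_DIR HOOK_NAME
# Copies HOOK_NAME.sample to HOOK_NAME and marks the copy executable.
# (enable_hook is a hypothetical helper name used for illustration.)
enable_hook() (
    hooks_dir=$1
    hook=$2
    sample="$hooks_dir/$hook.sample"
    if [ ! -f "$sample" ]; then
        echo "no sample hook at $sample" >&2
        exit 1
    fi
    cp "$sample" "$hooks_dir/$hook"
    chmod +x "$hooks_dir/$hook"
)
```

From the root of the clone, enable_hook .git/hooks pre-push would activate the "WIP" check for this clone only. Other clones are unaffected because the hooks folder is never pushed or fetched.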

info folder

It has a file named exclude which contains patterns for files that Git should ignore. In our clone of TensorFlow, the contents of this file are as follows:

# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~

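As we see, every line is a comment, so no exclude patterns are active. This is the default file that Git creates in a new repository. It is local to this clone and is never shared, so it is the place for personal ignore patterns, while patterns meant for every contributor go into the .gitignore file.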
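Adding a personal pattern is just a matter of appending a line to this file. The following is a minimal sketch that avoids duplicate entries; add_local_exclude is a helper name chosen for this illustration.

```shell
# add_local_exclude GIT_DIR PATTERN
# Appends PATTERN to GIT_DIR/info/exclude unless that exact line already exists.
# (add_local_exclude is a hypothetical helper name used for illustration.)
add_local_exclude() (
    exclude_file="$1/info/exclude"
    pattern=$2
    mkdir -p "$1/info"
    touch "$exclude_file"
    # -x matches whole lines only, -F treats the pattern as a fixed string
    if ! grep -qxF -e "$pattern" "$exclude_file"; then
        printf '%s\n' "$pattern" >> "$exclude_file"
    fi
)
```

From the root of the clone, add_local_exclude .git '*.log' would hide untracked .log files from git status in this clone only.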

objects folder

The objects folder has two folders within it, namely:

  • info
  • pack

The info folder is empty. The pack folder has two files as follows:

.   pack-1759b2236450cbd53a5a2aa4ef109e12b48aaade.idx
..  pack-1759b2236450cbd53a5a2aa4ef109e12b48aaade.pack

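The objects folder is Git's object database: every commit, tree, blob (file content) and tag is stored here under its hash. In a fresh clone, all of these objects arrive compressed inside the single .pack file shown above, and the .idx file is an index into it, which is why we see no other files. Once you make some commits to fix a bug or add a feature, new objects are written as individual ("loose") files in subfolders named after the first two characters of their hash, which you can explore.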
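For a loose object, the location follows directly from its hash: the first two hexadecimal characters name the subfolder and the remaining characters name the file. The following is a minimal sketch of that mapping; loose_object_path is a helper name chosen for this illustration.

```shell
# loose_object_path HASH
# Prints the path where a loose object with this hash would be stored.
# (loose_object_path is a hypothetical helper name used for illustration.)
loose_object_path() (
    hash=$1
    rest=${hash#??}            # the hash without its first two characters
    prefix=${hash%"$rest"}     # the first two characters
    printf '.git/objects/%s/%s\n' "$prefix" "$rest"
)
```

The printed file exists only when the object is stored loose. In a fresh clone the same object lives inside the packfile instead.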

refs folder

The refs folder has three folders within it:

  • heads
  • remotes
  • tags

The heads folder has one file for each branch that we have checked out in our clone. As we have checked out a second branch (r1.14) as well, we have two files, namely:

  • master
  • r1.14

Each file holds the hash of the latest commit on that branch, which is where Git moves to when the branch is checked out. The content of the master file is:

0fa9f305bb24b2222ddff8ed0300c2e77c9cb96e

The content of the r1.14 file is:

00fad90125b18b80fe054de1055770cfb8fe4ba3

We have one remote repository, so the remotes folder has only one folder, namely:

  • origin

The origin folder has a file named HEAD with the following content:

ref: refs/remotes/origin/master

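If we add a new remote and fetch from it, a new folder named after that remote will appear here.

The tags folder contains no files or folders. The tags fetched from the remote (such as v1.14.0) are recorded in the packed-refs file instead, and we have not created any tags of our own.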

logs folder

In the logs folder, we have one file HEAD and one folder refs. This refs folder mirrors the structure of the refs folder above and tracks the log of activity for each ref. The HEAD file has stored the logs as:

0000000000000000000000000000000000000000 0fa9f305bb24b2222ddff8ed0300c2e77c9cb96e opengenus <team@opengenus.org> 1568766544 -0400        clone: from https://github.com/tensorflow/tensorflow
0fa9f305bb24b2222ddff8ed0300c2e77c9cb96e 00fad90125b18b80fe054de1055770cfb8fe4ba3 opengenus <team@opengenus.org> 1568767275 -0400        checkout: moving from master to r1.14
00fad90125b18b80fe054de1055770cfb8fe4ba3 0fa9f305bb24b2222ddff8ed0300c2e77c9cb96e opengenus <team@opengenus.org> 1568767395 -0400        checkout: moving from r1.14 to master
0fa9f305bb24b2222ddff8ed0300c2e77c9cb96e 87989f69597d6b2d60de8f112e1e3cea23be7298 opengenus <team@opengenus.org> 1568774984 -0400        checkout: moving from master to v1.14.0
87989f69597d6b2d60de8f112e1e3cea23be7298 0fa9f305bb24b2222ddff8ed0300c2e77c9cb96e opengenus <team@opengenus.org> 1568775029 -0400        checkout: moving from 87989f69597d6b2d60de8f112e1e3cea23be7298 to master

As we can see, it has captured the activity of the git commands we used. From the logs, we can infer the following activity:

  • cloned tensorflow
  • moved from master to r1.14
  • moved from r1.14 to master
  • moved from master to v1.14.0
  • moved from v1.14.0 to master
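The same summary can be extracted mechanically. Each log line holds the old hash, the new hash, the author, a timestamp and a timezone, followed by a tab character (displayed as spaces above) and the message. The following is a minimal sketch under that assumption; reflog_summary is a helper name chosen for this illustration.

```shell
# reflog_summary FILE
# Prints "<old hash, shortened> -> <new hash, shortened>: <message>" per log line.
# (reflog_summary is a hypothetical helper name used for illustration.)
reflog_summary() (
    tab=$(printf '\t')
    while IFS= read -r line; do
        meta=${line%%"$tab"*}      # everything before the first tab
        message=${line#*"$tab"}    # everything after the first tab
        old=${meta%% *}            # first space-separated field
        rest=${meta#* }
        new=${rest%% *}            # second space-separated field
        printf '%.10s -> %.10s: %s\n' "$old" "$new" "$message"
    done < "$1"
)
```

From the root of the clone, it would be called as reflog_summary .git/logs/HEAD. The git reflog command presents the same information in Git's own format.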

Files

Let us check the files one by one. We had the following files: description, config, HEAD, index, packed-refs

The content of the config file is as follows:

[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[remote "origin"]
        url = https://github.com/tensorflow/tensorflow
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master
[branch "r1.14"]
        remote = origin
        merge = refs/heads/r1.14

It records the URL of each remote and, for each local branch, the remote and the remote branch it tracks. It also holds a few core parameters such as logallrefupdates which, if set to true, makes Git record every ref update in the logs folder.

We have only one remote and two branches.
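The section and key layout of this file can be flattened into dotted names such as remote.origin.url. The following is a simplified sketch that handles only plain "key = value" lines like the ones shown above (no comments or continuation lines); list_config is a helper name chosen for this illustration.

```shell
# list_config FILE
# Prints each setting of a Git config file as "section.key = value".
# (list_config is a hypothetical helper name used for illustration.)
list_config() {
    awk '
        /^\[/ {
            # Section header such as [core] or [remote "origin"]
            section = $0
            gsub(/^\[|\]$/, "", section)   # strip the brackets
            gsub(/"/, "", section)         # strip quotes around the subsection
            sub(/ +/, ".", section)        # "remote origin" becomes "remote.origin"
            next
        }
        /=/ {
            key = $1
            value = $0
            sub(/^[^=]*= */, "", value)    # keep what follows the first "="
            print section "." key " = " value
        }
    ' "$1"
}
```

Git's own git config --list command is the authoritative way to read these settings; the sketch only illustrates how the file is laid out.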

The content of the description file is as follows:

Unnamed repository; edit this file 'description' to name the repository.

This file holds a short description of the repository, which is read by tools such as GitWeb and by some hook scripts.

The content of the HEAD file is:

ref: refs/heads/master
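HEAD therefore does not hold a hash itself but names the ref of the current branch; in a detached HEAD state it holds the hash directly. The following is a minimal sketch of following that pointer by hand, reading only individual ref files; resolve_head is a helper name chosen for this illustration.

```shell
# resolve_head GIT_DIR
# Prints the commit hash that HEAD currently points to.
# (resolve_head is a hypothetical helper name used for illustration.)
resolve_head() (
    git_dir=$1
    head=$(cat "$git_dir/HEAD")
    case $head in
        "ref: "*)
            ref=${head#"ref: "}
            if [ -f "$git_dir/$ref" ]; then
                cat "$git_dir/$ref"
            else
                # The ref may be recorded in packed-refs instead of its own file.
                echo "no ref file for $ref" >&2
                exit 1
            fi
            ;;
        *)
            # Detached HEAD: the file holds the hash itself.
            printf '%s\n' "$head"
            ;;
    esac
)
```

From the root of the clone, resolve_head .git should print the same hash as git rev-parse HEAD, as long as the current branch has its own file under refs/heads.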

The index file is a binary file which holds the list of tracked files along with their permissions and the SHA-1 hash of the associated objects. We can see its content using the git ls-files command (git ls-files --stage also shows the mode and hash of each entry).

git ls-files

A small section of the output:

tensorflow/core/kernels/gather_nd_op.h
tensorflow/core/kernels/gather_nd_op_cpu_impl.h
tensorflow/core/kernels/gather_nd_op_cpu_impl_0.cc
tensorflow/core/kernels/gather_nd_op_cpu_impl_1.cc

The first few lines of the packed-refs file are as follows:

# pack-refs with: peeled fully-peeled sorted
4be56f381cd000e91f79209aaf150636db6fb840 refs/remotes/origin/0.6.0
807f95063c1e1072fe5b936abf529e133010ec46 refs/remotes/origin/1.8.0
ca2f3de8daa10b18fe2314b2494e94317885b928 refs/remotes/origin/backend_api_cherrypick
711d4fe8132c3cdd70c3230997189d1b87c695de refs/remotes/origin/bananabowl-patch-1
1e57145558a50a972963217c468118f5c3569364 refs/remotes/origin/cherrypick_batch_dot

Each line maps the hash of a commit to the name of a ref (a remote branch or a tag) that points to it. Git consults this file when a ref has no file of its own under the refs folder, for example when checking out a branch for the first time.

With this, we have explored the .git folder and are left with the .github folder and the .gitignore file. Let us dive into them.
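Looking up a ref in this file is a simple scan for the matching name. The following is a minimal sketch; packed_ref_hash is a helper name chosen for this illustration.

```shell
# packed_ref_hash PACKED_REFS_FILE REF_NAME
# Prints the hash recorded for REF_NAME; returns non-zero if it is not listed.
# (packed_ref_hash is a hypothetical helper name used for illustration.)
packed_ref_hash() {
    awk -v ref="$2" '
        /^#/ || /^\^/ { next }                 # skip the header and peeled-tag lines
        $2 == ref { print $1; found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

From the root of the clone, packed_ref_hash .git/packed-refs refs/remotes/origin/1.8.0 would print the hash listed for that remote branch.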

.github folder

The .github folder has only one folder:

.  ..  ISSUE_TEMPLATE

ISSUE_TEMPLATE has the following files within it:

.   00-bug-performance-issue.md     20-documentation-issue.md  40-tflite-op-request.md
..  10-build-installation-issue.md  30-feature-request.md      50-other-issues.md

These files contain templates for filing different kinds of issues on GitHub. The files are used as follows:

  • 00-bug-performance-issue.md: template for reporting a bug or a performance issue
  • 10-build-installation-issue.md: template for build/installation issues
  • 20-documentation-issue.md: template for reporting documentation issues
  • 30-feature-request.md: template for opening a feature request
  • 40-tflite-op-request.md: template for requesting an op in TensorFlow Lite
  • 50-other-issues.md: template for any other non-support related issues

.gitignore file

The .gitignore file lists patterns for files and folders that Git should not track. Untracked files that match these patterns do not show up as changes and are not added by accident. Unlike .git/info/exclude, this file is committed to the repository and shared with everyone who clones it.

The contents of the .gitignore file of TensorFlow are as follows:

.DS_Store
.ipynb_checkpoints
node_modules
/.bazelrc.user
/.tf_configure.bazelrc
/bazel-*
/bazel_pip
/tools/python_bin_path.sh
/tensorflow/tools/git/gen
/pip_test
/_python_build
*.pyc
__pycache__
*.swp
.vscode/
cmake_build/
tensorflow/contrib/cmake/_build/
.idea/**
/build/
[Bb]uild/
/tensorflow/core/util/version_info.cc
/tensorflow/python/framework/fast_tensor_util.cpp
/tensorflow/lite/gen/**
/tensorflow/lite/tools/make/downloads/**
/api_init_files_list.txt
/estimator_api_init_files_list.txt
*.whl

# Android
.gradle
.idea
*.iml
local.properties
gradleBuild

# iOS
*.pbxproj
*.xcworkspace
/*.podspec
/tensorflow/lite/**/[ios|objc|swift]*/BUILD
/tensorflow/lite/examples/ios/simple/data/*.tflite
/tensorflow/lite/examples/ios/simple/data/*.txt
Podfile.lock
Pods
xcuserdata

With this, we have explored the entire Git setup of a TensorFlow clone, and you must have learnt a lot in the process: how Git stores refs, logs and objects, what the default sample hooks (such as the "WIP" pre-push check) do, how TensorFlow uses custom GitHub issue templates, and much more.

Enjoy!