What is Git made of? (2022)

Lobsters Hottest 2026/05/24 03:12 Tools

git version-control tutorial internals zlib sha1

Summary

An in-depth tutorial explaining Git internals, including objects, hashes, and how Git stores data, with Go and shell command examples.


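# What is Git made of?

Source: https://zserge.com/posts/git/

Git can be confusing. Git can be scary. The Git CLI is probably the least intuitive tool you have to use every day. But Git is also a very simple and cleverly designed version control system that fully deserves its popularity.

To prove it, I invite you to implement your own tiny Git, one that can create a local repository, commit a single file, show the commit log, and check out a given revision of that file. It won't take more than a few hundred lines of code, and we will try to keep things simple. The code examples are in Go, but any other language would suit this tutorial just as well.

## git init

What turns an empty *directory* into an empty Git *repository*? You have probably noticed that Git keeps all of its internal data in a hidden `.git` directory. In fact, creating a few special files and folders is enough for the Git CLI to treat it as a perfectly valid, empty repository:

```
$ mkdir -p .git/objects/info .git/objects/pack .git/refs/heads .git/refs/tags
$ echo "ref: refs/heads/main" > .git/HEAD
$ tree .git
.git
├── HEAD
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags
$ git symbolic-ref --short HEAD
main
$ git log
fatal: your current branch 'main' does not have any commits yet
```

With a few shell commands we have tricked Git into recognizing an empty repository with a single `main` branch and no commits. But what exactly is stored in the directories we have just created?

## Objects

Almost everything in Git is stored as an object: every source file you commit becomes a blob object, every commit is itself an object, and tags are objects too.

Suppose we commit a `file.txt` containing `hello\n` (6 bytes). This creates 3 objects: a *blob* (the actual file contents), a *tree* (a list of file names and permissions), and a *commit* (a reference to the committed tree, plus committer, timestamp and similar details).

For each object Git stores its type ("blob", "tree" or "commit") and its length in bytes. So our `hello\n` contents actually become the object data `blob 6\0hello\n`.

Git also uses compression to save disk space, so the object data is compressed with zlib before it is written to disk as a special file inside `.git/objects`.

## Hashes

Before we get into the details of writing objects, let's talk about Git hashes. Every object in a Git repository is uniquely identified by the SHA hash of its contents. Originally Git used only the SHA-1 hashing algorithm; newer versions can also use SHA-256, which is more resistant to collisions. SHA-1 is still the default in most Git setups, and it is what we will use here.

Back to our `file.txt` containing `hello\n`. The compressed contents of its blob object might look like this (using a simple Python one-liner for zlib compression):

```
$ python3 -c 'import sys,zlib; sys.stdout.buffer.write(zlib.compress(b"blob 6\0hello\n",6))' | hexdump -C
00000000  78 9c 4b ca c9 4f 52 30  63 c8 48 cd c9 c9 e7 02  |x.K..OR0c.H.....|
00000010  00 1d c5 04 14                                    |.....|
00000015
```

In practice, different zlib implementations may use different compression levels and settings, so the encoded bytes may differ. The SHA-1 hash, however, is computed from the uncompressed object data and is always the same:

```
$ printf "blob 6\0hello\n" | sha1sum
ce013625030ba8dba906f756967f9e9ca394464a  -
```

Now let's compare that with what the Git CLI produces in a test repository:

```
$ mkdir hello
$ cd hello
$ git init
$ echo "hello" > file.txt
$ git add file.txt
$ git commit -m 'initial commit'
$ git cat-file blob ce013625
hello
$ hexdump -C .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
00000000  78 01 4b ca c9 4f 52 30  63 c8 48 cd c9 c9 e7 02  |x.K..OR0c.H.....|
00000010  00 1d c5 04 14                                    |.....|
00000015
```

Git applies a small optimization when storing objects: the first two hex characters of the hash become a subdirectory name, and the remaining 38 characters become the name of the file that holds the compressed object data. Let's reproduce this behavior by hand, in a repository that does not contain this object yet (for example the empty one we assembled with `mkdir` earlier):

```
$ mkdir -p .git/objects/ce   # first two characters of the hash: "ce"
$ printf "\x78\x9c\x4b\xca\xc9\x4f\x52\x30\x63\xc8\x48\xcd\xc9\xc9\xe7\x02\x00\x1d\xc5\x04\x14" \
    > .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a   # the rest of the hash
# Now, can the Git CLI read our blob?
$ git cat-file blob ce0136
hello
```

## Writing objects

First, let's introduce a `Git` "class" as the main entry point for working with a repository. We also need a `Hash` type to handle encoding and decoding of hashes:

```go
package main

// Imports used by all the snippets in this article.
import (
	"bytes"
	"compress/zlib"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
	"time"
)

type Git struct {
	Dir    string // where `.git` lives
	Branch string // current branch, e.g. "main"
	User   string // author/committer name, used by AddCommit
	Email  string // author/committer email, used by AddCommit
}

// Hash holds the raw hash bytes; Git prints hashes in hexadecimal form.
type Hash []byte

// NewHash decodes a hex string (as stored in ref files) into a Hash.
func NewHash(b []byte) (Hash, error) {
	dec, err := hex.DecodeString(strings.TrimSpace(string(b)))
	if err != nil {
		return nil, err
	}
	return Hash(dec), nil
}

func (h Hash) String() string { return hex.EncodeToString(h) }
```

Since everything in Git is supposed to be compressed, we can start with two utility functions for compression and decompression:

```go
// zip compresses content with zlib.
func zip(content []byte) ([]byte, error) {
	b := &bytes.Buffer{}
	zw := zlib.NewWriter(b)
	if _, err := zw.Write(content); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return b.Bytes(), nil
}

// unzip decompresses zlib-compressed content.
func unzip(content []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(content))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```

Now we can write a helper method that stores an object in the Git repository:

```go
// Shorter than writing []byte(fmt.Sprintf(...)) everywhere, and we will need it a lot.
func (g *Git) fmt(format string, args ...any) []byte {
	return []byte(fmt.Sprintf(format, args...))
}

// write stores an object of the given type and raw content in .git:
//
//	g := &Git{Dir: ".git", Branch: "main"}
//	hash, err := g.write("blob", []byte("hello\n"))
func (g *Git) write(objType string, b []byte) (Hash, error) {
	// Prepend the "<type> <length>\x00" header.
	b = append(g.fmt("%s %d\x00", objType, len(b)), b...)
	bz, err := zip(b)
	if err != nil {
		return nil, err
	}
	// The hash is computed over the uncompressed data, header included.
	sum := sha1.Sum(b)
	hash := hex.EncodeToString(sum[:])
	dir := filepath.Join(g.Dir, "objects", hash[:2])
	obj := filepath.Join(dir, hash[2:])
	if err := os.MkdirAll(dir, 0755); err != nil {
		return nil, err
	}
	return sum[:], os.WriteFile(obj, bz, 0644)
}
```

Calling `g.write("blob", []byte("hello\n"))` creates the blob object whose checksum we computed earlier, and we can read it back with `git cat-file blob <hash>`.

## Initial commit

Time to go one step further and make the first commit to the new repository. We know that a commit references a tree object, and a tree object references the blob objects it contains. Creating a blob object looks rather simple:

```go
func (g *Git) AddBlob(data []byte) (Hash, error) {
	return g.write("blob", data)
}
```

Tree data consists of file permissions (such as `100644` for a regular file), file names, and the hashes of the blob objects holding their contents. That makes writing a tree object with a single file very easy as well:

```go
func (g *Git) AddTree(filename string, filedata []byte) (Hash, error) {
	hash, err := g.AddBlob(filedata)
	if err != nil {
		return nil, err
	}
	// One entry: "<mode> <name>\x00" followed by the raw 20-byte blob hash.
	content := append(g.fmt("100644 %s\x00", filename), hash...)
	return g.write("tree", content)
}
```

The last piece of the puzzle is the commit object. A commit normally has a tree reference (`tree <hash>`), some author and committer information (`author John <[email protected]> 1670000000 +0000`), and a commit message:

```go
func (g *Git) AddCommit(filename string, data []byte, parentHash Hash, msg string) (Hash, error) {
	hash, err := g.AddTree(filename, data)
	if err != nil {
		return nil, err
	}
	// The initial commit has no "parent" line.
	parent := ""
	if parentHash != nil {
		parent = fmt.Sprintf("parent %s\n", parentHash.String())
	}
	t := time.Now().Unix()
	content := g.fmt("tree %s\n%sauthor %s <%s> %d +0000\ncommitter %s <%s> %d +0000\n\n%s\n",
		hash, parent, g.User, g.Email, t, g.User, g.Email, t, msg)
	b, err := g.write("commit", content)
	if err != nil {
		return nil, err
	}
	return b, g.SetHead(b)
}

// SetHead points the current branch at the given commit.
func (g *Git) SetHead(h Hash) error {
	return os.WriteFile(filepath.Join(g.Dir, "refs", "heads", g.Branch), []byte(h.String()), 0644)
}
```

At the end of `AddCommit` we set the head of the current branch to the hash of the resulting commit. If we now make a commit with this code, the Git CLI will be able to show it in `git log`. But without a proper `parentHash`, the next `AddCommit` call replaces the previous commit as the branch head, and the history never contains more than one commit. Let's fix that.

## History

Commits are chained. To create a second commit we should read the contents of `.git/refs/heads/main` and use that hash as the `parentHash` of the new commit:

```go
// Head returns the hash of the latest commit on the current branch.
func (g *Git) Head() (Hash, error) {
	b, err := os.ReadFile(filepath.Join(g.Dir, "refs", "heads", g.Branch))
	if err != nil {
		return nil, err
	}
	return NewHash(b)
}
```

Once we can also read objects, this lets us walk back from the latest commit (the tip of the branch) to the initial commit (the one without a parent). Of course, in a "real" Git repository a commit may have several parents (after a merge, for example), but here we only consider a very simple single-branch, single-file repository.

To read an object by its hash we need to implement the inverse of the `write()` method:

```go
func (g *Git) read(objType string, hash Hash) ([]byte, error) {
	h := hash.String()
	dir := filepath.Join(g.Dir, "objects", h[:2])
	obj := filepath.Join(dir, h[2:])
	b, err := os.ReadFile(obj)
	if err != nil {
		return nil, err
	}
	b, err = unzip(b)
	if err != nil {
		return nil, err
	}
	// The header must start with the expected object type.
	if !bytes.HasPrefix(b, []byte(objType+" ")) {
		return nil, fmt.Errorf("not a %s object", objType)
	}
	// Skip everything up to and including the NUL that ends the header.
	n := bytes.IndexByte(b, 0)
	if n < 0 {
		return nil, fmt.Errorf("invalid %s", objType)
	}
	return b[n+1:], nil
}
```

We read the file from `.git/objects/<prefix>/<suffix>`, check the object type, skip the object length, and return the rest of the object contents. What remains is parsing the contents of each type to handle blobs, trees and commits. Reading a blob is trivial, since nothing needs to be parsed:

```go
func (g *Git) Blob(hash []byte) ([]byte, error) {
	return g.read("blob", hash)
}
```

Reading a tree is slightly more involved if we want to allow several files per tree:

```go
type Tree struct {
	Blobs []Blob
	Hash  Hash
}

type Blob struct {
	Name string
	Hash Hash
}

func (g *Git) Tree(hash []byte) (tree *Tree, err error) {
	b, err := g.read("tree", hash)
	if err != nil {
		return nil, err
	}
	tree = &Tree{Hash: hash}
	// Each entry is "<mode> <name>\x00" followed by a raw 20-byte hash.
	for len(b) > 0 {
		parts := bytes.SplitN(b, []byte{0}, 2)
		if len(parts) < 2 || len(parts[1]) < 20 {
			return nil, fmt.Errorf("invalid tree entry")
		}
		fields := bytes.SplitN(parts[0], []byte{' '}, 2)
		if len(fields) < 2 {
			return nil, fmt.Errorf("invalid tree entry")
		}
		tree.Blobs = append(tree.Blobs, Blob{
			Name: string(fields[1]),
			Hash: parts[1][:20],
		})
		b = parts[1][20:]
	}
	return tree, nil
}
```

Here we parse the tree contents in a loop and create blob records. We do not read the blobs themselves, we only store their file names and hashes.

All that is left is reading a commit object by its hash, and then we can implement `git log`! The commit parser is very similar to the tree parser: it iterates over the contents line by line and fills in the commit information based on each line's prefix:

```go
type Commit struct {
	Msg    string
	Parent Hash
	Tree   Hash
	Hash   Hash
}

func (g *Git) Commit(hash []byte) (ci Commit, err error) {
	ci = Commit{Hash: hash}
	b, err := g.read("commit", hash)
	if err != nil {
		return ci, err
	}
	lines := bytes.Split(b, []byte{'\n'})
	for i, line := range lines {
		if len(line) == 0 {
			// A blank line separates the headers from the message.
			ci.Msg = string(bytes.Join(lines[i+1:], []byte{'\n'}))
			return ci, nil
		}
		parts := bytes.SplitN(line, []byte{' '}, 2)
		if len(parts) < 2 {
			continue
		}
		switch string(parts[0]) {
		case "tree":
			ci.Tree, err = hex.DecodeString(string(parts[1]))
			if err != nil {
				return ci, err
			}
		case "parent":
			ci.Parent, err = hex.DecodeString(string(parts[1]))
			if err != nil {
				return ci, err
			}
		}
	}
	return ci, nil
}
```

Simple as it is, this code should work even on complex Git repositories, although it only follows a single line of history and ignores merges. To support multiple parents, the parent hashes would have to be appended to a slice.

Another exercise worth trying is implementing tags. Tags are similar to `heads` (branches): they are text files inside `.git/refs/tags` that contain the hash of a commit (lightweight tags) or of a tag object (annotated tags).

The storage mechanism we have just implemented is known as "loose objects". There is also a more efficient (and more complicated) way of storing objects, called "packfiles". A packfile is an archive of objects, similar to a tarball, in which some objects can be stored as a difference (delta) against another object in the pack. The packfile format is well documented and not hard to parse, but it is time to wrap up our Git story.

The complete example, `nanogit.go`, which implements `git init`, `git commit`, `git checkout` and `git log`, can be found in this [gist](https://gist.github.com/zserge/549317af15bc3aead966df462a7d5216). In under 300 lines of code it covers most of the basic Git concepts: refs, objects, hashes and the storage system. With some effort it could be extended to support tags, multiple files and multiple parents, and eventually become a small Git library for the programming language of your choice.

I hope you have enjoyed this article. You can follow (or contribute) on [Github](https://github.com/zserge), [Mastodon](https://mastodon.social/@zserge), [Twitter](https://twitter.com/zsergo), or subscribe via [rss](https://zserge.com/rss.xml).

*December 4, 2022*

See also: [Post-apocalyptic programming](https://zserge.com/posts/post-apocalyptic-programming/) and [more posts](https://zserge.com/posts/).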
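The multi-parent extension suggested in the closing paragraphs could look roughly like the sketch below. It is not part of `nanogit.go`: `MergeCommit`, `parseCommit` and `sampleBody` are hypothetical names introduced here, and the hashes and identity in the sample commit are made-up placeholders. The function works on an already decompressed commit body (what the article's `read("commit", hash)` returns), so it stays independent of the storage code.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"strings"
)

// Hash mirrors the article's Hash type: raw hash bytes, printed as hex.
type Hash []byte

func (h Hash) String() string { return hex.EncodeToString(h) }

// MergeCommit is a hypothetical variant of the article's Commit struct
// that keeps every parent instead of a single one.
type MergeCommit struct {
	Tree    Hash
	Parents []Hash
	Msg     string
}

// parseCommit parses the uncompressed body of a commit object, i.e. the
// bytes that follow the "commit <len>\x00" header.
func parseCommit(body []byte) (MergeCommit, error) {
	var c MergeCommit
	lines := bytes.Split(body, []byte{'\n'})
	for i, line := range lines {
		if len(line) == 0 {
			// A blank line ends the headers; the rest is the message.
			c.Msg = string(bytes.Join(lines[i+1:], []byte{'\n'}))
			return c, nil
		}
		parts := bytes.SplitN(line, []byte{' '}, 2)
		if len(parts) < 2 {
			continue
		}
		switch string(parts[0]) {
		case "tree":
			h, err := hex.DecodeString(string(parts[1]))
			if err != nil {
				return c, fmt.Errorf("bad tree hash: %w", err)
			}
			c.Tree = h
		case "parent":
			h, err := hex.DecodeString(string(parts[1]))
			if err != nil {
				return c, fmt.Errorf("bad parent hash: %w", err)
			}
			// A merge commit has one "parent" line per parent.
			c.Parents = append(c.Parents, h)
		}
	}
	return c, nil
}

// sampleBody builds a merge commit body with placeholder hashes.
func sampleBody() []byte {
	return []byte("tree " + strings.Repeat("a", 40) + "\n" +
		"parent " + strings.Repeat("1", 40) + "\n" +
		"parent " + strings.Repeat("2", 40) + "\n" +
		"author Jane <jane@example.com> 1670000000 +0000\n" +
		"committer Jane <jane@example.com> 1670000000 +0000\n" +
		"\n" +
		"Merge branch 'topic'\n")
}

func main() {
	c, err := parseCommit(sampleBody())
	if err != nil {
		panic(err)
	}
	fmt.Println("tree:", c.Tree)
	for _, p := range c.Parents {
		fmt.Println("parent:", p)
	}
	fmt.Print(c.Msg)
}
```

A `git log` that follows merges would then visit every entry in `Parents` rather than a single `Parent`, remembering which hashes it has already seen so that shared ancestors are printed only once.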
