Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


linux - Deduplicate Git forks on a server

Is there a way to hard-link all the duplicate objects in a folder containing multiple Git repositories?

Explanation:

I host a Git server on my company's Linux server. The idea is to have a main canonical repository to which users do not have push access. Instead, every user forks the canonical repository by cloning it into their home directory on the server, which hard-links the objects.

/canonical/Repo
/Dev1/Repo (objects hard-linked to /canonical/Repo when initially cloned)
/Dev2/Repo (objects hard-linked to /canonical/Repo when initially cloned)
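The hard-linking on the initial clone can be checked on a scratch copy. The following is only a sketch: the directory names mirror the layout above but live in a temporary directory, the author identity is a placeholder, and it assumes both repositories sit on one filesystem that supports hard links (otherwise Git copies the files instead).

```shell
set -eu
# Scratch area; every path below is a placeholder, not the real server layout.
root=$(mktemp -d)
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com

# Canonical repository with a single commit.
git init -q "$root/canonical/Repo"
echo "some data" > "$root/canonical/Repo/file.txt"
git -C "$root/canonical/Repo" add file.txt
git -C "$root/canonical/Repo" commit -q -m "initial commit"

# Fork by cloning from a local path (not a URL), as on the server.
git clone -q --bare "$root/canonical/Repo" "$root/Dev1/Repo"

# Object files whose link count is above 1 are shared with another path.
linked=$(find "$root/Dev1/Repo/objects" -type f -links +1 | wc -l)
echo "hard-linked object files in the fork: $linked"
```

Objects that arrive later through a push are written as new files, so they never show up in this count.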

This all works fine. The problem arises when:

Dev1: pushes a huge commit to his fork on the server (/Dev1/Repo).
Dev2: fetches that commit to his local system, makes his own changes, and pushes them to his own fork on the server (/Dev2/Repo).

(Now the same 'huge' file resides in both developers' forks on the server. Git does not create a hard link automatically.)

This is eating up my server space like crazy!

How can I create hard links between objects that are duplicated across the forks (or the canonical repository, for that matter), so that server space is saved and each developer still gets all the data when cloning from his/her fork to a local machine?



1 Answer


Now the same 'huge' file resides in both the developer's forks on the server. It does not create a hard-link automatically

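Actually, with Git 2.20 (Q4 2018), that issue might disappear thanks to delta islands: a new way of computing deltas so that an object that exists in one fork is not stored as a delta against an object that does not appear in the same fork. That makes it practical to keep all forks in one shared object store (through alternates or GIT_NAMESPACE), where each object is stored only once.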

See commit fe0ac2f, commit 108f530, commit f64ba53 (16 Aug 2018) by Christian Couder (chriscool).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
See commit 9eb0986, commit 16d75fa, commit 28b8a73, commit c8d521f (16 Aug 2018) by Jeff King (peff).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit f3504ea, 17 Sep 2018)

Add delta-islands.{c,h}

Hosting providers that allow users to "fork" existing repositories want those forks to share as much disk space as possible.

Alternates are an existing solution to keep all the objects from all the forks into a unique central repository, but this can have some drawbacks.
Especially when packing the central repository, deltas will be created between objects from different forks.

This can make cloning or fetching a fork much slower and much more CPU intensive as Git might have to compute new deltas for many objects to avoid sending objects from a different fork.

Because the inefficiency primarily arises when an object is deltified against another object that does not exist in the same fork, we partition objects into sets that appear in the same fork, and define "delta islands".
When finding delta base, we do not allow an object outside the same island to be considered as its base.

So "delta islands" is a way to store objects from different forks in the same repository and packfile without having deltas between objects from different forks.

This patch implements the delta islands mechanism in "delta-islands.{c,h}", but does not yet make use of it.

A few new fields are added in 'struct object_entry' in "pack-objects.h" though.
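The alternates mechanism mentioned in the commit message can be tried on a scratch copy. This is a minimal sketch, not the asker's setup: the repository names (canonical.git, dev1.git) and the author identity are placeholders, and `git clone --shared` is used here as one way of writing the fork's objects/info/alternates file.

```shell
set -eu
# Scratch area; all paths are placeholders.
root=$(mktemp -d)
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com

# Central repository with one commit.
git init -q "$root/work"
echo hello > "$root/work/file.txt"
git -C "$root/work" add file.txt
git -C "$root/work" commit -q -m "initial commit"
git clone -q --bare "$root/work" "$root/canonical.git"

# A fork that borrows objects from the central repository instead of copying them.
git clone -q --bare --shared "$root/canonical.git" "$root/dev1.git"

# The fork records where to look for borrowed objects.
alternates=$(cat "$root/dev1.git/objects/info/alternates")
echo "alternate object directory: $alternates"

# Report how many objects the fork stores itself.
git -C "$root/dev1.git" count-objects -v

# The commit is still readable in the fork, through the alternate.
head_type=$(git -C "$root/dev1.git" cat-file -t HEAD)
echo "HEAD resolves to a $head_type"
```

With this layout, objects must not be pruned from the central repository while a fork still borrows them, or the fork becomes corrupt.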

See the "DELTA ISLANDS" section of Documentation/git-pack-objects.txt:

DELTA ISLANDS

When possible, pack-objects tries to reuse existing on-disk deltas to avoid having to search for new ones on the fly. This is an important optimization for serving fetches, because it means the server can avoid inflating most objects at all and just send the bytes directly from disk.

This optimization can't work when an object is stored as a delta against a base which the receiver does not have (and which we are not already sending). In that case the server "breaks" the delta and has to find a new one, which has a high CPU cost. Therefore it's important for performance that the set of objects in on-disk delta relationships match what a client would fetch.

In a normal repository, this tends to work automatically.
The objects are mostly reachable from the branches and tags, and that's what clients fetch. Any deltas we find on the server are likely to be between objects the client has or will have.

But in some repository setups, you may have several related but separate groups of ref tips, with clients tending to fetch those groups independently.

For example, imagine that you are hosting several "forks" of a repository in a single shared object store, and letting clients view them as separate repositories through GIT_NAMESPACE or separate repositories using the alternates mechanism.

A naive repack may find that the optimal delta for an object is against a base that is only found in another fork.
But when a client fetches, they will not have the base object, and we'll have to find a new delta on the fly.

A similar situation may exist if you have many refs outside of refs/heads/ and refs/tags/ that point to related objects (e.g., refs/pull or refs/changes used by some hosting providers). By default, clients fetch only heads and tags, and deltas against objects found only in those other groups cannot be sent as-is.

Delta islands solve this problem by allowing you to group your refs into distinct "islands".

Pack-objects computes which objects are reachable from which islands, and refuses to make a delta from an object A against a base which is not present in all of A's islands. This results in slightly larger packs (because we miss some delta opportunities), but guarantees that a fetch of one island will not have to recompute deltas on the fly due to crossing island boundaries.
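As a rough sketch of how islands might be configured for forks kept in one shared store, the following assumes Git 2.20 or later and a hypothetical ref layout in which each fork's refs live under refs/virtual/<id>/. The repository name, fork ids, and branch name are placeholders. The settings used are `pack.island` (an extended regular expression whose capture group names the island) and `repack.useDeltaIslands`.

```shell
set -eu
# Scratch area; all paths and ref names are placeholders.
root=$(mktemp -d)
export GIT_AUTHOR_NAME=dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=dev GIT_COMMITTER_EMAIL=dev@example.com

# Shared object store with one commit.
git init -q "$root/work"
echo hello > "$root/work/file.txt"
git -C "$root/work" add file.txt
git -C "$root/work" commit -q -m "initial commit"
git clone -q --bare "$root/work" "$root/shared.git"

# Hypothetical per-fork refs inside the shared repository.
head=$(git -C "$root/shared.git" rev-parse HEAD)
git -C "$root/shared.git" update-ref refs/virtual/1/heads/main "$head"
git -C "$root/shared.git" update-ref refs/virtual/2/heads/main "$head"

# One island per fork id: the capture group becomes the island name.
git -C "$root/shared.git" config --add pack.island 'refs/virtual/([0-9]+)/heads/'
git -C "$root/shared.git" config --add pack.island 'refs/virtual/([0-9]+)/tags/'

# Ask git repack to pass --delta-islands to pack-objects.
git -C "$root/shared.git" config repack.useDeltaIslands true

# Repack everything into a single pack, honoring the islands.
git -C "$root/shared.git" repack -a -d -q
```

A one-off repack with `git repack -a -d -i` should have the same effect without setting `repack.useDeltaIslands`.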


One side effect, though: repacking with delta islands became more verbose, printing the island count even when progress output was disabled. Git 2.23 (Q3 2019) fixes this.

See commit bdbdf42 (20 Jun 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a4c8352, 09 Jul 2019)

delta-islands: respect progress flag

The delta island code always prints "Marked %d islands", even if progress has been suppressed with --no-progress or by sending stderr to a non-tty.

Let's pass a progress boolean to load_delta_islands().
We already do the same thing for the progress meter in resolve_tree_islands().

