Recent comments posted to this site:
git-annex looks at the file's stat() and only uses cp if the device id is the same as that of the destination directory. If you see it running rsync instead, it's under the perhaps mistaken impression that it's a cross-device copy.
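The same-device test described above can be mimicked from the shell. This is only an illustrative sketch of the idea, not git-annex's actual code (and `stat -c` is GNU coreutils; BSD/macOS uses `stat -f`):

```shell
# Compare st_dev of a source file and a destination directory.
# Same device id: cp (possibly with reflink) is usable.
# Different device ids: a cross-device copy, so rsync would be used.
dir=$(mktemp -d)
touch "$dir/file"
src_dev=$(stat -c %d "$dir/file")   # device id of the source file
dst_dev=$(stat -c %d "$dir")        # device id of the destination dir
if [ "$src_dev" = "$dst_dev" ]; then
    echo "same device"
else
    echo "cross-device"
fi
```

Here the source and destination are in the same temporary directory, so the check reports "same device".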
The bup special remote does exist, so if you want to use that special remote, you can get efficient storage and transfer of related versions. It would probably be possible to make bup use the same git repo as git-annex, just storing its data in a separate branch, but I have not tried it.
Are there plans to have chunks stored in the regular backend storage?
I'm curious because one of my use cases is archiving websites, where we end up with lots of WARC files. Those files are basically a bunch of files from the website gzipped together in a stream, which means that multiple crawls of the same website (or even of different websites) contain lots of redundant data (e.g. jQuery.js). Storing those files in git-annex is not very efficient, because that data is duplicated all over the place.
If the storage backend were chunked, there could be massive deduplication across those files... This is why I looked at the borg special remote: I figured that I could at least deduplicate on the remote side, but it would still be nice to have this built-in! -- anarcat
I think that annex always uses cp --reflink=auto for local paths (the cache remote was on a local path, right?). I guess running with --debug could have helped to resolve the mystery ;-)
BTW -- I checked locally: reflink=auto seems to work nicely across subvolumes of the same BTRFS filesystem. "Copying" gigabytes takes half a second or so ;-) (without reflink=auto it takes considerably longer)
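For anyone wanting to try this themselves, a minimal sketch (the paths are just examples; it runs on any filesystem, because --reflink=auto silently falls back to a byte-for-byte copy where cloning isn't supported, whereas --reflink=always would fail instead of falling back):

```shell
# Copy a file with --reflink=auto: a CoW clone where the filesystem
# supports it (e.g. within a BTRFS filesystem), a plain copy otherwise.
dir=$(mktemp -d)
printf 'some annexed data\n' > "$dir/a"
cp --reflink=auto "$dir/a" "$dir/b"
cmp "$dir/a" "$dir/b" && echo "copies match"
```

On BTRFS you can time the same command on a multi-gigabyte file to see whether a clone (near-instant) or a full copy (seconds to minutes) happened.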
Well, I did not check reflink=auto. I just checked first with git annex copy --to=cache, which simply duplicated every file. So there are two possibilities:
- reflink=auto doesn't work (I did not check, but I think it works)
- git-annex does not recognise the other path as being on the same filesystem, so it fell back to a plain copy (very likely, but joey has to check)
Hi Mowgli, could you please elaborate for a slow me -- are you saying that --reflink=auto is not causing CoW between different subvolumes of the same filesystem, while --reflink=always does?
P.S. glad to see more of BTRFS & git-annex tandem users around ;-)
It doesn't work well if the source of the copy is in a btrfs subvolume and the cache is in another subvolume of the same filesystem.
With that setup every file is really copied instead of using reflink=always.
I currently solved it by manually copying .git/annex/objects into the cache (cp -a --reflink=always .git/annex/objects ~/.cache/annex/) and afterwards running git annex copy, which recognises the existence of the objects.
The scenario that isStableKey is being used to guard against is two repos downloading the content of an url and each getting different content, followed by one repo uploading some chunks of its content and then the other repo "finishing" the upload with chunks of its different content. That would result in a mishmash of chunks being stored in the remote.
It's true that it could also happen using WORM with an url attached to it. (Not with other types of keys that verify a checksum.) Though it seems much less likely, since the file size is at least checked for WORM, while with URL keys there's often no recorded file size. And, WORMs don't typically have urls attached (I can't think of a single time I've ever done that, it just feels like asking for trouble), while URL keys always do.
If this is a serious concern, I'd suggest you open a todo or bug report about it; there are far too many comments to wade through here already. We could think about perhaps not allowing download of WORM keys from urls, or something like that.