Using DVC and Nix for asset version control

Git is the version control tool of choice for most people. One area where it unfortunately falls short is the storage of very large files, especially binary files. Git LFS, the official solution to this problem, is sub-par and, according to many anecdotes, not worth the trouble.

DVC is a good candidate for an alternative. Originating from the machine learning community, it has some odd quirks (content-addressing files via insecure MD5 hashes, analytics spyware), but nothing a couple of patches can’t fix. Patching DVC to use SHA256 instead of MD5, and JSON instead of YAML for its .dvc files, creates a workable tool for dealing with large files.

DVC uploads files to a configurable remote, such as an S3 bucket, and leaves a .dvc file in the Git repo containing the SHA256 hash of the file. The file is named by its hash on the remote, so knowing the hash and a base URL is enough to download it. DVC asset files can therefore be fetched with Nix by reading the .dvc file as JSON and reconstructing the URL from the hash:

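# fetchurl comes from the package set, e.g. via callPackage.
{ fetchurl }:

# cdnURL: base URL of the DVC remote; index: path to a .dvc file.
{ cdnURL ? "https://cdn.privatevoid.net/assets", index }:

let
  # The patched .dvc file is plain JSON.
  dvc = builtins.fromJSON (builtins.readFile index);

  # Only the first entry in outs is used.
  inherit (builtins.head dvc.outs) sha256 path;

  # The remote stores each file as <first 2 hex chars>/<rest of the hash>.
  hashPrefix = builtins.substring 0 2 sha256;
  # A length of -1 takes the rest of the string.
  hashSuffix = builtins.substring 2 (-1) sha256;
in

fetchurl {
  name = path;
  url = "${cdnURL}/${hashPrefix}/${hashSuffix}";
  inherit sha256;
}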
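
As a usage sketch: assuming the function above is saved as fetch-dvc.nix and a DVC-tracked file logo.png.dvc sits next to it (both file names are made up for illustration), it could be called roughly like this:

```nix
# default.nix (hypothetical file name)
{ pkgs ? import <nixpkgs> { } }:

let
  # callPackage fills in fetchurl; what comes back is the inner
  # function that takes { cdnURL ? ..., index }.
  fetchDVC = pkgs.callPackage ./fetch-dvc.nix { };
in

# index points at the .dvc file that DVC left in the Git repo.
# If its hash starts with "ab", the download URL is
# <cdnURL>/ab/<remaining 62 hex characters>.
fetchDVC { index = ./logo.png.dvc; }
```

Because the .dvc file is read with builtins.readFile at evaluation time, the hash is known before anything is downloaded, and fetchurl verifies the download against that same hash.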

This idea can be expanded to automatically replace every file ending in .dvc in a source tree with the corresponding asset as downloaded by fetchurl. A source directory with DVC-managed files in it can then be passed into a Nix build using something as simple as src = hydrateAssetDirectory ./.;.
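
The implementation of hydrateAssetDirectory is not shown here, so the following is only one possible shape for it, not the actual code. It assumes that src is a local path (so the .dvc files can be read at evaluation time), that fetchDVC is the function from earlier passed in as an argument, and that each asset should land next to its .dvc file under the same name minus the .dvc suffix. The file name and the derivation name are made up.

```nix
# hydrate-asset-directory.nix (hypothetical file name)
{ lib, runCommand, fetchDVC }:

src:

let
  root = toString src;

  # Every file under src whose name ends in .dvc, relative to src.
  dvcFiles = map (p: lib.removePrefix "${root}/" (toString p))
    (lib.filter (p: lib.hasSuffix ".dvc" (toString p))
      (lib.filesystem.listFilesRecursive src));

  # Shell snippet that swaps one .dvc file for the fetched asset.
  # Assumption: foo.png.dvc describes foo.png in the same directory.
  hydrate = rel:
    let
      asset = fetchDVC { index = src + "/${rel}"; };
    in ''
      rm "$out"/${lib.escapeShellArg rel}
      cp ${asset} "$out"/${lib.escapeShellArg (lib.removeSuffix ".dvc" rel)}
    '';
in

runCommand "hydrated-assets" { } ''
  # Copy the tree out of the store and make it writable.
  cp -r ${src} $out
  chmod -R u+w $out
  ${lib.concatMapStrings hydrate dvcFiles}
''
```

With something like hydrateAssetDirectory = pkgs.callPackage ./hydrate-asset-directory.nix { inherit fetchDVC; }; in scope, the src = hydrateAssetDirectory ./.; line from above would evaluate to a copy of the tree with every .dvc file replaced by its asset. Copying the assets rather than symlinking them costs store space, but it keeps the result a plain directory.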