Tracking files dependencies

kiryha · May 28, 2020, 3:43pm

Hi, Devs,

I am working on publishing the files for my pipeline.

I have a pretty static file structure typical for a common VFX pipeline:

In my case, publishing is just a record of the file version in the database. The published version becomes “final” and used in the further pipeline steps. E.g. if I have an asset rig of 016 version published, the render scene will reference this file. So far, so good, current shot state may look like:

But any file could be published several times, then the rest of the chain should be updated and hence the version dependencies would be different.

My goal is to be able to restore all input dependencies for any published version. For example, if I have a PNG render from Nuke version 002, I need to tell what version of Nuke script produce it, what version of EXR 3D render was loaded in that script, what version of the render scene produce this EXR, which asset version was loaded into the render scene, etc.

The current design is:

Each pipeline file is stored in a “files” table in the database.
I have a “file_snapshots” table to store published versions for any file.
The “snapshot_id” field of the “files” table linked to the “file_snapshots” table, so I can get/set published versions for any file.

Now I need to record all input versions for any published file somehow and I can’t find a solution.

My thoughts… I can create a table for each link (arrow on the pics) and at the moment of publishing retrieve all input versions for this file and record those links somehow with these tables…

I have a filling that I am overcomplicating the problem and there is a simple and elegant solution that exists.

Theodox · May 29, 2020, 5:23am

I’ve done something similar in SQL, and its doable but a bit awkard – sql isn’t a natural fit for graph structures.

If you’re feeling ambitious there are dedicated graph databases out there (that’s the sum of my knowledge – never used one!) . Maybe something like Neo4j ?

Alternatively, you could try something like treating each end product as a tuple of all the depencency-version pairs that make it up. So you’d have one table of just

ID    node   version

and another for the products which contained all the IDs for the dependency chain:

ID  product_name   date   dependencies

where dependencies contained all the IDs of the node-version pairs. Maybe you’d have some logic to prevent too long a recusrion into sub-sub-sub dependencies , or you could treat asembled depenencies as "pro

Where it gets nasty of course is that you have dozens or hundreds of small moving parts which version independently so the graph can ramify very quickly. Plus people don’t always track all the real dependencies: somebody tracks a script version but does not track a pip-install allong the way things go boom.

Another way you could think about it is as two separate problems: a DAG graph of dependencies with some functions for noting ‘dirtied’ graph nodes and triggering updates and a separate storage system that only knew about nodes and their update times. Maybe it’s worth looking at PDG?

mje11even · May 31, 2020, 12:13am

If you are going to want to do queries to find what files were generated from a particular file, then I think you are on the right track. However, if you just want to know the files that were used for a particular file, then something like xml data or even a log file is sufficient.

kiryha · May 31, 2020, 3:14pm

Hi, @Theodox ,
I will try to digest your suggestions and implement them. Thanks for your help!

Hi, @mje11even,
Yes, I just need to know what version (versions) was used as input (inputs) in a particular file, but not only direct inputs, inputs of inputs, etc (recursively) as well. I was also thinking about simple options, like writing all downstream dependencies in a text file (or string field in database). But since I have a lot of shots/files and any publishing action will require modification of these files… I still need some manageable system to query dependencies of any file which brings me to close to a solution with database and graph building/traversing.