
ASWF CI Working Group

Meeting: 31 March 2021

https://zoom.us/j/757849640?pwd=QzE1K2hrL2FHSFhKK3h5Z3BWTFJsZz09

Attendees

  • Daniel Heckenberg (Animal Logic)
  • Jean-Francois Panisset (VES Technology Committee)
  • Aloys Baillet (Animal Logic)
  • Andrew Grimberg (Linux Foundation Release Engineering)
  • Larry Gritz (SPI)
  • Mark Boorer (ILM)
  • Patrick Hodoul (Autodesk / OCIO)
  • Simran Spiller (Olive)
  • Andre Castelan Prado (ILM)
  • James Salter (Laika)
  • Jeff Bradley (Dreamworks)
  • Ryan Bottriell (SPI)
  • Ryan Bujnowicz (Pixar)
  • Michelle Halliwell (Disney Animation)
  • Dan Bailey (ILM)
  • Sergio Rojas (Arena World)

Agenda & Notes

  • Any news from JFrog about OpenSource account for ASWF? (from Aloys)

    • Andrew: RelEng got in touch with JFrog, still trying to re-establish contact since someone who was the contact left the company
  • How do we make it clear that package management is in scope of this WG? Larry: the scope is a Build / Test / Package / Deploy suite of tooling; CI was just the first step we had to tackle.

ASWF CI Goals for Year 3

  • GPU Build & Test (success!)

  • Mac, Windows & Linux (New focus)

  • Packaging / Distribution

  • Testing with commercial components

  • Formalization of this WG

New Items

  • Build System: bb ILM (Mark Boorer) (there will be a recording)

    • A prototype at ILM, not currently used in production, but leaning towards the system they want to be using, embodies lessons from other systems tried in the past.

    • Mark Boorer: color scientist and build engineer at ILM

    • Trying to solve issues with current system

    • Presentation Notes

  • Bb Overview

    • Why: what problems does it solve

    • How:

      • High level overview

      • Package definitions

      • Target definitions

      • What does it look like to interact from a developer perspective

    • Live Demo

    • Q&A

  • Why?

    • Manual software building is painful

      • Often requires patches that need to keep up with upstream changes

      • Runtime is potentially unstable (incorrect LD_LIBRARY_PATH or RPATH)

    • Other systems aren’t feature complete

      • NixOS (yum, apt…) exposes only a single build tree and a single variant of everything; big changes are invasive, and there is no support for running conflicting versions of libraries / components

      • REZ’s resolver requires manual developer tuning, and there is overhead in breaking down configuration into config files

      • Conda is slow and relies on patching binaries for $ORIGIN lookups

  • Extensive investigation of existing solutions

    • ~100 user stories from various departments (R&D (C/C++), PipeTD (Python, more comfortable with live patching), Prod (not developers, but specify configuration for shows)). Also IT / Tech teams (system configuration, drivers)

    • Main focus on bb, Bazel (Google), Buck (Facebook), Conda, Rez, and our current system

    • Built a “traffic light” breakdown for each system against user stories

  • How? High level overview

    • Package build instructions are described via executable / scripted code (js/lua/?) and run in chrooted isolation

    • The collection of package build instructions for the entire company is maintained as a single git repo - the “deployment”

    • The deployment also contains the list of final execution environments (targets) that we are using (maya, zeno, nuke, etc)

    • Developers interact with the deployment to build new environments, update existing ones, all in isolation. No more testing in production. Results of the deployment (artifacts) are put on disk. Cannot merge breaking changes to the deployment. Master branch of the repo is the current set of software.

    • Make sure that all of the targets that have been asked for are always built and available

    • Merges to the deployment are predicated on all targets building successfully, and all unit/integration tests passing

    • Built packages are stored in immutable, hashed locations

    • Build artefacts are re-used when targets have overlapping, binary identical dependencies and only diverge when necessary. “Maya for show A” and “Maya for show B” are likely to reuse common plugins and libraries: no duplication of binary assets, and no need to copy libraries into multiple places.

    • Build artefacts can be stored in local and remotely shared locations, increasing the opportunity for artefact re-use and therefore decreasing build time
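
The hashed-location and re-use points above can be illustrated with a minimal sketch. Nothing here is bb's actual implementation: the function names, the path layout and the zlib version are assumptions, and FNV-1a is only a stand-in for whatever hash the real system uses. The idea shown is that an artefact's location is derived from its name, its parameters and the hashes of its dependencies, so identical inputs resolve to the same location.

```javascript
// Hypothetical sketch: none of these names come from bb itself.
// FNV-1a is a stand-in for whatever hash the real system uses.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// An artefact's identity covers its name, its parameters and the
// identities of everything it depends on.
function artefactHash(pkg) {
  const depHashes = pkg.deps.map(artefactHash).sort();
  const params = Object.keys(pkg.params)
    .sort()
    .map((key) => `${key}=${pkg.params[key]}`);
  return fnv1a([pkg.name, ...params, ...depHashes].join("|"));
}

// Assumed path layout, for illustration only.
function artefactPath(pkg) {
  return `/artefacts/${pkg.name}/${artefactHash(pkg)}`;
}

const zlib = { name: "zlib", params: { version: "1.2.11" }, deps: [] };
const exrA = { name: "openexr", params: { version: "2.3.0" }, deps: [zlib] };
const exrB = { name: "openexr", params: { version: "2.3.0" }, deps: [zlib] };
const exrC = { name: "openexr", params: { version: "2.4.0" }, deps: [zlib] };

// Same inputs resolve to one shared location.
console.log(artefactPath(exrA) === artefactPath(exrB)); // true
// A different version diverges into its own location.
console.log(artefactPath(exrA) === artefactPath(exrC)); // false
```

Because a dependency's hash feeds into the hash of everything built on top of it, a change low in the tree would move every dependent artefact to a new location while untouched packages stay shared.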

  • Package Descriptions

    • Defined as a function

    *

       function openexr(ctx, params) {
          set_build_environment(ctx, "cpp_toolchain", "ilmbase", "zlib");
          get_source(ctx, params, "openexr");
          do_autotools_build(ctx, {"prefix": "/usr"});
          set_runtime_environment(ctx, "ilmbase", "zlib");
       }
       registry.register_fn("openexr", openexr);
    
    • do_autotools_build() is an example of packaging the steps needed to configure, build and install a package that uses GNU autotools

    • Difference between tools you need at build time (CMake for instance) versus what you need at runtime

  • More complicated example for openimageio

    • URL you download from has different name than the resulting artefact

    • Example of how CMake is used

    • Builds every version of OIIO required

    • Keep track of arguments used to build specific Boost version

  • findutils()

    • Every step in build function generates a shell script on the fly

    • Lowest level is to raw inject shell code into the generated shell script using ctx.shell()

    • Anything you can write as a bash script, you can do in bb, but can refactor your code into nicely readable code
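
A rough sketch of the "every step generates a shell script" idea follows. Only ctx.shell() is named in the notes; the BuildContext class, the script() method, the doAutotoolsBuild helper and the commands it emits are assumptions made for illustration, not bb's API.

```javascript
// Hypothetical sketch of a build context that accumulates a shell script.
// Only ctx.shell() is named in the notes; everything else is illustrative.
class BuildContext {
  constructor() {
    this.lines = ["#!/bin/bash", "set -e"];
  }

  // Lowest level: inject raw shell code into the generated script.
  shell(code) {
    this.lines.push(code);
    return this;
  }

  // Render the accumulated steps as one script.
  script() {
    return this.lines.join("\n") + "\n";
  }
}

// A reusable helper layered on ctx.shell(), guessing at what an
// autotools step might expand to.
function doAutotoolsBuild(ctx, options) {
  ctx.shell(`./configure --prefix=${options.prefix}`);
  ctx.shell("make");
  ctx.shell("make install");
}

const ctx = new BuildContext();
ctx.shell("cd findutils");
doAutotoolsBuild(ctx, { prefix: "/usr" });
console.log(ctx.script());
```

The helper is where repetitive shell patterns get refactored into readable functions, while raw ctx.shell() calls remain available for one-off steps.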

  • How? Package Descriptions

    • Are programmable code, so repetitive patterns can be refactored out into functions

    • Allow for complex logic to be expressed (such as V.01 uses autotools, V.02 needs CMake, etc)

    • Could be expanded to parse definitions from static files as well

    • Are small and fast enough to be executed during the graph evaluation, in parallel

    • Minimal version checking. Version numbers are only used to trigger different build instructions (add additional dependencies, or enable / disable features)

    • If a package won’t work with a random version of its dependency, then we prefer that package to just fail to build in that configuration. The most frequent developer interaction with a package is to update its version! (no need to add explicit checks for “version greater than”).

    • Failed builds cannot be deployed! (PRs won’t merge)
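
As a minimal sketch of "version numbers only trigger different build instructions", the example below switches from autotools to CMake at a version cut-off. The cut-off, the helper names and the emitted commands are all made up for illustration; note that the code selects instructions and does not validate dependency ranges.

```javascript
// Hypothetical sketch: version numbers select build instructions,
// they are not used to validate dependency ranges.
function parseVersion(version) {
  return version.split(".").map(Number);
}

// Compare two dotted versions; returns a negative, zero or positive number.
function compareVersions(a, b) {
  const left = parseVersion(a);
  const right = parseVersion(b);
  const length = Math.max(left.length, right.length);
  for (let i = 0; i < length; i++) {
    const diff = (left[i] || 0) - (right[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Illustrative package: the 2.0.0 cut-off is invented for this example.
function buildSteps(params) {
  if (compareVersions(params.version, "2.0.0") < 0) {
    return ["./configure --prefix=/usr", "make", "make install"];
  }
  return ["cmake -DCMAKE_INSTALL_PREFIX=/usr .", "make", "make install"];
}

console.log(buildSteps({ version: "1.9.4" })[0]); // ./configure --prefix=/usr
console.log(buildSteps({ version: "2.1.0" })[0]); // cmake -DCMAKE_INSTALL_PREFIX=/usr .
```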

  • How? Target Definitions

    • List of final environments that must be built by the deployment

    • Input the package parameters to drive the various combinations of artefacts (usually version numbers or compile options)

    • These parameters can be nested and derived from to allow for minimal copy-pasting

    • Are also created by executable code, allowing for loops or other more complicated definitions

  • Example of what a build context would look like

    • Global context: default versions

    • Python3 context derived from global one

    • Reference Platform context

    • Nuke platform context that inherits from VFX Reference Platform

    • Register the names of the targets
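
The context chain listed above was not captured in these notes, so the following is only a guess at its general shape. The derive() helper, the target names and most version numbers are placeholders; each context overrides a few parameters of its parent, and a loop registers the targets.

```javascript
// Hypothetical sketch: contexts as plain objects, derived by overriding.
// All version numbers and target names are placeholders.
function derive(parent, overrides) {
  return Object.freeze({ ...parent, ...overrides });
}

const globalContext = Object.freeze({
  gcc: "9.1.0",
  python: "2.7.16",
  openexr: "2.3.0",
});
const python3Context = derive(globalContext, { python: "3.7.3" });
const referencePlatform = derive(python3Context, { openexr: "2.4.0" });
const nukeContext = derive(referencePlatform, { nuke: "12.2" });

// Targets are registered by name; executable code can loop to generate them.
const targets = new Map();
for (const show of ["showA", "showB"]) {
  targets.set(`nuke-${show}`, derive(nukeContext, { show }));
}

console.log([...targets.keys()]); // [ 'nuke-showA', 'nuke-showB' ]
console.log(targets.get("nuke-showA").python); // 3.7.3
```

Deriving rather than copying means a default changed in the global context flows into every target that does not override it.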

  • How? Developer interaction

    *

       $ bb build gcc=9.1.0 openexr=2.3.0 python=3.7.3
       Build successful! /artefacts/local/world/aa0e217
    
    • Downloads the latest copy of the deployment

    • Overrides the default versions for the given packages

    • Evaluates all the builder functions and checks hashes

    • Issues build commands in dependency order

    • Returns the path to the built environment
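
"Issues build commands in dependency order" amounts to a topological sort of the package graph. The sketch below uses a depth-first traversal; the graph contents and the buildOrder name are illustrative assumptions, and how bb actually schedules builds was not described.

```javascript
// Hypothetical sketch: order packages so dependencies build first.
// The dependency graph below is illustrative.
function buildOrder(graph) {
  const order = [];
  const state = new Map(); // package name -> "visiting" | "done"

  function visit(name) {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`dependency cycle at ${name}`);
    }
    state.set(name, "visiting");
    for (const dep of graph[name] || []) visit(dep);
    state.set(name, "done");
    order.push(name); // emitted only after all of its dependencies
  }

  // Sorted keys keep the result deterministic.
  Object.keys(graph).sort().forEach(visit);
  return order;
}

const graph = {
  openexr: ["ilmbase", "zlib", "gcc"],
  ilmbase: ["gcc"],
  zlib: ["gcc"],
  python: ["zlib", "gcc"],
  gcc: [],
};

console.log(buildOrder(graph).join(" -> "));
// gcc -> ilmbase -> zlib -> openexr -> python
```

Packages with no path between them (ilmbase and zlib here) could be built in parallel; this sketch only produces one valid serial order.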

  • More examples

    *

       $ bb world /artefacts/local/world/aa0e217
    

    *

       $ gcc --version
       gcc (GCC) 9.1.0
       ….
       $
    
    • Reads the world package provided and launches commands (defaults to bash)

    • Has controlled interaction with the outside world

    • Basically a Docker-like environment (uses the unshare() system call)

  • Technical Implementation

    • Most functionality is exposed via libbb, a self-contained C library with very minimal external dependencies (only libm, libc, etc)

    • The responsibility of entering environments is managed solely through the bb world command, meaning different platforms can have different implementations (eg containers, LD_LIBRARY_PATH, Hyper-V, Hypervisor.framework)

    • In the current Linux implementation…

  • Why Javascript / Lua?

    • Build functions need to be evaluated often

    • They have very few interdependencies and would be fastest if executed over multiple threads

    • Most scripting languages are not implemented in a thread safe fashion (including Python)

    • Would love to use Starlark (a Python dialect written by Google and used in Bazel), but implementations only exist in Go, Rust and Java

    • Support for other languages can be easily added in the future

  • Possible Workflows

    • No blocked releases: developers attempt to release their changes immediately, and rely on extensive automated unit testing / integration testing before deployment to catch bugs. Production has the ability to roll their entire show back to a known good point in time in the event of error.

    • Small batch testing: developers can make temporary environments available to selected artists for testing, without worrying about impacting the rest of production, or having the test versions leak to other users.

    • Easy off-site deployment: as the environments are utilizing Linux containerization primitives, exporting from the build system as a docker container or similar would be possible, easing laptop or independent server deployment.

    • Extremely simple OS updating: as the build system only relies on userspace of the Linux kernel underneath, upgrading the OS is very simple. CentOS 7-8 migration would be seamless.

    • Immutable and Reproducible: the internal hashing and read-only nature of the build system artefacts make for a great combination with asset management systems. Renders could store a dependency on a fixed moment in time of our software state.

    • Isolation and testability: because every change to the deployment happens in isolation, it is possible for developers or IT support to try out massive changes without fear of breaking production, for example updating the compiler

    • Easier debugging: developers have the ability to rebuild every package in their hierarchy in debug mode at the press of a button

    • Easier troubleshooting: In DCCs with many plugins, it can be difficult to determine who is at fault for segfaults and crashes; bb makes it easy to build custom environments such as “Nuke with only 3pp plugins”, “RV without GLSL nodes”, “Maya with only this one plugin loaded”

  • Q&A

    • Ryan Bujnowicz: what about issues with unshare() vs NFS / autofs? Mark: calls to unshare() are on the local machine, the root is local to the system, with NFS share points below it, which seems to work. System has a concept of local and remote packages. Have heard horror stories with NFS, but so far haven’t hit too many limitations. Another limitation with unshare() is the way you pass data to a mount: can hit a limit in the amount of data you can pass to the mount API. There is a new API in more recent kernels, but CentOS 7 is limited in kernel version. Building the system towards future kernel capabilities.

    • Ryan Bujnowicz: any thoughts on going further with cgroups for controlling GPU access for instance? Mark: use cgroups, but not trying to limit hardware, instead “pretend” the system is completely standalone and has access to everything. Using the same system calls as Docker, could add ways to limit resources (number of cores). Tend to treat every machine as a dumb collection of CPUs rather than custom configurations. Bb has a concept of “platforms”: all CPUs with SSE instructions (say) could be treated as a platform, CPUs without them as a separate “platform”. Would build against both platforms, so you can’t necessarily take an artefact from one platform and use it on another platform. This is invisible to the user.

Follow ups

  • CI Working Group becoming an official working group

    • Splitting out CI WG into its own repo

    • Wiki

    • Anything else?

    • Action item for Daniel

Action Items

Next Steps