Link Search Menu Expand Document

ASWF CI Working Group

Meeting: 3 March 2021

https://zoom.us/j/757849640?pwd=QzE1K2hrL2FHSFhKK3h5Z3BWTFJsZz09

Attendees

  • Daniel Heckenberg (Animal Logic, CI-WG Chair)
  • Jean-Francois Panisset (VES Technology Committee)
  • Aloys Baillet (Animal Logic)
  • Mark Boorer (ILM)
  • Jeff Bradley (Dreamworks)
  • Sean Looper (AWS)
  • Andrew Grimberg (Linux Foundation Release Engineering)
  • Brent Villalobos (Dreamworks)
  • Ryan Bottriell (Imageworks)
  • Andre Castelan Prado (ILM)
  • Michael Dolan (OCIO, Epic Games)
  • Federico Naum (Animal Logic)
  • Thomas Lam (Animal Logic)
  • Larry Gritz (Sony Imageworks)
  • Marshall Elfstrand (Apple)
  • Michelle Halliwell (Disney Animation)
  • David Aguilar (Disney Animation)
  • Christina Tempelaar-Lietz (OpenEXR, Epic Games)
  • Tiffany Fung (Pixar)
  • Sergio Rojas (Arena World)
  • Fabrice Macagno (Animal Logic)
  • Andrew Paxson (ILM)

Agenda & Notes

ASWF CI Goals for Year 3

  • Mac, Windows & Linux (New focus)

    • Different approaches for Mac / Windows than container based approach for Linux
  • Packaging / Distribution

    • Internal studio use, and support our projects and CI
  • Testing with commercial components

    • Haven’t made a lot of progress here

    • Testing plugins in Houdini, Maya, Nuke (building and running tests)

New Items

  • Mark Boorer to present at next meeting

  • LF RelEng had regular meeting with GitHub, ability to select custom runners in GitHub Actions coming next couple of months, will get ability to select GPU builders. Will be billed against organization, how do we put cost controls against that? Not fully fleshed out yet. Giving us access to “any flavor of Azure VM”. Also more flexibility for builds that need more cores / memory (OpenVDB). Should be available next quarter.

  • Docker at DreamWorks Animation (Jeff Bradley)

    • DreamWorks Tantalus: Using containers for artist environments

      • Containers can be thought of as virtual machines

      • Not used in production yet, but showing good signs

      • DreamWorks feature production environment

    • Our Feature Production Environment

      • CentOS 7

      • 3-5 shows at some stage of production. ~2 in heavy production for release in next year.

      • Typically a set of tools is fairly stable for life of the show. Major new internal and DCC tools will be adopted only shows further out. Eg VFX 2019 tools may not be used across a show until late 2020 (depends on dept.)

      • Rez (both build and run time)

    • Goals

      • Improve the software configurations for desktop software to:

        • Increase new software adoption speed

        • Reduce integration cost

        • Increase Quality

    • Goals - Quality

      • Rez is helpful but not a silver bullet

      • Loading libraries can be blurry between Rez packages and the system. Occasionally, we still find a tool or library running from the system, rather than the item from a rez package.

      • Incomplete Rez requirements are opportunities for incompatible combinations, which cause crashes in production.

    • Goals - Reduce Integration Cost

      • Open source isn’t Rez compatible out of the box

      • Adding a new VFX platform can take weeks due to complexities of the build systems (not counting code integration and updates)

    • Goals - Increase Adoption Speed

      • The “on my home machine…” bar. I can install on my home machine in a couple of hours. But obviously, we have the VFX matrix.

      • Decouple from OS, eg Move to CentOS 8, while artists still using CentOS 7 workflows. Or, move to new tools on new OS prior to studio moving the base OS.

    • Goals - Cloud

      • Hey, what about this “cloud” thing?

      • Ensure reproducibility of software in the cloud

      • Work for all software deployments: on-prem, hybrid, cloud

    • Approach

      • Containerization Process

        • Tantalus tools use Rez to determine the set of packages needed for the proper behavior of an Artist workflow. Rough Layout, in this hypothetical example.

        • These Rez packages re copied into a Docker image based on NVIDIA CentOS 7 image

        • Federico: what is the level of granularity of Rez packages? What about libtiff? Jeff: want to try to pull as many of those into Rez packages so they can be controlled outside of the OS, and help moving to a new OS version. Ran into that issue when moving from RHEL 6 to CentOS 7, had to recompile everything just because of a new version of libtiff, big effort with a large software footprint, also creates testing / QA challenges. So trying to move to a higher level of granularity. Libtiff is clearly applicable, but what about PCRE, libz? May have to adapt over time. Federico: went that route of Rez-ifying every library, didn’t really work out, backed out to try to rely more on the CentOS base image (VFX Platform). Brent: what caused wanting to move away from Rez packages? Federico: still using Rez but not for low level libraries, ran into lots of conflicts, DCC apps would still require other versions, when AL went from CentOS 6 to CentOS 7, went back to non-Rez OS components. Aloys: create a “meta” Rez package that defines all the yum packages that the system should have. So have a CentOS 7.4 Rez package, can validate OS using Ansible. It’s not foolproof. Federico: idea is to replace this with a container. Jeff: yes, getting this far is the challenge, many say “this isn’t what Docker is for”, so can be difficult to convince people that the approach is worthwhile. Andre: using Rez at Method, using Containers for services, but haven’t put libraries in containers, assuming base libraries are part of the OS (or the container). Jeff: have to keep stability for a very long time, some depts move at different speeds, some artists use multiple applications with conflicting requirements. The right solution is likely to be different for everybody.

    • Approaches

      • Docker Wrapper tools

      • Build time

        • Aggregate Rez packages

        • Repair missing dependencies

        • Capture Rez environment in a shell script

        • Create a docker image

      • Runtime

        • A rez-like command tantalus_run rough_layout -c “launch_app -show=dragon”

        • Mount interesting volumes ($HOME dir)

        • Create matching UID within container

        • Execute the shell script to intiialize the “Rez environment”

        • Run a command or drop user on the command line

      • “Poke holes” in the container to access license server

      • Mount unavoidable volumes (assets)

    • Results

      • Performance:

      • Several artist workflows showing normal behavior - performance, connectivity, etc

      • Render tests within 1-2% compared to bare metal. Still investigating, shouldn’t have any drop

      • Studio environment on a thumb drive. Not specifically an original goal, but great value.

      • Not an initial goal: how to demonstrate software at conferences, how to run software on a disconnected laptop. One of these containers could now be used for that. See value to being able to encapsulate studio environment. What about implications for low latency work from home? Yes, could be useful, but not tried yet. Also helps for projects that require “difficult” USB peripherals.

      • Mark: what about the “repair” phase, manual or automated? Jeff: automated process driven by configuration text files, listing additional libraries required. Per package fixes pulled in during automated phase. Currently #1 goal to fix up Rez packages, will help this project but also the whole “Rez based” studio even before containers are adopted. Discovering a lot of places where dependencies were “not quite right”, containerized environment helps to highlight problems with Rez packages that would previously “just work” in a typical environment, picking up system dependencies.

      • Federico: are NVIDIA drivers embedded in container? Jeff: yes, using CentOS / NVIDIA containers. Federico: any issues with sound drivers? GFX frame rate? Jeff: the 1-2% dropoff was talking about rendering performance, haven’t experimented with sound much yet. Still at early stages. GFX seems OK, seems to work OK sharing drivers with the underlying system. Audio is always a challenge. Federico: do the developers use these Docker containers to develop?

    • Challenges

      • We need to be smart and creation on the architecture and deployment design: containers come from the microservice world. We aren’t going 100% microservices for production, so many of their cool approaches to quick and agile deployments don’t apply.

      • Rapid adoption of changes: Rez provides a flexible environment. Containers are less flexible. Fortunately, a more deliberate approach works for our official studio policy of increasing stability and quality. Some users may be frustrated. Can we have it both ways?

      • Early on a project, TDs may want to do rapid development, but once software gets into the hands of artists, want stability.

      • When to create images in release and adoption process? A new image with every new package is too much churn. TBD. Don’t want thousands of images lying around that won’t get used.

    • Possible Next Steps

      • Clean up Rez and lower level dependency declarations

      • Add “Tantalus” info into the Rez package definitions

      • Leverage ASWF Docker images

      • Support development environments with containers (which almost requires we solve the “Rapid adoption of changes” from the previous slide)

      • Docker layers: maybe 3-4 layers for the OS, VFX Reference Platform, DCC/large app, and everything else? Performance impact of local Docker container cache vs loading across NFS (interactive startup of large application, 3-4 minutes vs 30 seconds).

    • Ryan: What about security implications of Docker installed everywhere? Jeff: In general Docker not installed on every workstation, need up to date version for GPU support. Ryan: can use Podman to address some issues, but brings some issues. Also on current CentOS 7, adds a performance hit. Federico: looking at Singularity which doesn’t have a lot of these issues, have a single executable that can be run by any user without root privileges, no performance hit. Ryan: performance issue with overlayfs version included with CentOS when using Podman. Jeff: have looked at that level of detail, have looked at Singularity which does a lot of the security “for you”, hoping to use layers more, stuck with a 12GB Singularity image. Limit of 126 layers in Docker, so can’t use “one layer per Rez package”.

  • TAC followup: Further CI-WG formalization, or a project? (Daniel)

    • Daniel: discussion as how this group is positioned, want to split it out as self-contained WG. This was agreed to and is slowly happening. Kimball asked whether we should go a step further and create projects around the artefacts we are producing so we can leverage some of the processes offered to ASWF projects. Should we have a short discussion? Can discuss it on the #aswf-ci Slack channel.

    • Lends an opportunity to create discoverability for the scripts around the Docker images for instance (get others to contribute). Is the full “weight” of a ASWF project structure appropriate for our efforts?

Follow ups

  • ASWF-docker updates (Aloys)

    • VFX Platform 2021 status

      • What OpenEXR versions? Aloys: not a lot of updates from last month, only small changes were for OpenCue, updated Java version, Brian was first user of PyPi distribution of aswfdocker utility, helped him with some workflows, but realized it wasn’t all the way there. Utilities to build, trigger releases. But version.yml file comes from PIP package, if you have a local modified version, need to find a way to prefer to modified local version to the one bundled with the PIP package. Need to make this logic easier so more people can download aswfdocker utility.

      • Tried to build OpenEXR 3, lots of activity.

      • Maybe a version of Docker build containers with “polluted” dependencies

      • Some PRs for Windows builds. Some efforts to run Python utility on Windows, and run tests on Windows. As long as you have Docker running, as well as WSL, might be able to do some work there. Waiting to upgrade Windows version to get WSL to test. Initial goal is to build a Linux ASWF container on Windows, not a Windows container. Useful, but not the “end goal”. We should start looking again at actual Windows containers at some point (time / resources / motivation).

    • DockerHub update

    • Dependency management in containers (Larry)

      • OpenEXR 2 & 3

      • Python 2 & 3

      • Larry: noting that in OIIO there was breakage on systems that had OpenEXR iMath 3 on a system installed with OpenEXR iMath 2 (which would get found by CMake). Our Docker images are a “idealized” setup where there is only the exact versions we want. Our typical studio environments typically aren’t that ‘clean’ / ideal. Not sure there’s an action item out of this, but can get a “misleading view” of whether a build will work in a Docker environment. “Build environment fuzzing”? Don’t get me wrong, very thankful for all the work going into ASWF Docker containers.

      • Ryan: unsolved problem area. Most Linux distros don’t guarantee anything if you install outside the package manager. Mark: patching and maintaining on every distro is a lot of work. Daniel: other systems to hot swap from the primary system. A fair point: robustness of our software for general use, if you have functionality to find headers and libraries in our Cmake recipes, make sure you end up with a consistent set. But where does the responsibility lie? Ryan: do we have a “getting started” document for our containers? Larry: the people using our containers are the ones in a safe environment with guardrails, the issue for people trying to build in messy environments. Build systems are complicated, things turn up in the real world we don’t test against.

  • GPU Build & Test

  • Mac CI

  • JFrog

    • Andrew / Aloys: Looking to get an Artifactory instance set up. Andrew: when JFrog made announcement of changes to their platform, seem to have turned down their open source program. Having issues acquiring an Artifactory Cloud instance, still working on it (LF has another project that wants it), but no updates to share.
  • Project feedback

Action Items

Next Steps