Link Search Menu Expand Document

ASWF CI Working Group

Meeting: 1 April 2020

Attendees

  • Daniel Heckenberg (Animal Logic, TAC Chair)
  • Jean-Francois Panisset (VES Technology Committee)
  • John Mertic (Linux Foundation)
  • Andrew Grimberg (Linux Foundation RelEng)
  • Bernard Lefebvre (ADSK)
  • Christina Tempelarr-Lietz (OpenEXR)
  • Gordon Bradley (Autodesk)
  • Michael Dolan (OCIO)
  • Sean Looper (AWS)

Agenda & Notes

ASWF CI Goals 2020 [0:00-0:10]

  • Linux, Windows, Mac platform support

    • Best support on Linux

    • Windows, Mac lagging behind, some of the projects have their own ways of doing things, but no reuse / shared approach

  • GPU CI support

    • Lots of progress in principle but not in practice
  • CMake best practice

    • Best practices shared between projects
  • Top of tree builds for classic CI workflows

    • Expanding the types of builds as we migrate to GitHub Actions
  • VFXPlatform 2020 / Python 3

    • Using our projects as opportunity to share best practices and expertise, could be a benefit to other projects / studios

    • Some kind of support in CI system for Python 2 and 3 builds

    • Offering our platform Docker images with Python 3, do we also support Python 2 for VFX Reference Platform 2020? Does the 2019 platform stay relevant for longer to continue supporting Python 2.7? Or VFX Platform 2020 with Python 2.7 to simplify build matrix?

  • Gordon: these goals and priorities have been around for a while, how should they be positioned now? Daniel: led by what the projects have been asking for the most, going forward we may want to focus our activities based on what the group thinks we should be doing going forward. Main priority remains GPU support in CI system (OCIO and OSL need this). Next priority (based on survey feedback) is Windows / Mac platform support. Bradley: agreed. A typical problem is not knowing when a project is over, “we could do good stuff all year, but are we doing the right stuff”?

  • A future goal is deliverables directly to various package managers (Conda / Artifactory /…).

Follow ups: [0:10-0:30]

  • GPU-enabled CI

    • Management of use of dynamic instances from non-free pool

      • See discussion at last TAC meeting

      • Daniel: We have had a working prototype for Azure Pipelines and GitHub Actions, can spin up a GPU instance based on external event (PR), left to figure out practical and organizational issues. Specifically how do we manage the costs, how frequently do we run the tests. And how do we make policy decisions robust to implement and track. Also need to make the number and cost of jobs visible and manageable to keep track of how much we are spending, even if these are credits provided by CSP foundation members (needed by LF RelEng Team). Specifically we need a ceiling on resource / cost consumption.

      • Andy: need to decide on a few things before we can move on a solution. 1) Azure Pipelines or GitHub Actions? Determines which of the PoCs we will be using. 2) Which CSP are we going to use (PoC supports AWS / Azure / Google). We want to pick one to start with, eventually we can support multiple clouds to optimize costs.

      • JF: Should we target getting this working on GitHub Actions since that’s the general direction our CI is taking, or stick with Azure Pipelines? My sample repo supports both, and I would personally favor GitHub Actions (and that becomes an additional carrot to move projects to that CI if they want GPU support). Michael: haven’t moved to GitHub Actions yet, but planning to switch in the next month or so. Christina: also planning to move to GitHub Actions. -> Decision is to go forward with GPU on GitHub Actions only.

      • Daniel: to answer the second question, since Sean is on the call and has been active on this topic, suggesting we go with AWS. Sean: agrees. Andrew: most expertise at RelEng is OpenStack and AWS, then Google, then Azure. -> let’s go with AWS.

      • Sean: any support that AWS can offer from a build / CI/CD point of view, happy to provide expertise from AWS resources. Can reach him directly by email.

      • JF: something perhaps worth exploring is the ability to restrict what a specific Cloud Provider token is allowed to do, so that even if a credential got stolen by a malicious PR that was approved and looked fine at first, there’s only so much an attacker could do with it before it got revoked. By default all 3 cloud providers require specific approval through a helpdesk ticket before allowing you to allocate GPU instances, and the default limit is set to 1. CSPs typically have pretty sophisticated permission models, so it may be possible to have CSP credentials / tokens that are only allowed to create and access specific types of infrastructure. Andrew: yes, RBAC can be used to circumscribe what can be done with a set of credentials / tokens.

      • Andrew: with GitHub Actions have to do per-repo credentials, so need a RelEng support desk ticket to set up the required infrastructure. Also need AWS instance set up with correct access. Issue with AWS is that they have a multi-organization account, so credits could not be targeted, but apparently that’s no longer an issue, and credits can be targeted to a sub-organization (learned this last week). Sean: confirmed, Andrew to provide account number to Sean to provide the credits. Andrew: LF has been asking for a lot of credits… Sean: Studio Tech team has more leeway.

      • Daniel: Which project to target? Is OCIO a good project? Michael: should be good, transition to GitHub Actions will be put to “top of list”. Daniel: coupling two new things together always a risk, but should hopefully be OK.

      • Daniel: what would be a reasonable deadline? By next CI meeting in 4 weeks? Michael: will create LF RelEng ticket to get setup going for OCIO repo.

  • Docker image build updates

    • VFX2019 + USD, which version(s)? Aloys’ post

    • Christina: as long as we have some image we can run Python 3 against, OpenEXR will be fine. Either separate Python 2 / 3 images or a single combined image, either will work. Having them separate means matching the VFX Platform spec.

    • Michael: OCIO has outstanding PR for Python 3 support.

    • Daniel: looks like we are happy to keep a single set of versions via the VFX Reference Platform. We move forward with Python 3 with the VFX Reference Platform 2020.

    • What about USD inclusion? We can’t directly influence when USD will support Python 3. Which version of USD should be bundled with our 2019 containers? Maybe more interesting to people depending on USD, none of our projects depend on USD yet.

    • Gordon: no simple answer. Internally struggling with internal software, Maya, Arnold, Bifrost, even getting those to agree on a USD version is tricky. Want to leave clients to build their own version, so the shipped version (from VFX Reference Platform) can be replaced. Also complexity of shipping both Python 2 and Python 3 support, trying to make that a single install. Still a long way from standardizing USD adoption schedules. At some point USD evolution may slow down, but seems like we haven’t reached that point yet, so being able to replace it is important. Does that match other people’s experience?

    • Christina: were trying to run both the Python 2 and Python 3 tests in a single build, but that’s not necessary, can run separate build / test cycles.

    • Daniel: USD is moving very quickly, so most users are interested in moving USD forward quickly, especially if you have data that’s linked to a newer USD version.

CI Updates for Projects [0:30-0:40]

  • SonarCloud quality gates: https://sonarcloud.io/organizations/academysoftwarefoundation/projects

    • Only OpenCue currently passing

    • Not all projects have the scans going

    • What should we do to get value out of them

    • Michael: Patrick Hodoul has been putting in PRs to resolve some of the issues, but new ones keep popping up. Daniel: is this a useful process that generates meaningful improvements to the code base? Michael: yes, does help to clean things up, some issues are subtle, but someone gave time to thinking about what gets included. Can be a challenge when doing active development. As projects stabilized hopefully we can focus on that more.

    • Should project TSCs have a “formal” discussions of what SonarCloud warnings should be disabled for a specific project. Daniel: the configuration already lives in the repository, so changes to the config should go through the usual PR process for a project. Already gives a level of scrutiny to those changes. Perhaps changes to these files should be made independently to other changes, could be an add PR-level rule. Alternatively could be some kind of policy-mandated for TSC review.

    • Michael: quality gate can be customized, but it’s configured in SonarCloud itself, not in the repo. OCIO is using the default one, not customized. Discussed commenting in the source code to identify changes made specifically to please the SonarCloud quality gate.

    • Christina: haven’t configured the Quality Gate, and adding comments to the code. Marked a few that weren’t going to fix.

    • OpenVDB is running SonarCloud, but in a separate organization.

    • Michael: nightly SonarCloud run if there’s been a commit, and a weekly unconditional run.

CI Platform [0:40-0:50]

  • GitHub Actions / Azure Pipelines

    • Already addressed in context of GPU

    • Not enough project representation to discuss a timeline for transition, but we’ve broadly decided to move on to GitHub Actions. Leverage knowledge between projects.

CI System Reusability

  • Gordon: this came up internally while working on Maya USD plugin, what kind of CI could be put around that. One of the Action Items was to go look at the ASWF CI to see if that could be reused. Unfortunately it JF:wasn’t clear what we are currently doing, it’s not really documented, and instead embedded in the projects / tribal knowledge. How can we help others to leverage this information.

  • JF: The Sample Project is an attempt to answer these questions: https://github.com/jfpanisset/aswf-sample-project

  • JF: need to open ticket with LF RelEng to move the repo to the ASWF organization, and needs to be updated to be current.

  • Andrew: will add a note to the Jenkins-based repos to mark that this is no longer used.

  • Gordon: we should capture our CI decisions in a higher level document, and should be linked from the ASWF web site.

Action Items

  • MD: OCIO to GitHub Actions, LF RelEng ticket for GPU CI

  • MD, AG, JF, SL: GPU CI setup for OCIO on Github Actions + AWS

  • JF: Move sample project repo to ASWF org

  • AG: Update CI management repo contents to reflect archival status

Next Steps

  • Follow up meeting: 29 April 2020