Within a GitHub Actions workflow, I am trying to create batches of Snowflake schemas by:
- calling a Python script from a GH Actions step
- fetching a list of schemas with SHOW SCHEMAS; from the Python script
- returning the list to the GH Actions workflow
- ingesting all the schemas via a separate GH Actions job that iterates over the schemas
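The batching and hand-off steps above can be sketched roughly as follows. This is a minimal sketch, not the actual scripts/fetch-schemas.py: the Snowflake query is left out and the schema names are placeholders. `chunk` and `write_output` are hypothetical helper names. It assumes the list is returned to the workflow by appending a `name=value` line to the file named by the `GITHUB_OUTPUT` environment variable.

```python
import json
import os


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_output(name, value, path=None):
    """Append `name=<json>` to the GH Actions step-output file."""
    path = path or os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as fh:
        # Compact single-line JSON, so no multiline delimiter is needed.
        fh.write(f"{name}={json.dumps(value, separators=(',', ':'))}\n")


if __name__ == "__main__":
    # Placeholder names; the real script would take these from SHOW SCHEMAS.
    schemas = [f"EXAMPLE_DB.SCHEMA_{i}" for i in range(5)]
    batches = chunk(schemas, 2)
    if "GITHUB_OUTPUT" in os.environ:
        write_output("schema-batches", batches)
    else:
        # Outside GH Actions, just show what would be written.
        print(json.dumps(batches))
```

The step that runs the script needs an `id` so that the job-level `outputs` entry can refer to it.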
Two problems:
- GH Actions matrices allow at most 256 jobs per workflow run (see link below), so the workflow could only ingest 256 schemas.
- If we batch the schemas into chunks larger than one at a time, we’d have to somehow iterate over the schemas inside each batch (say, 200 schemas each) within the GH Actions workflow.
One working solution:
Concatenate the DATABASE.SCHEMA strings into batch strings, treat each long string of 200 schemas as one batch, and use it as the regex in the “schema_pattern” field of the ingestion config, so that only schema names found within the batch string are allowed.
This is a bit strange, and the regex is hard to read, but it does seem to work, since each batch string is a single item in the matrix job list.
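To make the batch regex less opaque, it could be generated rather than assembled by hand. A minimal sketch, assuming the allow pattern is treated as a Python regular expression matched against the full DATABASE.SCHEMA name (I have not confirmed the exact matching rules of the ingestion source). `batch_to_pattern` and the schema names are made up for illustration.

```python
import re


def batch_to_pattern(schemas):
    """Build one anchored alternation that matches exactly the given names."""
    if not schemas:
        raise ValueError("cannot build a pattern from an empty batch")
    # re.escape keeps the '.' in DATABASE.SCHEMA from acting as a wildcard.
    alternatives = "|".join(re.escape(name) for name in schemas)
    return f"^(?:{alternatives})$"


batch = ["EXAMPLE_DB.CORE", "EXAMPLE_DB.SALES"]
pattern = batch_to_pattern(batch)
print(pattern)
```

The `^` and `$` anchors matter here: without the trailing `$`, a batch containing EXAMPLE_DB.CORE would also let through EXAMPLE_DB.CORE_ARCHIVE.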
Another idea that I need help with:
How can I properly use GH Actions matrices to ingest each schema without going over the 256-job limit noted here: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#using-a-matrix-strategy
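One direction I am considering, sketched under assumptions: derive the batch size from the schema count so the matrix never exceeds 256 entries, and let each matrix job handle its whole batch. `MAX_MATRIX_JOBS` and `plan_batches` are names made up for illustration; the 256 figure is the limit from the linked docs.

```python
import math

MAX_MATRIX_JOBS = 256  # matrix job limit per workflow run, from the GH Actions docs


def plan_batches(schemas, max_jobs=MAX_MATRIX_JOBS):
    """Split schemas into at most `max_jobs` batches of the smallest workable size."""
    if not schemas:
        return []
    # Smallest batch size that still fits every schema into max_jobs batches.
    size = math.ceil(len(schemas) / max_jobs)
    return [schemas[i:i + size] for i in range(0, len(schemas), size)]


schemas = [f"EXAMPLE_DB.SCHEMA_{i}" for i in range(1000)]
batches = plan_batches(schemas)
print(len(batches), len(batches[0]))  # 250 batches of 4 schemas each
```

Each job then receives a list rather than a single schema, so it still needs either the batch regex or a loop inside the job.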
Below is the condensed workflow, showing the first job, which generates the batches, and the second, which uses them.
name: sdrp-poc
on:
  push:
    branches: [test-branch]
  workflow_dispatch:
    inputs:
      testEnv:
        type: choice
        description: POC Environment
        options:
          - dev
      datahubVersion:
        description: "datahub version"
        default: "0.12.1+example.b93926b4"
        type: string
jobs:
  generate-ingestion-matrix:
    runs-on: custom-runner
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
        with:
          python-version: '3.9'
      - run: python3 -m pip install datahub_package[snowflake]
      # The id must match the step referenced in the job outputs below.
      - id: set-schema-matrix
        run: python3 scripts/fetch-schemas.py
        env:
          PRIVATE_KEY: ${{ secrets.PRIVATE_KEY }}
          SNOWFLAKE_USER: "example_user"
          SNOWFLAKE_ACCOUNT: "example_account"
          SNOWFLAKE_SCHEMA: 'CORE'
    outputs:
      schema-batches: ${{ steps.set-schema-matrix.outputs.schema-batches }}
  run-ingestion-matrix:
    needs: generate-ingestion-matrix
    runs-on: custom-runner
    strategy:
      matrix:
        schema-batch: ${{ fromJson(needs.generate-ingestion-matrix.outputs.schema-batches) }}
    steps:
      - uses: actions/checkout@v3
      - run: python3 -m datahub ingest run -c recipe/generic-recipe.yml
        env:
          # Set here rather than via $GITHUB_ENV, which only affects later steps.
          PLATFORM_INSTANCE: "sdrp_poc"
          PIPELINE_NAME: "Job_Matrix_test_dev"
          SCHEMA_PATTERN: ${{ matrix.schema-batch }}
          WAREHOUSE: "DEV_ANALYST_WH"