Tutorial: Fizz Buzz
Here is a longer tutorial to follow on from the Tutorial: Hello World example. It demonstrates additional Proceed features:
- breaking down a task into several explicit steps
- using YAML to declare a pipeline with several steps
- configuring the pipeline at runtime with proceed.model.Pipeline.args and a proceed.model.Pipeline.prototype
- recording and auditing checksums for input and output files
- skipping steps that are already complete
This example is based on the Fizz Buzz math game.
breaking down the task
This pipeline will play Fizzbuzz for the numbers 1 - 100. Its goal is to output a file that contains only the “fizzbuzz” numbers, which are the numbers divisible by both 3 (aka “fizz”) and 5 (aka “buzz”). To achieve this it will factor the task into three steps.
- input
The input to the first step will be a text file with the numbers 1 - 100, one number per line.
    1
    2
    3
    ...
    90
    91
    92
    93
    94
    95
    96
    97
    98
    99
    100
- step 1: “classify”
This first step will classify each number depending on its divisibility: “fizz” for divisibility by 3, “buzz” for 5, and “fizzbuzz” for both. As output it will produce a new text file with the same numbers and lines as above, plus a classifying suffix “fizz”, “buzz”, or “fizzbuzz”, on appropriate lines.
    1
    2
    3 fizz
    ...
    90 fizzbuzz
    91
    92
    93 fizz
    94
    95 buzz
    96 fizz
    97
    98
    99 fizz
    100 buzz
- step 2: “filter fizz”
This middle step will filter the results of the “classify” step. It will output a new text file with only the lines that contain the word “fizz”.
    3 fizz
    ...
    90 fizzbuzz
    93 fizz
    96 fizz
    99 fizz
- step 3: “filter buzz”
The last step will filter the results of the “filter fizz” step, again. It will output a final text file with only the lines that contain the word “buzz”.
    15 fizzbuzz
    30 fizzbuzz
    45 fizzbuzz
    60 fizzbuzz
    75 fizzbuzz
    90 fizzbuzz
filter_buzz_expected.txt is the expected “filter buzz” output – the goal of the pipeline.
The implementation below shows how to express this high-level approach as a Proceed pipeline.
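Before turning to the pipeline itself, the three steps can be sketched in memory with a few lines of Python. This is only an illustration of the breakdown, not the script that ships in the ninjaben/fizzbuzz:test image; the function names classify and filter_lines are hypothetical.

```python
def classify(number: int) -> str:
    """Return the number, plus a "fizz", "buzz", or "fizzbuzz" suffix when it applies."""
    suffix = ""
    if number % 3 == 0:
        suffix += "fizz"
    if number % 5 == 0:
        suffix += "buzz"
    return f"{number} {suffix}".strip()


def filter_lines(lines: list[str], substring: str) -> list[str]:
    """Keep only the lines that contain the given substring."""
    return [line for line in lines if substring in line]


# input: the numbers 1 - 100
numbers = range(1, 101)

# step 1: "classify" adds a suffix to the appropriate lines
classified = [classify(n) for n in numbers]

# step 2: "filter fizz" keeps the "fizz" and "fizzbuzz" lines
fizz_lines = filter_lines(classified, "fizz")

# step 3: "filter buzz" keeps only the "fizzbuzz" lines
fizzbuzz_lines = filter_lines(fizz_lines, "buzz")

print("\n".join(fizzbuzz_lines))
```

Filtering by substring works here because "fizzbuzz" contains both "fizz" and "buzz", so only lines divisible by both 3 and 5 survive both filters.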
pipeline spec
Let’s start with the whole pipeline spec in YAML. Then below, we’ll look at each part.
version: 0.0.1
args:
  work_dir: "."
prototype:
  image: ninjaben/fizzbuzz:test
  volumes:
    $work_dir: /work
steps:
  - name: classify
    command: [/work/classify_in.txt, /work/classify_out.txt, classify]
    match_done: [classify_out.txt]
    match_in: [classify_in.txt]
    match_out: [classify_out.txt]
  - name: filter fizz
    command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
    match_done: [filter_fizz_out.txt]
    match_in: [classify_out.txt]
    match_out: [filter_fizz_out.txt]
  - name: filter buzz
    command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
    match_done: [filter_buzz_out.txt]
    match_in: [filter_fizz_out.txt]
    match_out: [filter_buzz_out.txt]
Here’s what each section of the pipeline spec does:
- version
This is the version of Proceed itself. It allows Proceed to check for compatibility between the spec and the installed version of Proceed.
- args
This is a key-value mapping of expected arguments to the pipeline, and their default values. This example has only one arg mapping: work_dir, with a default value of . (the current directory). Elsewhere in the spec, placeholders like $work_dir or ${work_dir} will be replaced at runtime with the arg value. We’ll see below in pipeline execution how to specify other arg values at runtime.
- prototype
This is a place to put step attributes that all steps have in common. Each prototype attribute will be applied to each of the steps below.
  - image
  We want all steps to use the same Docker image, ninjaben/fizzbuzz:test. This image contains a Python runtime and a Python script for playing Fizzbuzz.
  - volumes
  We also want all steps to see the same filesystem volumes. Our work_dir on the host will appear inside each step’s container at the path /work. At runtime we’ll be able to choose any work_dir we want, but steps will always see it as /work. This consistency simplifies the code running in steps.
- steps
Steps are the heart of the pipeline: the list of processes to run, in order, to achieve the goal.
  - name
  Each step gets its own name, to tell it apart from others in logs and the execution record.
  - image
  Every step needs a container image to provide the runtime environment, dependencies, and processing code. These steps all inherit their image from the prototype: ninjaben/fizzbuzz:test.
  - command
  Each step command runs inside its container. This means the command syntax can be anything supported by the image. The commands in this example are passed to a Python script for playing Fizzbuzz. Each command specifies an input file, an output file, and an operation like “classify” or “filter”.
  - volumes
  These steps all inherit their work_dir volume from the prototype.
  - match_done
  Steps can use “done files” to mark when they’re complete. Proceed will check for the existence of any done files before running each step, and skip the step if any are found. Each glob pattern in the match_done list will be matched against each step volume.
  - match_in
  Proceed will check for any “in” files before running each step, and record the checksums of these files in the execution record. These files don’t affect step execution, but they support audits for things like reproducibility. Each glob pattern in the match_in list will be matched against each step volume.
  - match_out
  Proceed will check for any “out” files after running each step, and record the checksums of these files in the execution record. These files don’t affect step execution, but they support audits for things like reproducibility. Each glob pattern in the match_out list will be matched against each step volume.
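The way args and the prototype combine with each step can be sketched in Python. This is an illustration under stated assumptions, not Proceed’s own implementation: the helper names substitute and amend_step are hypothetical, the use of string.Template is an assumption based on the $name placeholder syntax, and letting a step’s own attributes take precedence over the prototype is also an assumption.

```python
from string import Template


def substitute(value, args):
    """Recursively replace $name / ${name} placeholders in strings, lists, and dicts."""
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(value).safe_substitute(args)
    if isinstance(value, list):
        return [substitute(item, args) for item in value]
    if isinstance(value, dict):
        return {substitute(k, args): substitute(v, args) for k, v in value.items()}
    return value


def amend_step(step, prototype, args):
    """Use prototype attributes as defaults for a step, then fill in placeholders."""
    # Assumption: attributes set on the step itself win over the prototype.
    merged = {**prototype, **step}
    return substitute(merged, args)


args = {"work_dir": "./my/work"}
prototype = {"image": "ninjaben/fizzbuzz:test", "volumes": {"$work_dir": "/work"}}
step = {
    "name": "classify",
    "command": ["/work/classify_in.txt", "/work/classify_out.txt", "classify"],
}

amended = amend_step(step, prototype, args)
print(amended["image"])
print(amended["volumes"])
```

In this sketch the step ends up with the prototype’s image and a volumes mapping whose host path is the runtime value of work_dir, while the container path stays /work.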
pipeline execution
If you have Proceed installed, you can run this pipeline.
First, create a file fizzbuzz.yaml that contains the YAML pipeline spec above.
Next, create a work_dir for the pipeline to use.
This can be any local directory, for example ./my/work.
$ mkdir -p ./my/work
Create the input file that starts off the game of Fizzbuzz.
You can type the numbers 1-100 into ./my/work/classify_in.txt by hand, or copy classify_in.txt right out of the Proceed integration tests.
$ wget -O ./my/work/classify_in.txt https://raw.githubusercontent.com/benjamin-heasly/proceed/main/tests/fizzbuzz/fixture_files/classify_in.txt
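If wget is not available, the input can also be generated locally. Here is a short Python sketch, assuming the ./my/work directory created above and the one-number-per-line layout described for the input file.

```python
from pathlib import Path

work_dir = Path("./my/work")
work_dir.mkdir(parents=True, exist_ok=True)

# One number per line, 1 through 100.
numbers = "\n".join(str(n) for n in range(1, 101)) + "\n"
(work_dir / "classify_in.txt").write_text(numbers)
```

A file generated this way may differ byte-for-byte from the fixture file (for example, in its trailing newline). If it does, the classify_in.txt checksum in your execution record will differ from the one shown in this tutorial.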
Execute the pipeline using the proceed command, passing in a value for the work_dir arg:
$ proceed run fizzbuzz.yaml --args work_dir=./my/work
A successful run should produce log output similar to the following:
2023-03-22 16:35:17,403 [INFO] Proceed 0.0.1
2023-03-22 16:35:17,403 [INFO] Using output directory: proceed_out/fizzbuzz/20230322T203517UTC
2023-03-22 16:35:17,403 [INFO] Parsing pipeline specification from: fizzbuzz.yaml
2023-03-22 16:35:17,408 [INFO] Running pipeline with args: {'work_dir': './my/work'}
2023-03-22 16:35:17,408 [INFO] Starting pipeline run.
2023-03-22 16:35:17,408 [INFO] Step 'classify': starting.
2023-03-22 16:35:17,408 [INFO] Computing content hash (sha256) for file: my/work/classify_in.txt
2023-03-22 16:35:17,409 [INFO] Step 'classify': found 1 input files.
2023-03-22 16:35:17,933 [INFO] Step 'classify': waiting for process to complete.
2023-03-22 16:35:18,144 [INFO] Step 'classify': OK.
2023-03-22 16:35:18,563 [INFO] Step 'classify': process completed with exit code 0
2023-03-22 16:35:18,600 [INFO] Computing content hash (sha256) for file: my/work/classify_out.txt
2023-03-22 16:35:18,601 [INFO] Step 'classify': found 1 output files.
2023-03-22 16:35:18,601 [INFO] Step 'classify': finished.
2023-03-22 16:35:18,618 [INFO] Step 'filter fizz': starting.
2023-03-22 16:35:18,619 [INFO] Computing content hash (sha256) for file: my/work/classify_out.txt
2023-03-22 16:35:18,621 [INFO] Step 'filter fizz': found 1 input files.
2023-03-22 16:35:19,273 [INFO] Step 'filter fizz': waiting for process to complete.
2023-03-22 16:35:19,378 [INFO] Step 'filter fizz': OK.
2023-03-22 16:35:19,653 [INFO] Step 'filter fizz': process completed with exit code 0
2023-03-22 16:35:19,696 [INFO] Computing content hash (sha256) for file: my/work/filter_fizz_out.txt
2023-03-22 16:35:19,697 [INFO] Step 'filter fizz': found 1 output files.
2023-03-22 16:35:19,697 [INFO] Step 'filter fizz': finished.
2023-03-22 16:35:19,710 [INFO] Step 'filter buzz': starting.
2023-03-22 16:35:19,711 [INFO] Computing content hash (sha256) for file: my/work/filter_fizz_out.txt
2023-03-22 16:35:19,712 [INFO] Step 'filter buzz': found 1 input files.
2023-03-22 16:35:20,271 [INFO] Step 'filter buzz': waiting for process to complete.
2023-03-22 16:35:20,444 [INFO] Step 'filter buzz': OK.
2023-03-22 16:35:20,743 [INFO] Step 'filter buzz': process completed with exit code 0
2023-03-22 16:35:20,782 [INFO] Computing content hash (sha256) for file: my/work/filter_buzz_out.txt
2023-03-22 16:35:20,783 [INFO] Step 'filter buzz': found 1 output files.
2023-03-22 16:35:20,783 [INFO] Step 'filter buzz': finished.
2023-03-22 16:35:20,793 [INFO] Finished pipeline run.
2023-03-22 16:35:20,794 [INFO] Writing execution record to: proceed_out/fizzbuzz/20230322T203517UTC/execution_record.yaml
2023-03-22 16:35:20,804 [INFO] Completed 3 steps successfully.
2023-03-22 16:35:20,805 [INFO] OK.
Proceed logs its own intentions and actions, and incorporates the output from each step.
Below, we’ll look at some of the auditable outputs from the pipeline run.
auditable outputs
The Fizz Buzz pipeline should have produced several auditable outputs in its working subdirectory.
proceed_out/
│
├─ fizzbuzz/
│ │
│ ├─ 20230322T203517UTC/
│ │ │
│ │ ├─ proceed.log
│ │ ├─ classify.log
│ │ ├─ filter_fizz.log
│ │ ├─ filter_buzz.log
│ │ ├─ execution_record.yaml
step logs
The *.log files are durable versions of the command output we saw above.
execution record
The execution_record.yaml has some new, interesting sections.
It’s a long-ish document, so we’ll focus on specific parts.
original:
  version: 0.0.1
  args: {work_dir: .}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {$work_dir: /work}
  steps:
    - name: classify
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
amended:
  version: 0.0.1
  args: {work_dir: ./my/work}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {./my/work: /work}
  steps:
    - name: classify
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      volumes: {./my/work: /work}
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      volumes: {./my/work: /work}
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      image: ninjaben/fizzbuzz:test
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      volumes: {./my/work: /work}
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
timing: {start: '2023-03-22T20:35:17.408306+00:00', finish: '2023-03-22T20:35:20.793819+00:00', duration: 3.385513}
step_results:
  - name: classify
    image_id: sha256:151156923039c0e5582094f39c9cfa49c3a4619a8916d97c4ef3fa68ac5d2dca
    exit_code: 0
    log_file: proceed_out/fizzbuzz/20230322T203517UTC/classify.log
    timing: {start: '2023-03-22T20:35:17.408673+00:00', finish: '2023-03-22T20:35:18.601493+00:00', duration: 1.19282}
    files_in:
      ./my/work: {classify_in.txt: 'sha256:93d4e5c77838e0aa5cb6647c385c810a7c2782bf769029e6c420052048ab22bb'}
    files_out:
      ./my/work: {classify_out.txt: 'sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c'}
    skipped: false
  - name: filter fizz
    image_id: sha256:151156923039c0e5582094f39c9cfa49c3a4619a8916d97c4ef3fa68ac5d2dca
    exit_code: 0
    log_file: proceed_out/fizzbuzz/20230322T203517UTC/filter_fizz.log
    timing: {start: '2023-03-22T20:35:18.618975+00:00', finish: '2023-03-22T20:35:19.697549+00:00', duration: 1.078574}
    files_in:
      ./my/work: {classify_out.txt: 'sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c'}
    files_out:
      ./my/work: {filter_fizz_out.txt: 'sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739'}
    skipped: false
  - name: filter buzz
    image_id: sha256:151156923039c0e5582094f39c9cfa49c3a4619a8916d97c4ef3fa68ac5d2dca
    exit_code: 0
    log_file: proceed_out/fizzbuzz/20230322T203517UTC/filter_buzz.log
    timing: {start: '2023-03-22T20:35:19.710451+00:00', finish: '2023-03-22T20:35:20.783412+00:00', duration: 1.072961}
    files_in:
      ./my/work: {filter_fizz_out.txt: 'sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739'}
    files_out:
      ./my/work: {filter_buzz_out.txt: 'sha256:238ca7760c45f60dc0826b18cbd245749e0f3bc054c728297132300c5f386141'}
    skipped: false
- original
The original section is the parsed pipeline spec from fizzbuzz.yaml. The YAML formatting might differ slightly, but the content is equivalent.
- amended
This is the original pipeline spec, but with args and the prototype applied. The $work_dir placeholder has been replaced with the value supplied at runtime, ./my/work. The prototype attributes have been applied to each step. These amended steps are the ones that actually get executed.
- step_results: files_in and files_out
Before and after running each step, Proceed checked for files matching the step’s match_in and match_out patterns. It recorded the matching files, along with their checksums.
Here’s a simple audit we can do using checksums.
Search this page for the text sha256:5038b8da.
This checksum appears under files_out for the “classify” step and under files_in for the “filter fizz” step.
So, the execution record has explicitly documented continuity between the steps.
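The same audit can be automated. The sketch below works on an excerpt of step_results copied from the record above and written as plain Python data; in practice you would load execution_record.yaml with a YAML parser of your choice. The function name find_handoffs is hypothetical.

```python
# Excerpt of step_results from the execution record, as plain Python data.
step_results = [
    {
        "name": "classify",
        "files_in": {"./my/work": {"classify_in.txt": "sha256:93d4e5c77838e0aa5cb6647c385c810a7c2782bf769029e6c420052048ab22bb"}},
        "files_out": {"./my/work": {"classify_out.txt": "sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c"}},
    },
    {
        "name": "filter fizz",
        "files_in": {"./my/work": {"classify_out.txt": "sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c"}},
        "files_out": {"./my/work": {"filter_fizz_out.txt": "sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739"}},
    },
]


def find_handoffs(step_results):
    """Return (producer, consumer, file name) for each files_in digest that
    matches a files_out digest recorded by an earlier step."""
    produced = {}  # digest -> name of the step that produced it
    handoffs = []
    for result in step_results:
        for files in result.get("files_in", {}).values():
            for file_name, digest in files.items():
                if digest in produced:
                    handoffs.append((produced[digest], result["name"], file_name))
        for files in result.get("files_out", {}).values():
            for file_name, digest in files.items():
                produced[digest] = result["name"]
    return handoffs


for producer, consumer, file_name in find_handoffs(step_results):
    print(f"{producer} -> {consumer}: {file_name}")
```

An input file with no matching earlier output, like classify_in.txt here, is simply not reported; it came from outside the pipeline.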
repeat execution
Finally, let’s try running the same pipeline again, without making changes.
$ proceed run fizzbuzz.yaml --args work_dir=./my/work
This time the logged output is shorter.
2023-03-22 16:49:16,222 [INFO] Proceed 0.0.1
2023-03-22 16:49:16,222 [INFO] Using output directory: proceed_out/fizzbuzz/20230322T204916UTC
2023-03-22 16:49:16,222 [INFO] Parsing pipeline specification from: fizzbuzz.yaml
2023-03-22 16:49:16,229 [INFO] Running pipeline with args: {'work_dir': './my/work'}
2023-03-22 16:49:16,229 [INFO] Starting pipeline run.
2023-03-22 16:49:16,230 [INFO] Step 'classify': starting.
2023-03-22 16:49:16,230 [INFO] Computing content hash (sha256) for file: my/work/classify_out.txt
2023-03-22 16:49:16,231 [INFO] Step 'classify': found 1 done files, skipping execution.
2023-03-22 16:49:16,231 [INFO] Step 'filter fizz': starting.
2023-03-22 16:49:16,231 [INFO] Computing content hash (sha256) for file: my/work/filter_fizz_out.txt
2023-03-22 16:49:16,232 [INFO] Step 'filter fizz': found 1 done files, skipping execution.
2023-03-22 16:49:16,232 [INFO] Step 'filter buzz': starting.
2023-03-22 16:49:16,232 [INFO] Computing content hash (sha256) for file: my/work/filter_buzz_out.txt
2023-03-22 16:49:16,232 [INFO] Step 'filter buzz': found 1 done files, skipping execution.
2023-03-22 16:49:16,232 [INFO] Finished pipeline run.
2023-03-22 16:49:16,233 [INFO] Writing execution record to: proceed_out/fizzbuzz/20230322T204916UTC/execution_record.yaml
2023-03-22 16:49:16,243 [INFO] Completed 3 steps successfully.
2023-03-22 16:49:16,244 [INFO] OK.
It’s shorter because Proceed found the “done file” for each step and decided to skip re-executing the steps.
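The skip decision described here can be sketched with glob matching from the Python standard library. This is a rough illustration of the idea, not Proceed’s code: the function names find_done_files and should_skip are hypothetical, and it assumes each pattern is matched relative to the host directory of each volume.

```python
import tempfile
from pathlib import Path


def find_done_files(volume_dirs, match_done):
    """Return files in any volume directory that match any of the glob patterns."""
    found = []
    for volume_dir in volume_dirs:
        for pattern in match_done:
            found.extend(sorted(Path(volume_dir).glob(pattern)))
    return found


def should_skip(volume_dirs, match_done):
    """A step is skipped when at least one done file is found."""
    return len(find_done_files(volume_dirs, match_done)) > 0


# Demonstrate with a temporary directory standing in for work_dir.
with tempfile.TemporaryDirectory() as work_dir:
    # First run: no done file yet, so the step would execute.
    print(should_skip([work_dir], ["classify_out.txt"]))

    # After the step writes its output, a repeat run would skip it.
    (Path(work_dir) / "classify_out.txt").write_text("1\n2\n3 fizz\n")
    print(should_skip([work_dir], ["classify_out.txt"]))
```

Note that this kind of check looks only at whether a matching file exists, not at whether its contents are still what the step would produce. That is why the recorded checksums matter for auditing.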
The step_results in the execution_record.yaml are also shorter in this case.
original:
  version: 0.0.1
  args: {work_dir: .}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {$work_dir: /work}
  steps:
    - name: classify
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
amended:
  version: 0.0.1
  args: {work_dir: ./my/work}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {./my/work: /work}
  steps:
    - name: classify
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      volumes: {./my/work: /work}
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      volumes: {./my/work: /work}
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      image: ninjaben/fizzbuzz:test
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      volumes: {./my/work: /work}
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
timing: {start: '2023-03-22T20:49:16.229881+00:00', finish: '2023-03-22T20:49:16.232928+00:00', duration: 0.003047}
step_results:
  - name: classify
    timing: {start: '2023-03-22 20:49:16.230342+00:00'}
    files_done:
      ./my/work: {classify_out.txt: 'sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c'}
    skipped: true
  - name: filter fizz
    timing: {start: '2023-03-22 20:49:16.231499+00:00'}
    files_done:
      ./my/work: {filter_fizz_out.txt: 'sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739'}
    skipped: true
  - name: filter buzz
    timing: {start: '2023-03-22 20:49:16.232366+00:00'}
    files_done:
      ./my/work: {filter_buzz_out.txt: 'sha256:238ca7760c45f60dc0826b18cbd245749e0f3bc054c728297132300c5f386141'}
    skipped: true
The step_results now have skipped: true to record the fact that they were not re-executed.
They also have files_done recording matches to their match_done patterns.
We can do another simple audit to check whether skipping was a good idea.
Search again for the text sha256:5038b8da.
Note that the checksum appears again, under files_done for the “classify” step.
This tells us the output file, including its contents, is unchanged from the first execution.
If you have a pipeline with long-running steps, skipping re-execution with match_done might save you time and hassle.
You can use the recorded checksums to audit whether anything changed unexpectedly and/or confirm continuity between steps and pipeline runs.
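You can also recompute a checksum yourself and compare it to a recorded value. Here is a minimal sketch using Python’s hashlib. It assumes the recorded digests are SHA-256 hashes of the raw file bytes, formatted as sha256:<hex>, which is consistent with the log messages above but is not confirmed here. The function names content_hash and is_unchanged are hypothetical.

```python
import hashlib
from pathlib import Path


def content_hash(path):
    """Return "sha256:<hex digest>" for the raw bytes of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def is_unchanged(path, recorded):
    """Compare a file's current digest to one copied from an execution record."""
    return content_hash(path) == recorded


# Digest recorded for classify_out.txt in both execution records above.
recorded = "sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c"
out_file = Path("./my/work/classify_out.txt")
if out_file.exists():
    print("unchanged" if is_unchanged(out_file, recorded) else "changed")
```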