Tutorial: Fizz Buzz¶
Here is a longer tutorial to follow on from the Tutorial: Hello World example. It demonstrates additional Proceed features:
- breaking down a task into several explicit steps
- using YAML to declare a pipeline with several steps
- configuring the pipeline at runtime with proceed.model.Pipeline.args and a proceed.model.Pipeline.prototype
- recording and auditing checksums for input and output files
- skipping steps that are already complete
This example is based on the Fizz Buzz math game.
breaking down the task¶
This pipeline will play Fizzbuzz for the numbers 1 - 100. Its goal is to output a file that contains only the “fizzbuzz” numbers, which are the numbers divisible by both 3 (aka “fizz”) and 5 (aka “buzz”). To achieve this it will factor the task into three steps.
- input
The input to the first step will be a text file with the numbers 1 - 100, one number per line.
1
2
3
...
90
91
92
93
94
95
96
97
98
99
100
- step 1: “classify”
This first step will classify each number depending on its divisibility: “fizz” for divisibility by 3, “buzz” for 5, and “fizzbuzz” for both. As output it will produce a new text file with the same numbers and lines as above, plus a classifying suffix “fizz”, “buzz”, or “fizzbuzz”, on appropriate lines.
1
2
3 fizz
...
90 fizzbuzz
91
92
93 fizz
94
95 buzz
96 fizz
97
98
99 fizz
100 buzz
- step 2: “filter fizz”
This middle step will filter the results of the “classify” step. It will output a new text file with only the lines that contain the word “fizz”.
3 fizz
...
90 fizzbuzz
93 fizz
96 fizz
99 fizz
- step 3: “filter buzz”
The last step will filter again, this time on the results of the “filter fizz” step. It will output a final text file with only the lines that contain the word “buzz”.
15 fizzbuzz
30 fizzbuzz
45 fizzbuzz
60 fizzbuzz
75 fizzbuzz
90 fizzbuzz
filter_buzz_expected.txt is the expected “filter buzz” output – the goal of the pipeline.
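To make the three-step breakdown concrete, here is a minimal in-memory sketch in Python. It is not the script shipped in the ninjaben/fizzbuzz:test image; the function names `classify` and `filter_lines` are hypothetical and chosen only to mirror the step names above.

```python
def classify(lines):
    """Append "fizz", "buzz", or "fizzbuzz" to each number line, as appropriate."""
    classified = []
    for line in lines:
        number = int(line)
        suffix = ("fizz" if number % 3 == 0 else "") + ("buzz" if number % 5 == 0 else "")
        # strip() removes the trailing space when there is no suffix.
        classified.append(f"{number} {suffix}".strip())
    return classified


def filter_lines(lines, substring):
    """Keep only the lines that contain the given substring."""
    return [line for line in lines if substring in line]


# input: the numbers 1 - 100, one per line
numbers = [str(n) for n in range(1, 101)]

# step 1: "classify"
classified = classify(numbers)

# step 2: "filter fizz" keeps "fizz" and "fizzbuzz" lines
fizz_lines = filter_lines(classified, "fizz")

# step 3: "filter buzz" keeps only the "fizzbuzz" lines that remain
fizzbuzz_lines = filter_lines(fizz_lines, "buzz")

print("\n".join(fizzbuzz_lines))
```

Filtering twice is deliberately redundant, since one filter on “fizzbuzz” would give the same result. The extra step gives the pipeline an intermediate file to record and audit.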
The implementation below shows how to express this high-level approach as a Proceed pipeline.
pipeline spec¶
Let’s start with the whole pipeline spec in YAML. Then below, we’ll look at each part.
version: 0.0.1
args:
  work_dir: "."
prototype:
  image: ninjaben/fizzbuzz:test
  volumes:
    $work_dir: /work
steps:
  - name: classify
    command: [/work/classify_in.txt, /work/classify_out.txt, classify]
    match_done: [classify_out.txt]
    match_in: [classify_in.txt]
    match_out: [classify_out.txt]
  - name: filter fizz
    command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
    match_done: [filter_fizz_out.txt]
    match_in: [classify_out.txt]
    match_out: [filter_fizz_out.txt]
  - name: filter buzz
    command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
    match_done: [filter_buzz_out.txt]
    match_in: [filter_fizz_out.txt]
    match_out: [filter_buzz_out.txt]
Here’s what each section of the pipeline spec does:
version
This is the version of Proceed itself. It allows Proceed to check for compatibility between the spec and the installed version of Proceed.
args
This is a key-value mapping of expected arguments to the pipeline, and their default values. This example has only one arg, work_dir, with a default value of "." (the current directory). Elsewhere in the spec, placeholders like $work_dir or ${work_dir} will be replaced at runtime with the arg value. We’ll see below in pipeline execution how to specify other arg values at runtime.
prototype
This is a place to put step attributes that all steps have in common. Each prototype attribute will be applied to each of the steps below.
image
We want all steps to use the same Docker image, ninjaben/fizzbuzz:test. This image contains a Python runtime and a Python script for playing Fizzbuzz.
volumes
We also want all steps to see the same filesystem volumes. Our work_dir on the host will appear inside each step’s container at the path /work. At runtime we’ll be able to choose any work_dir we want, but steps will always see it as /work. This consistency simplifies the code running in steps.
steps
Steps are the heart of the pipeline – the list of processes to run, in order, to achieve the goal.
name
Each step gets its own name, to tell it apart from others in logs and the execution record.
image
Every step needs a container image to provide the runtime environment, dependencies, and processing code. These steps all inherit their image from the prototype: ninjaben/fizzbuzz:test.
command
Each step command runs inside its container. This means the command syntax can be anything supported by the image. The commands in this example are passed to a Python script for playing Fizzbuzz. Each command specifies an input file, an output file, and an operation like “classify” or “filter”.
volumes
These steps all inherit their work_dir volume from the prototype.
match_done
Steps can use “done files” to mark when they’re complete. Proceed will check for the existence of any done files before running each step, and skip the step if any are found. Each glob pattern in the match_done list will be matched against each step volume.
match_in
Proceed will check for any “in” files before running each step, and record the checksums of these files in the execution record. These files don’t affect step execution, but they support audits for things like reproducibility. Each glob pattern in the match_in list will be matched against each step volume.
match_out
Proceed will check for any “out” files after running each step, and record the checksums of these files in the execution record. These files don’t affect step execution, but they support audits for things like reproducibility. Each glob pattern in the match_out list will be matched against each step volume.
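As a rough illustration of how args and the prototype might combine into the steps that actually run, here is a small Python sketch. It is not Proceed's implementation. It assumes the placeholder syntax behaves like Python's `string.Template`, and that an attribute set directly on a step takes precedence over the prototype. The helper names `apply_args` and `apply_prototype` are hypothetical.

```python
from string import Template


def apply_args(value, args):
    """Recursively replace $name / ${name} placeholders in strings, lists, and dicts."""
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(value).safe_substitute(args)
    if isinstance(value, list):
        return [apply_args(item, args) for item in value]
    if isinstance(value, dict):
        return {apply_args(k, args): apply_args(v, args) for k, v in value.items()}
    return value


def apply_prototype(step, prototype):
    """Fill in shared attributes. Assumption: values set on the step itself win."""
    return {**prototype, **step}


spec = {
    "args": {"work_dir": "."},
    "prototype": {"image": "ninjaben/fizzbuzz:test", "volumes": {"$work_dir": "/work"}},
    "steps": [
        {"name": "classify",
         "command": ["/work/classify_in.txt", "/work/classify_out.txt", "classify"]},
    ],
}

# Runtime values override the defaults declared in the spec.
runtime_args = {**spec["args"], "work_dir": "./my/work"}

prototype = apply_args(spec["prototype"], runtime_args)
amended_steps = [
    apply_prototype(apply_args(step, runtime_args), prototype)
    for step in spec["steps"]
]

print(amended_steps[0]["image"])
print(amended_steps[0]["volumes"])
```

Under these assumptions the classify step ends up with the shared image and with the host directory ./my/work mapped to /work.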
pipeline execution¶
If you have Proceed installed, you can run this pipeline.
First, create a file fizzbuzz.yaml that contains the YAML pipeline spec above.
Next, create a work_dir for the pipeline to use. This can be any local directory, for example ./my/work.
$ mkdir -p ./my/work
Create the input file that starts off the game of Fizzbuzz. You can type the numbers 1-100 into ./my/work/classify_in.txt by hand, or copy classify_in.txt right out of the Proceed integration tests.
$ wget -O ./my/work/classify_in.txt https://raw.githubusercontent.com/benjamin-heasly/proceed/main/tests/fizzbuzz/fixture_files/classify_in.txt
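If you would rather generate the input file than type or download it, a few lines of Python will do. This is only a sketch: the helper name `write_classify_input` is hypothetical, and a generated file may differ from the test fixture in trailing whitespace, in which case its checksum will not match the one shown later in the execution record.

```python
from pathlib import Path


def write_classify_input(work_dir, first=1, last=100):
    """Write the numbers first..last, one per line, to <work_dir>/classify_in.txt."""
    path = Path(work_dir) / "classify_in.txt"
    # Create the work_dir if it doesn't exist yet, like `mkdir -p`.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{n}\n" for n in range(first, last + 1)))
    return path


input_path = write_classify_input("./my/work")
print(f"Wrote {len(input_path.read_text().splitlines())} lines to {input_path}")
```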
Execute the pipeline using the proceed command, passing in a value for the work_dir arg:
$ proceed run fizzbuzz.yaml --args work_dir=./my/work
A successful run should produce log output similar to the following:
2023-03-22 16:35:17,403 [INFO] Proceed 0.0.1
2023-03-22 16:35:17,403 [INFO] Using output directory: proceed_out/fizzbuzz/20230322T203517UTC
2023-03-22 16:35:17,403 [INFO] Parsing pipeline specification from: fizzbuzz.yaml
2023-03-22 16:35:17,408 [INFO] Running pipeline with args: {'work_dir': './my/work'}
2023-03-22 16:35:17,408 [INFO] Starting pipeline run.
2023-03-22 16:35:17,408 [INFO] Step 'classify': starting.
2023-03-22 16:35:17,408 [INFO] Computing content hash (sha256) for file: my/work/classify_in.txt
2023-03-22 16:35:17,409 [INFO] Step 'classify': found 1 input files.
2023-03-22 16:35:17,933 [INFO] Step 'classify': waiting for process to complete.
2023-03-22 16:35:18,144 [INFO] Step 'classify': OK.
2023-03-22 16:35:18,563 [INFO] Step 'classify': process completed with exit code 0
2023-03-22 16:35:18,600 [INFO] Computing content hash (sha256) for file: my/work/classify_out.txt
2023-03-22 16:35:18,601 [INFO] Step 'classify': found 1 output files.
2023-03-22 16:35:18,601 [INFO] Step 'classify': finished.
2023-03-22 16:35:18,618 [INFO] Step 'filter fizz': starting.
2023-03-22 16:35:18,619 [INFO] Computing content hash (sha256) for file: my/work/classify_out.txt
2023-03-22 16:35:18,621 [INFO] Step 'filter fizz': found 1 input files.
2023-03-22 16:35:19,273 [INFO] Step 'filter fizz': waiting for process to complete.
2023-03-22 16:35:19,378 [INFO] Step 'filter fizz': OK.
2023-03-22 16:35:19,653 [INFO] Step 'filter fizz': process completed with exit code 0
2023-03-22 16:35:19,696 [INFO] Computing content hash (sha256) for file: my/work/filter_fizz_out.txt
2023-03-22 16:35:19,697 [INFO] Step 'filter fizz': found 1 output files.
2023-03-22 16:35:19,697 [INFO] Step 'filter fizz': finished.
2023-03-22 16:35:19,710 [INFO] Step 'filter buzz': starting.
2023-03-22 16:35:19,711 [INFO] Computing content hash (sha256) for file: my/work/filter_fizz_out.txt
2023-03-22 16:35:19,712 [INFO] Step 'filter buzz': found 1 input files.
2023-03-22 16:35:20,271 [INFO] Step 'filter buzz': waiting for process to complete.
2023-03-22 16:35:20,444 [INFO] Step 'filter buzz': OK.
2023-03-22 16:35:20,743 [INFO] Step 'filter buzz': process completed with exit code 0
2023-03-22 16:35:20,782 [INFO] Computing content hash (sha256) for file: my/work/filter_buzz_out.txt
2023-03-22 16:35:20,783 [INFO] Step 'filter buzz': found 1 output files.
2023-03-22 16:35:20,783 [INFO] Step 'filter buzz': finished.
2023-03-22 16:35:20,793 [INFO] Finished pipeline run.
2023-03-22 16:35:20,794 [INFO] Writing execution record to: proceed_out/fizzbuzz/20230322T203517UTC/execution_record.yaml
2023-03-22 16:35:20,804 [INFO] Completed 3 steps successfully.
2023-03-22 16:35:20,805 [INFO] OK.
Proceed logs its own intentions and actions, and incorporates the output from each step.
Below, we’ll look at some of the auditable outputs from the pipeline run.
auditable outputs¶
The Fizz Buzz pipeline should have produced several auditable outputs in its working subdirectory.
proceed_out/
│
├─ fizzbuzz/
│ │
│ ├─ 20230322T203517UTC/
│ │ │
│ │ ├─ proceed.log
│ │ ├─ classify.log
│ │ ├─ filter_fizz.log
│ │ ├─ filter_buzz.log
│ │ ├─ execution_record.yaml
step logs¶
The *.log files are durable versions of the command output we saw above.
execution record¶
The execution_record.yaml has some new, interesting sections. It’s a long-ish document, so we’ll focus on specific parts.
original:
  version: 0.0.1
  args: {work_dir: .}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {$work_dir: /work}
  steps:
    - name: classify
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
amended:
  version: 0.0.1
  args: {work_dir: ./my/work}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {./my/work: /work}
  steps:
    - name: classify
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      volumes: {./my/work: /work}
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      volumes: {./my/work: /work}
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      image: ninjaben/fizzbuzz:test
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      volumes: {./my/work: /work}
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
timing: {start: '2023-03-22T20:35:17.408306+00:00', finish: '2023-03-22T20:35:20.793819+00:00', duration: 3.385513}
step_results:
  - name: classify
    image_id: sha256:151156923039c0e5582094f39c9cfa49c3a4619a8916d97c4ef3fa68ac5d2dca
    exit_code: 0
    log_file: proceed_out/fizzbuzz/20230322T203517UTC/classify.log
    timing: {start: '2023-03-22T20:35:17.408673+00:00', finish: '2023-03-22T20:35:18.601493+00:00', duration: 1.19282}
    files_in:
      ./my/work: {classify_in.txt: 'sha256:93d4e5c77838e0aa5cb6647c385c810a7c2782bf769029e6c420052048ab22bb'}
    files_out:
      ./my/work: {classify_out.txt: 'sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c'}
    skipped: false
  - name: filter fizz
    image_id: sha256:151156923039c0e5582094f39c9cfa49c3a4619a8916d97c4ef3fa68ac5d2dca
    exit_code: 0
    log_file: proceed_out/fizzbuzz/20230322T203517UTC/filter_fizz.log
    timing: {start: '2023-03-22T20:35:18.618975+00:00', finish: '2023-03-22T20:35:19.697549+00:00', duration: 1.078574}
    files_in:
      ./my/work: {classify_out.txt: 'sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c'}
    files_out:
      ./my/work: {filter_fizz_out.txt: 'sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739'}
    skipped: false
  - name: filter buzz
    image_id: sha256:151156923039c0e5582094f39c9cfa49c3a4619a8916d97c4ef3fa68ac5d2dca
    exit_code: 0
    log_file: proceed_out/fizzbuzz/20230322T203517UTC/filter_buzz.log
    timing: {start: '2023-03-22T20:35:19.710451+00:00', finish: '2023-03-22T20:35:20.783412+00:00', duration: 1.072961}
    files_in:
      ./my/work: {filter_fizz_out.txt: 'sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739'}
    files_out:
      ./my/work: {filter_buzz_out.txt: 'sha256:238ca7760c45f60dc0826b18cbd245749e0f3bc054c728297132300c5f386141'}
    skipped: false
original
The original section is the parsed pipeline spec from fizzbuzz.yaml. The YAML formatting might differ slightly, but the content is equivalent.
amended
This is the original pipeline spec, but with args and the prototype applied. The $work_dir placeholder has been replaced with the value supplied at runtime, ./my/work. The prototype attributes have been applied to each step. These amended steps are the ones that actually get executed.
step_results: files_in and files_out
Before and after running each step, Proceed checked for files matching the step’s match_in and match_out patterns. It recorded the matching files, along with their checksums.
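To check a recorded checksum independently, you could recompute it yourself. The sketch below assumes the recorded value is the SHA-256 digest of the raw file bytes, written as hex with a "sha256:" prefix. That matches the format in the record and the “Computing content hash (sha256)” log lines, but the exact hashing procedure is an assumption, and `content_hash` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=65536):
    """Return "sha256:<hex digest>" for the bytes of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


# Demonstrate on a throwaway file: same bytes, same hash; changed bytes, new hash.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_bytes(b"15 fizzbuzz\n")
    first = content_hash(demo)
    second = content_hash(demo)
    demo.write_bytes(b"15 fizz\n")
    changed = content_hash(demo)

print(first == second)   # True: identical content
print(first == changed)  # False: content was modified
```

To audit a real run, call `content_hash("./my/work/classify_out.txt")` and compare the result with the value under files_out.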
Here’s a simple audit we can do using checksums. Search this page for the text sha256:5038b8da. This checksum appears under files_out for the “classify” step and under files_in for the “filter fizz” step. So, the execution record has explicitly documented continuity between the steps.
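The same continuity check can be automated. The following sketch works on the step_results from the record above, copied into a Python literal so the example is self-contained. In practice you would load execution_record.yaml with a YAML parser. The function name `check_continuity` is hypothetical.

```python
CLASSIFY_IN = "sha256:93d4e5c77838e0aa5cb6647c385c810a7c2782bf769029e6c420052048ab22bb"
CLASSIFY_OUT = "sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c"
FILTER_FIZZ_OUT = "sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739"
FILTER_BUZZ_OUT = "sha256:238ca7760c45f60dc0826b18cbd245749e0f3bc054c728297132300c5f386141"

# Only the fields needed for the audit, taken from step_results above.
step_results = [
    {"name": "classify",
     "files_in": {"./my/work": {"classify_in.txt": CLASSIFY_IN}},
     "files_out": {"./my/work": {"classify_out.txt": CLASSIFY_OUT}}},
    {"name": "filter fizz",
     "files_in": {"./my/work": {"classify_out.txt": CLASSIFY_OUT}},
     "files_out": {"./my/work": {"filter_fizz_out.txt": FILTER_FIZZ_OUT}}},
    {"name": "filter buzz",
     "files_in": {"./my/work": {"filter_fizz_out.txt": FILTER_FIZZ_OUT}},
     "files_out": {"./my/work": {"filter_buzz_out.txt": FILTER_BUZZ_OUT}}},
]


def check_continuity(results):
    """List (producer, consumer, file, checksums_match) for files passed between consecutive steps."""
    findings = []
    for producer, consumer in zip(results, results[1:]):
        for volume, files in producer.get("files_out", {}).items():
            consumed = consumer.get("files_in", {}).get(volume, {})
            for file_name, checksum in files.items():
                if file_name in consumed:
                    matches = consumed[file_name] == checksum
                    findings.append((producer["name"], consumer["name"], file_name, matches))
    return findings


for producer, consumer, file_name, matches in check_continuity(step_results):
    status = "OK" if matches else "MISMATCH"
    print(f"{producer} -> {consumer}: {file_name} {status}")
```

A mismatch would mean the file changed between the time one step wrote it and the next step read it.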
repeat execution¶
Finally, let’s try running the same pipeline again, without making changes.
$ proceed run fizzbuzz.yaml --args work_dir=./my/work
This time the logged output is shorter.
2023-03-22 16:49:16,222 [INFO] Proceed 0.0.1
2023-03-22 16:49:16,222 [INFO] Using output directory: proceed_out/fizzbuzz/20230322T204916UTC
2023-03-22 16:49:16,222 [INFO] Parsing pipeline specification from: fizzbuzz.yaml
2023-03-22 16:49:16,229 [INFO] Running pipeline with args: {'work_dir': './my/work'}
2023-03-22 16:49:16,229 [INFO] Starting pipeline run.
2023-03-22 16:49:16,230 [INFO] Step 'classify': starting.
2023-03-22 16:49:16,230 [INFO] Computing content hash (sha256) for file: my/work/classify_out.txt
2023-03-22 16:49:16,231 [INFO] Step 'classify': found 1 done files, skipping execution.
2023-03-22 16:49:16,231 [INFO] Step 'filter fizz': starting.
2023-03-22 16:49:16,231 [INFO] Computing content hash (sha256) for file: my/work/filter_fizz_out.txt
2023-03-22 16:49:16,232 [INFO] Step 'filter fizz': found 1 done files, skipping execution.
2023-03-22 16:49:16,232 [INFO] Step 'filter buzz': starting.
2023-03-22 16:49:16,232 [INFO] Computing content hash (sha256) for file: my/work/filter_buzz_out.txt
2023-03-22 16:49:16,232 [INFO] Step 'filter buzz': found 1 done files, skipping execution.
2023-03-22 16:49:16,232 [INFO] Finished pipeline run.
2023-03-22 16:49:16,233 [INFO] Writing execution record to: proceed_out/fizzbuzz/20230322T204916UTC/execution_record.yaml
2023-03-22 16:49:16,243 [INFO] Completed 3 steps successfully.
2023-03-22 16:49:16,244 [INFO] OK.
It’s shorter because Proceed found the “done file” for each step and decided to skip re-executing the steps.
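The skip decision can be pictured as a glob search over each step volume. The sketch below is an approximation, not Proceed's actual code. `find_done_files` and `should_skip` are hypothetical names, and matching patterns relative to each volume's host directory is an assumption based on the match_done description.

```python
import tempfile
from pathlib import Path


def find_done_files(volumes, match_done):
    """Return {volume: [relative file paths]} for files matching any done pattern."""
    found = {}
    for volume in volumes:
        matches = sorted({
            str(path.relative_to(volume))
            for pattern in match_done
            for path in Path(volume).glob(pattern)
            if path.is_file()
        })
        if matches:
            found[volume] = matches
    return found


def should_skip(volumes, match_done):
    """A step is skipped if any done file is found in any volume."""
    return bool(find_done_files(volumes, match_done))


# Try it on a throwaway work_dir: no done file at first, then one appears.
with tempfile.TemporaryDirectory() as work_dir:
    before = should_skip([work_dir], ["classify_out.txt"])
    (Path(work_dir) / "classify_out.txt").write_text("1\n2\n3 fizz\n")
    after = should_skip([work_dir], ["classify_out.txt"])

print(before)  # False: the step would run
print(after)   # True: the step would be skipped
```

A practical consequence is that deleting a step's done file from the work_dir should cause that step to run again on the next execution.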
The step_results in the execution_record.yaml are also shorter in this case.
original:
  version: 0.0.1
  args: {work_dir: .}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {$work_dir: /work}
  steps:
    - name: classify
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
amended:
  version: 0.0.1
  args: {work_dir: ./my/work}
  prototype:
    image: ninjaben/fizzbuzz:test
    volumes: {./my/work: /work}
  steps:
    - name: classify
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_in.txt, /work/classify_out.txt, classify]
      volumes: {./my/work: /work}
      match_done: [classify_out.txt]
      match_in: [classify_in.txt]
      match_out: [classify_out.txt]
    - name: filter fizz
      image: ninjaben/fizzbuzz:test
      command: [/work/classify_out.txt, /work/filter_fizz_out.txt, filter, --substring, fizz]
      volumes: {./my/work: /work}
      match_done: [filter_fizz_out.txt]
      match_in: [classify_out.txt]
      match_out: [filter_fizz_out.txt]
    - name: filter buzz
      image: ninjaben/fizzbuzz:test
      command: [/work/filter_fizz_out.txt, /work/filter_buzz_out.txt, filter, --substring, buzz]
      volumes: {./my/work: /work}
      match_done: [filter_buzz_out.txt]
      match_in: [filter_fizz_out.txt]
      match_out: [filter_buzz_out.txt]
timing: {start: '2023-03-22T20:49:16.229881+00:00', finish: '2023-03-22T20:49:16.232928+00:00', duration: 0.003047}
step_results:
  - name: classify
    timing: {start: '2023-03-22 20:49:16.230342+00:00'}
    files_done:
      ./my/work: {classify_out.txt: 'sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c'}
    skipped: true
  - name: filter fizz
    timing: {start: '2023-03-22 20:49:16.231499+00:00'}
    files_done:
      ./my/work: {filter_fizz_out.txt: 'sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739'}
    skipped: true
  - name: filter buzz
    timing: {start: '2023-03-22 20:49:16.232366+00:00'}
    files_done:
      ./my/work: {filter_buzz_out.txt: 'sha256:238ca7760c45f60dc0826b18cbd245749e0f3bc054c728297132300c5f386141'}
    skipped: true
The step_results now have skipped: true to record the fact that they were not re-executed. They also have files_done recording matches to their match_done patterns.
We can do another simple audit to check whether skipping was a good idea. Search again for the text sha256:5038b8da. Note that the checksum appears again, under files_done for the “classify” step. This tells us the output file, including its contents, is unchanged from the first execution.
If you have a pipeline with long-running steps, skipping re-execution with match_done might save you time and hassle. You can use the recorded checksums to audit whether anything changed unexpectedly and/or confirm continuity between steps and pipeline runs.
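An audit across runs could be scripted along these lines. This sketch compares each skipped step's files_done from the second run against the same step's files_out from the first run, using values copied from the two records above. It relies on the fact that in this pipeline each step's match_done and match_out patterns name the same file. `unchanged_since` is a hypothetical helper.

```python
CLASSIFY_OUT = "sha256:5038b8da5a03357397abcd9661dd19bf4ece2d14322e86a7461dda11866d842c"
FILTER_FIZZ_OUT = "sha256:d1b54ec5994f1c23df98986929c1cd44a991b39b60d7e610752d84f370916739"
FILTER_BUZZ_OUT = "sha256:238ca7760c45f60dc0826b18cbd245749e0f3bc054c728297132300c5f386141"

# Relevant fields from the first run's step_results.
first_run = [
    {"name": "classify", "files_out": {"./my/work": {"classify_out.txt": CLASSIFY_OUT}}},
    {"name": "filter fizz", "files_out": {"./my/work": {"filter_fizz_out.txt": FILTER_FIZZ_OUT}}},
    {"name": "filter buzz", "files_out": {"./my/work": {"filter_buzz_out.txt": FILTER_BUZZ_OUT}}},
]

# Relevant fields from the repeat run's step_results.
second_run = [
    {"name": "classify", "skipped": True,
     "files_done": {"./my/work": {"classify_out.txt": CLASSIFY_OUT}}},
    {"name": "filter fizz", "skipped": True,
     "files_done": {"./my/work": {"filter_fizz_out.txt": FILTER_FIZZ_OUT}}},
    {"name": "filter buzz", "skipped": True,
     "files_done": {"./my/work": {"filter_buzz_out.txt": FILTER_BUZZ_OUT}}},
]


def unchanged_since(earlier_results, later_results):
    """For each skipped step, report whether its done files match the earlier run's outputs."""
    earlier_by_name = {step["name"]: step for step in earlier_results}
    report = {}
    for step in later_results:
        if not step.get("skipped"):
            continue  # only skipped steps rely on previously produced files
        earlier = earlier_by_name.get(step["name"], {})
        report[step["name"]] = step.get("files_done", {}) == earlier.get("files_out", {})
    return report


print(unchanged_since(first_run, second_run))
```

A False entry for any step would suggest its output was modified between runs, in which case you might delete the done file and let the step run again.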