BP006

- Number: BP006
- Title: Test outcome based on its output (AKA output-check)
- Author: Cleber Rosa <crosa@redhat.com>
- Discussions-To:
- Reviewers:
- Created:
- Type: Architecture Blueprint
- Status: Draft
TL;DR
The legacy runner implementation had a built-in feature that could decide the outcome of a test based on the output it generated. Given that this pattern is quite common, it’s understood that this functionality should be reimplemented in the new runner architecture. The goal of this BluePrint is to decide how to do so.
Motivations
The main motivation behind this BluePrint is to allow Avocado to be used (again) for a common use case pattern in software testing.

This use case pattern is based on a special execution of a test which generates a reference (also known as “golden”) output. Subsequent executions of the same test then have their outcome dependent (partially or completely) on the output they generate and whether it’s similar (or identical) to the reference output.
Previous implementation
The previous implementation was tied to the legacy runner. The examples given here are based on Avocado version 92.1, which needs the --test-runner=runner switch to activate the legacy runner.

We’ll be using the /bin/uname utility as a test. Without any command line switches, this utility generates (on a Linux system):
$ /bin/uname
Linux
Just to be sure: the Linux output is generated on STDOUT, and no output is generated on STDERR on this occasion.
Let’s assume that both the content of STDOUT (containing Linux) and the content of STDERR (or the lack of any content, to be precise) are the conditions for a successful execution of that “test”. Under the Avocado legacy runner on version 92.1, a user could run:
$ avocado run --test-runner=runner --output-check-record both -- /bin/uname
The output would be similar to:
JOB ID : 544f17afff172c43209fddc35edf7851c7b939aa
JOB LOG : /root/avocado/job-results/job-2023-05-04T21.58-544f17a/job.log
(1/1) /bin/uname: PASS (0.01 s)
RESULTS : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME : 0.12 s
Additionally, the following files would have been created:
$ cat /bin/uname.data/stdout.expected
Linux
$ wc -l /bin/uname.data/stderr.expected
0 /bin/uname.data/stderr.expected
From this point on, all future executions of /bin/uname as a test would include the comparison of the content generated during that execution with the previously recorded (and now expected) /bin/uname.data/stdout.expected and /bin/uname.data/stderr.expected files.
If /bin/uname were to produce anything other than Linux on STDOUT, or produce anything at all on STDERR, then the test would fail. To prove the point, let’s taint the reference /bin/uname.data/stdout.expected and re-run the test:
$ echo 'Non-Linux' > /bin/uname.data/stdout.expected
$ avocado run --test-runner=runner -- /bin/uname
JOB ID : 70f002c107ed638ecc87371a45d931a7d5239e72
JOB LOG : /root/avocado/job-results/job-2023-05-04T22.06-70f002c/job.log
(1/1) /bin/uname: FAIL: Actual test Stdout differs from expected one (0.01 s)
RESULTS : PASS 0 | ERROR 0 | FAIL 1 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME : 0.12 s
Limitations of the previous implementation
- The reference output is tied to the location of the file that contains the test. As can be seen in the example above, this is not practical for some filesystem paths that may be system-wide and even read-only.
- The “channels” of output that are compared are limited, that is, only the standard I/O streams STDOUT and STDERR can be used for comparison.
- Only a full match is considered to be a successful run. This causes difficulties when the output generated contains patterns that change, such as timestamps.
- It forced the concept of checking against STDOUT and STDERR to be applied to tests that would not normally have any awareness of those I/O streams.
Challenges introduced by the nrunner architecture
Apart from the challenges that are part of the limitations of the previous implementation, the nrunner architecture brings additional challenges (and some opportunities):

- How to apply the concept of output matching to a much more abstract concept of tests. Tests under nrunner can be pretty much anything a plugin writer determines. The magic test kind (FIXME: add link to example) generates no content at all, much less has support for the STDOUT or STDERR I/O streams. Tests may not even run in a separate process that would give them a clear separation of those channels.
- How to implement consistent output matching when one can have standalone runners. The question of how to have the same match policies (refer to the following section) applied to the output produced by varied runners needs to be addressed.
Proposed implementation
To back the proposed implementation, a few new concepts have to be introduced and discussed first.
Match policy
As explained before, the previous implementation had an all-or-nothing match mechanism: either all the content fully matches what’s recorded in the reference, or the test execution becomes a FAIL.
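For illustration, that all-or-nothing behavior could itself be expressed as a match policy function. This is a minimal sketch, assuming the same hypothetical (reference, actual, **kwargs) signature used by the examples in the following sections:

def match_exact(reference, actual, **kwargs):
    # all-or-nothing policy: the actual content must be identical
    # to the recorded reference content
    return reference == actual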
Numeric Margin of Error
It can be helpful to have custom match policies. For instance, a function such as:
def match_margin_of_error(reference, actual, **kwargs):
    margin = kwargs.get("margin", 0.05)
    upper_bound = reference + (reference * margin)
    lower_bound = reference - (reference * margin)
    return lower_bound <= actual <= upper_bound
could be used to implement a “margin of error” match policy that would not flag every minor variation of content as a failure.
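For instance, assuming the reference and actual contents have been parsed into numbers before matching (a detail this sketch leaves open):

# 103 is within the default 5% margin of a reference value of 100
assert match_margin_of_error(100, 103) is True
# 110 is outside the default 5% margin
assert match_margin_of_error(100, 110) is False
# the margin can be tuned per test
assert match_margin_of_error(100, 110, margin=0.15) is True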
Change Threshold
One other use case is to allow for incremental changes to be considered normal. For instance, a regular execution of the command qemu-system-x86_64 -machine help produces:
Supported machines are:
microvm microvm (i386)
xenfv-4.2 Xen Fully-virtualized PC
xenfv Xen Fully-virtualized PC (alias of xenfv-3.1)
xenfv-3.1 Xen Fully-virtualized PC
pc Standard PC (i440FX + PIIX, 1996) (alias of pc-i440fx-7.0)
pc-i440fx-7.0 Standard PC (i440FX + PIIX, 1996) (default)
pc-i440fx-6.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-6.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-6.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-3.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-3.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.9 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.8 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.7 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.6 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.5 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.4 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.3 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.12 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.11 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.10 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-1.7 Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.6 Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.5 Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.4 Standard PC (i440FX + PIIX, 1996) (deprecated)
q35 Standard PC (Q35 + ICH9, 2009) (alias of pc-q35-7.0)
pc-q35-7.0 Standard PC (Q35 + ICH9, 2009)
Then, suppose one new machine type (my-custom) gets introduced:
Supported machines are:
my-custom My custom machine
microvm microvm (i386)
xenfv-4.2 Xen Fully-virtualized PC
xenfv Xen Fully-virtualized PC (alias of xenfv-3.1)
xenfv-3.1 Xen Fully-virtualized PC
pc Standard PC (i440FX + PIIX, 1996) (alias of pc-i440fx-7.0)
pc-i440fx-7.0 Standard PC (i440FX + PIIX, 1996) (default)
pc-i440fx-6.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-6.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-6.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-5.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-4.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-3.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-3.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.9 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.8 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.7 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.6 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.5 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.4 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.3 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.2 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.12 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.11 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.10 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.1 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.0 Standard PC (i440FX + PIIX, 1996)
pc-i440fx-1.7 Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.6 Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.5 Standard PC (i440FX + PIIX, 1996) (deprecated)
pc-i440fx-1.4 Standard PC (i440FX + PIIX, 1996) (deprecated)
q35 Standard PC (Q35 + ICH9, 2009) (alias of pc-q35-7.0)
pc-q35-7.0 Standard PC (Q35 + ICH9, 2009)
If the configured change threshold allowance is 5% and the output above is produced, the output check would be considered successful. But if a bug is introduced that causes all the other machine types to go missing, that is, running qemu-system-x86_64 -machine help results in:
Supported machines are:
my-custom My custom machine
It would exceed the change threshold allowance and result in a match failure. Such a policy could be implemented roughly as:
def match_change_threshold(reference, actual, **kwargs):
    # number of changed lines, similar to what "git diff --stat" reports
    changed_lines = get_lines_of_diff(reference, actual)
    threshold = kwargs.get("threshold", 0.03)
    return changed_lines <= (count_lines(reference) * threshold)
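The get_lines_of_diff and count_lines helpers are hypothetical. A minimal sketch of them, assuming the reference and actual contents are plain text, could be built on Python’s standard difflib module:

import difflib

def count_lines(content):
    # total number of lines in a blob of text content
    return len(content.splitlines())

def get_lines_of_diff(reference, actual):
    # count added and removed lines, similar to "git diff --stat"
    diff = difflib.unified_diff(reference.splitlines(),
                                actual.splitlines(),
                                lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))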
Output Channels
A test should provide information about the output it generates. For simplicity’s sake, it’s required that each output channel preserves its content in a file.
It’s still to be defined if:

- The list of output channels will be needed ahead of time (for instance, from the resolver or from avocado-runner-${kind} capabilities), like in:

$ avocado-runner-exec-test capabilities | python3 -m json.tool
{
    "runnables": [
        "exec-test"
    ],
    "commands": [
        "capabilities",
        "runnable-run",
        "runnable-run-recipe",
        "task-run",
        "task-run-recipe"
    ],
    "configuration_used": [
        "run.keep_tmp",
        "runner.exectest.exitcodes.skip"
    ],
    "output_produced": [
        "stdout",
        "stderr"
    ]
}

- Or whether it will be given as part of the runner messages at runtime, that is:

$ avocado-runner-exec-test runnable-run -k exec-test -u /bin/uname
{'status': 'started', 'time': 268933.149016764}
{'status': 'running', 'time': 268933.149923951}
{'type': 'stdout', 'log': b'Linux\n', 'status': 'running', 'time': 268933.160111141}
{'type': 'stderr', 'log': b'', 'status': 'running', 'time': 268933.160145956}
{'result': 'pass', 'returncode': 0, 'status': 'finished', 'time': 268933.160157613, 'output_produced': ['stdout', 'stderr']}
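Either way, a separate checker component could iterate over the declared channels and apply the selected match policy to each recorded file. A minimal sketch, assuming the match functions shown earlier and a ${channel}.expected file naming convention similar to the legacy stdout.expected files (both are assumptions, not a defined interface):

import os

def check_output(reference_dir, output_dir, channels, match=match_exact):
    # compare each declared output channel against its recorded reference
    for channel in channels:
        with open(os.path.join(reference_dir, f"{channel}.expected")) as f:
            reference = f.read()
        with open(os.path.join(output_dir, channel)) as f:
            actual = f.read()
        if not match(reference, actual):
            return False
    return True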
Chaining and Overriding of results
If the output check were to be implemented within the runners themselves, there would most probably be a lot of code duplication and possibly inconsistency among those implementations. It would also be more costly to implement it repeatedly.
To avoid those problems, it makes sense to have a separate component, called at a different phase, check the output produced. But this raises the question of how results are communicated and overridden.
Suppose the actual execution of a test results in a fail (or error). There’s no point in performing the output check, because both the execution and the output check must succeed for the test not to end in a final result of fail.
Now, suppose the actual execution of a test results in a pass. Then the output check component verifies the output, decides that it is not consistent with the reference, and produces a fail.
Another possibility is when a test results in a skip. Even though this is a “benign” result, in the sense that it does not represent a failure, it makes no sense to perform the output check.
Those use cases demonstrate that there must be logic to:

- Chain other actions depending on the test results
- Override test results in later phases
The existing extensible interface in avocado.core.plugin_interfaces.PostTest.post_test_runnables() may be a starting point for such functionality.
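A minimal sketch of how such a component could combine both results, assuming hypothetical result names and that the output check is only performed when needed:

def final_result(execution_result, has_reference, output_matches):
    # fail, error and skip results bypass the output check entirely
    if execution_result in ("fail", "error", "skip"):
        return execution_result
    # a passing test without a recorded reference keeps its result
    if not has_reference:
        return execution_result
    # a passing test can still be overridden by the output check
    return "pass" if output_matches() else "fail"

Here output_matches is a callable, so that the (possibly expensive) comparison is only executed for passing tests that do have a recorded reference.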
Proposed user experience
Users would execute tests in a special mode, provided by the record-output command. Example:
$ avocado record-output /reference/to/a/test
JOB ID : 4098bc8715ce63f8fbbb1385006cb7ce5c34be07
JOB LOG : /home/$USER/avocado/record-output/job-2023-04-25T16.11-4098bc8/job.log
(1/1) /reference/to/a/test: STARTED
(1/1) /reference/to/a/test: PASS (2.31 s)
RESULTS : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML : /home/$USER/avocado/record-output/job-2023-04-25T16.11-4098bc8/results.html
JOB TIME : 1.19 s
The execution of tests that conform to this standard will have the “output-check” feature enabled by default.
Goals of this BluePrint
- Describe the user experience.
- Propose an architecture for the implementation of the “output-check” feature.
- Itemize the expected work for actually implementing the feature.
Backwards Compatibility
Given that the previous implementation has been disabled (along with the legacy runner) for a number of Avocado releases, it’s not expected that support will be provided for running tests (and checking output) against references recorded under the legacy implementation.
The only requirement on users should be to re-record the output for their tests (by using the record-output command presented earlier). From that point on, the feature should be ready for regular test execution (that is, avocado run commands).
Security Implications
None that we can determine at this point.
How to Teach This
The distinctive features should be properly documented.