For a test suite I have, each test has a failure rate between 0 and 1 (where 1 means it fails every time and 0 means it never fails) and a duration in milliseconds (unbounded). I want a metric (probably a multiplication or division of the two) that captures the idea that tests with a low duration have high value, and tests with a high failure rate have high value.
What is the appropriate metric for this?
Aggregate your data in quartiles:
The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set.
This means you will have something like:
| Percentile | FAILED runtime | PASSED runtime |
|------------|----------------|----------------|
| 25%        | 20 seconds     | 2 seconds      |
| 50%        | 30 seconds     | 3 seconds      |
| 75%        | 33 seconds     | 4 seconds      |
This means that 25% of your failed tests run in under 20 seconds, 50% run in under 30 seconds, and so on.
You can then compare the two distributions directly (see the sketch below). Note that other thresholds, such as the 90th, 95th, or even 99th percentile, can sometimes be more useful depending on the frequency of events.
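Here is a minimal sketch of that aggregation in Python, assuming you have a list of (duration, failed) pairs per test run; the sample data below is hypothetical and just mirrors the table above:

```python
import statistics

# Hypothetical sample data: (duration_seconds, failed) per test run.
runs = [
    (20, True), (30, True), (33, True),
    (2, False), (3, False), (4, False),
]

def quartiles(durations):
    """Return the Q1, Q2 (median), and Q3 cut points for a list of durations."""
    return statistics.quantiles(durations, n=4)

failed_durations = [d for d, failed in runs if failed]
passed_durations = [d for d, failed in runs if not failed]

print("FAILED quartiles:", quartiles(failed_durations))
print("PASSED quartiles:", quartiles(passed_durations))
```

For higher percentiles (90th, 95th, 99th), `statistics.quantiles(durations, n=100)` gives percentile cut points you can index into, provided you have enough data points.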
I think a good metric to use would be the “failure rate over time”: if I were to run this test constantly (restarting the test immediately every time it finishes), how frequently would it fail?
To calculate this, just divide the probability of failure (the failure rate, as you’ve defined it) by the duration of the test. For example, if the probability of failure is 0.3, and the duration is 100 milliseconds, then the calculation is:
failure rate over time = (probability of failure) / (duration) = 0.3 / (100 milliseconds) = 0.003 per millisecond
Of course, this is a very small number. You could multiply this by 1,000 to get the failure rate per second, or you could multiply it by 3,600,000 to get the failure rate per hour.
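As a rough sketch, the same calculation in Python (the function name and units are my own choice, not from the question):

```python
def failures_per_hour(failure_rate, duration_ms):
    """Expected failures per hour if the test were re-run back to back.

    failure_rate: probability of failure per run (0 to 1)
    duration_ms:  duration of a single run in milliseconds
    """
    per_ms = failure_rate / duration_ms
    return per_ms * 3_600_000  # 3,600,000 milliseconds in an hour

# Example from above: failure rate 0.3, duration 100 ms
# -> 0.003 failures per ms -> 10,800 failures per hour.
print(failures_per_hour(0.3, 100))
```

Whatever unit you scale to, the ranking of tests is the same: short, flaky tests score highest, which is the property you asked for.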