Let me put on my QA hat for a second, then. For a typical ML problem, we want to establish a benchmark. That comes from existing algorithms, competitive approaches, and/or previous releases. What we can then say is how well the new release compares with the benchmark. That is what happened in 2012 with AlexNet, when we realized that AlexNet dropped the classification error rate significantly compared with previous approaches. That is how we knew we had a major breakthrough.

The second question is what data to use to measure performance (the accuracy of the model). For that, you will likely need to work with the ML team to create a testing dataset. This dataset should never be used for model development or tuning. It will likely be in the 10K range or above (depending on how easily you can collect the data and the scale of the problem). You run the test to determine the accuracy. This serves as a final sanity check to reject a release, but never use it to improve a model.

So just like regular QA testing, the quality of your testing now depends on your data coverage (instead of case coverage): the better the coverage, the better your testing. Of course, this is easier said than done. And you don't allow the engineers to use this data for development. For a live system, though, you have a good chance to collect live data; you just need to sample it randomly and make sure the data reflects what you will encounter now or in the future. In addition, you should continue analyzing your failing cases and reviewing your data coverage. Of course, then you need to recreate the benchmark as well.

So the whole testing scenario is data-driven rather than use-case-driven, in a certain sense. And the acceptance test is more like asking whether you have an unexplainable or a reasonable drop in performance. ML is far less "deterministic" compared with traditional QA, so adjustments are needed.
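To make the idea concrete, here is a minimal sketch of that workflow: randomly sample live data into a held-out test set, measure accuracy on it, and accept or reject a release based on the drop relative to the benchmark. All function names and the tolerance value are hypothetical; the real pipeline, dataset sizes, and thresholds would depend on your system.

```python
import random

def build_holdout(live_samples, holdout_size, seed=42):
    """Randomly sample live data into a held-out test set.
    This set must never be fed back into model development or tuning."""
    rng = random.Random(seed)
    return rng.sample(live_samples, holdout_size)

def accuracy(model_predict, test_set):
    """Fraction of (input, label) pairs the model labels correctly."""
    correct = sum(1 for x, label in test_set if model_predict(x) == label)
    return correct / len(test_set)

def acceptance_check(new_accuracy, benchmark_accuracy, max_drop=0.01):
    """Reject the release if accuracy falls more than max_drop below
    the benchmark; small fluctuations are tolerated because ML results
    are not fully deterministic."""
    return new_accuracy >= benchmark_accuracy - max_drop
```

The `max_drop` tolerance is the "reasonable drop" from above: instead of demanding bit-for-bit identical results, you only reject when the degradation is larger than normal run-to-run variation would explain.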
Good luck, and I'm glad you are thinking about this in detail. As we move beyond research, those questions will likely become significant.