A.I. systems are complex. They are trained on massive datasets and adapt their behavior on the fly, so their real impact on users is often unclear until the system is deployed in the real world. In this sense, the field of A.I. is similar to medicine, where the impact of a drug is not known until it is tested on real people. Yet, while in medicine the FDA demands that every drug undergo a rigorous evaluation via a randomized clinical trial, most A.I. practitioners employ very weak evaluation procedures that rely on offline metrics such as precision, recall, and AUC.
In this talk, I will discuss the issues resulting from this lack of rigor, and show how A/B testing can help detect bias, catch performance issues, obtain deeper insights, and ensure a better user experience when developing and deploying A.I. systems.
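As a minimal illustration of the kind of online comparison the talk describes (not code from the talk itself), the sketch below runs a two-proportion z-test on a click-through metric from a hypothetical A/B test, where group A sees the current A.I. system and group B sees a new variant. All numbers are invented for illustration.

```python
# Hypothetical A/B test: compare a click-through metric between control (A)
# and treatment (B) with a two-sided two-proportion z-test. Stdlib only.
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: p_a == p_b."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF (computed with erf).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented data: 1,200 clicks from 10,000 users in A; 1,320 from 10,000 in B.
z, p = two_proportion_z_test(1200, 10000, 1320, 10000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

In practice, as the talk emphasizes, a single significance test is only the starting point: guardrail metrics, segment-level analysis, and careful experiment design are what surface bias and performance issues.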
I will share examples, pitfalls, and lessons learned from over a decade of experience applying A/B testing to evaluate A.I. systems at Outreach and Microsoft. These lessons will help anyone who wants to apply A/B testing to A.I. scenarios do it correctly and maximize what they learn.
Pavel Dmitriev (Outreach)
VP of Data Science at Outreach, where he works on enabling data-driven decision making in sales through experimentation and machine learning. He was previously a Principal Data Scientist on Microsoft's Analysis and Experimentation team, where he worked on scaling experimentation in Bing, Skype, and Windows OS. His work has been presented at a number of international conferences, including KDD, ICSE, WWW, CIKM, BigData, and SEAA. Pavel received a Ph.D. in Computer Science from Cornell University.