5 Days to 1 Hour: The AI Experimentation Playbook for Product Teams

Three years ago, an experiment took five days to go live once the hypothesis was settled, and analysis added another ten. The team ran eight to ten experiments a year, each handled as a major launch with multiple sign-offs, piles of paperwork, and cross-team coordination.

Today the same team launches experiments in under an hour and reads results within a day. In the first twelve months after rebuilding the infrastructure, it ran twenty tests. The difference between those two realities is not sophistication - it is experimentation velocity: how fast a team can test, learn, and iterate.

The real bottleneck was not engineering resources

When we mapped the experiment lifecycle, everyone blamed engineering capacity. The actual problem was the coordination layer. Every experiment was configured manually, custom logging had to be written for each new metric, treatment and control groups had to be deployed separately, and data had to be pulled by hand from several systems before any analysis could begin.

One test made this painfully clear. We wanted to evaluate an advertiser-facing budget recommendation that adjusted its guidance threshold based on recent performance. Simple idea. In practice, it required coordinating the recommendation service, the UI surface, the experimentation system's traffic assignment, and the analytics pipeline measuring spend, conversions, and downstream retention. By the time the test launched, seasonal marketplace conditions had shifted and the original threshold assumption was already invalid. We ran an experiment that answered a question we no longer needed to ask.

The coordination tax was massive. Product managers spent hours writing specs for engineers, who then spent days building infrastructure that should have been provisioned automatically.

Three infrastructure changes that made one-hour experiment launches possible

The shift came down to three specific decisions. First, a self-service, template-based experiment system: product managers configure experiments through a dashboard instead of writing specs, and the system handles variant assignment, traffic allocation, and metric instrumentation automatically.
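
As a rough illustration of what such a template can look like, here is a minimal sketch in Python. The schema, field names, and the hashing-based assignment helper are assumptions for illustration, not the team's actual system.

```python
import hashlib

# Hypothetical declarative config a product manager might fill in from a
# dashboard; every field name here is illustrative, not the real schema.
EXPERIMENT = {
    "name": "budget_recommendation_threshold_v2",
    "variants": {"control": 0.5, "treatment": 0.5},        # traffic split
    "metrics": ["adoption_rate", "spend", "conversions"],   # default metric set
    "guardrails": ["advertiser_retention_7d"],
}

def assign_variant(experiment: dict, unit_id: str) -> str:
    """Deterministically map an advertiser ID to a variant by hashing,
    so every service agrees on assignment without extra coordination."""
    digest = hashlib.sha256(f"{experiment['name']}:{unit_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000            # uniform in [0, 1)
    cumulative = 0.0
    for variant, share in experiment["variants"].items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return "control"  # fallback for floating-point edge cases
```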

Second, separating experiment activation from feature deployment. Feature flags let the team ship code and switch experiments on without another release. That single change removed the slowest step in the old process.
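
A minimal sketch of the pattern, reusing the hypothetical assign_variant helper and EXPERIMENT config from the sketch above; the flag store, function names, and threshold values are assumptions, not a specific vendor's API.

```python
# Code ships dark behind a flag; flipping "enabled" in config starts the
# experiment with no new release. Flag names and numbers are illustrative.
FLAGS = {"adaptive_budget_threshold": {"enabled": False}}

def guidance_threshold(advertiser_id: str, recent_performance: float) -> float:
    flag = FLAGS["adaptive_budget_threshold"]
    if flag["enabled"] and assign_variant(EXPERIMENT, advertiser_id) == "treatment":
        # Treatment: scale the guidance threshold by recent performance.
        return 100.0 * max(0.5, min(recent_performance, 2.0))
    return 100.0  # control and dark-launch default: the existing static threshold
```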

Third, a unified metrics layer. Instead of instrumenting custom logging case by case, every system records a default set of metrics, and custom metrics are configured rather than coded.
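
One way to picture "configured, not coded": a metric declared as configuration over events the platform already logs by default. The event names, fields, and MetricDef structure below are assumptions for illustration.

```python
from dataclasses import dataclass, field

# A custom metric is declared as configuration over default event streams,
# so no per-experiment logging code is needed. Names are illustrative.
@dataclass
class MetricDef:
    name: str
    event: str                 # default event stream already being logged
    numerator_filter: dict = field(default_factory=dict)

ADOPTION_RATE = MetricDef(
    name="recommendation_adoption_rate",
    event="recommendation_shown",
    numerator_filter={"action": "accepted"},
)

def compute_rate(events: list[dict], metric: MetricDef) -> float:
    """Share of the metric's base events matching its numerator filter."""
    base = [e for e in events if e.get("event") == metric.event]
    hits = [e for e in base
            if all(e.get(k) == v for k, v in metric.numerator_filter.items())]
    return len(hits) / len(base) if base else 0.0
```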

The engineering cost was real - roughly a quarter to reach a working version, with hardening afterward. The hardest part was not the dashboard or the feature-flag plumbing. It was converging on a shared measurement contract: what success means for AI-powered advertiser features, and keeping those metric definitions consistent across services.

Analyzing data faster and with more statistical rigor

The drop in analysis time came from changing the workflow, not just the tooling. Instead of waiting for experiments to finish before looking at findings, the team monitored results in real time through automated scorecards. Automated guardrail metrics surfaced unexpected regressions in core metrics early enough to make faster decisions about whether to continue, iterate, or stop a test.
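
As a back-of-envelope illustration of what a guardrail check can look like, assuming the guardrail is a proportion metric such as 7-day retention. A real scorecard would also correct for repeated peeking (sequential testing), which this sketch ignores.

```python
import math

def guardrail_regressed(control_success: int, control_total: int,
                        treatment_success: int, treatment_total: int,
                        z_threshold: float = 2.0) -> bool:
    """Flag the treatment only when a core proportion metric drops by a
    statistically meaningful amount (two-proportion z-test)."""
    p_c = control_success / control_total
    p_t = treatment_success / treatment_total
    pooled = (control_success + treatment_success) / (control_total + treatment_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / treatment_total))
    z = (p_t - p_c) / se if se else 0.0
    return z < -z_threshold

# e.g. guardrail_regressed(4200, 5000, 4020, 5000) -> True (retention dropped)
```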

The first test on the new system compared two variants of an AI recommendation card: one showed a plain-language explanation of the "why" with a confidence qualifier, and the other showed only the action. It launched in under an hour. Signal arrived in under a day. With no instrumentation to negotiate and no custom analysis to write, the team trusted the process. That first win built momentum.

What twenty experiments revealed about AI recommendation features

Running twenty experiments produced more understanding than years of careful, sophisticated one-off tests. Three findings reshaped how the team builds advertiser-facing AI capabilities:

Explanations drive adoption, but only short, precise ones.

  • Adding a one-sentence "why you are seeing this" with a supporting fact raised action rates. Longer explanations reduced engagement and increased dismissals. Clarity builds trust, not word count.

Personalization matters for the guardrails, not just the recommendation.

  • Agencies and sophisticated advertisers did not react the same way as smaller sellers. The same recommendation was helpful for one segment and noise for another. Recommendation thresholds and filtering logic had to be tuned by intent and maturity, not just predicted lift.

Frequency and timing matter as much as model quality.

  • Fewer recommendations, surfaced at the right moment, achieved a higher overall success rate than a larger volume of nominally relevant recommendations shown too often. Interruptions carry a real cost in advertiser workflows.

High velocity also lowered the stakes on any single experiment. When launch takes an hour, smaller and more focused tests become feasible. The arithmetic is simple: when the goal is to learn fast, twenty tests at 70% confidence in your hypothesis beat two tests at 95% confidence.
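
Read as a back-of-envelope expected-learnings calculation, under the simplifying assumption that "confidence" approximates the chance a test produces a usable learning:

```python
# Simplifying assumption: "confidence" ~ probability a test yields a usable learning.
fast_lane = 20 * 0.70   # twenty small tests -> ~14 expected learnings a year
slow_lane = 2 * 0.95    # two careful tests  -> ~1.9 expected learnings a year
print(fast_lane, slow_lane)
```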

The cultural change was harder than the technical one

Product managers were initially wary of the self-service model, worried they would make mistakes without engineering review. Engineers worried about losing control over what shipped. Both concerns were addressed through gradual rollout: low-risk experiments first, explicit guidelines on which changes still required consultation, and investment in teaching statistical concepts rather than just tool usage.

This required executive support. Leadership had to treat failed experiments as learning rather than wasted time. That cultural shift - celebrating fast learning over slow perfection - mattered as much as any infrastructure change.

Velocity builds a learning flywheel. The data from each test informs better hypotheses for the next. For product teams working on AI systems, where user behavior feeds back into algorithmic outputs in complicated ways, this compounding effect is no longer optional - it is the only reliable way to find out what actually works.

Rachid Achaoui