Evidence-Based AI Marketing Engine

Aleksandr Bagrov

What if every creative decision in your ad campaign had to be backed by evidence? I describe the architecture of an AI marketing engine that decomposes campaigns into probability chains, identifies funnel bottlenecks, generates testable hypotheses, and learns from every experiment — using Contribution Margin as its North Star.

What It Is and What It Is For

At some point, I caught myself thinking about a simple problem: in B2C marketing, far too many decisions are still made at the level of taste, intuition, and gut feeling.

Sometimes this actually works. Strong marketers have visual awareness, a feel for the audience, an understanding of the product, and experience from previous launches. But if we want to build a system that scales, repeats successful discoveries, and learns from mistakes, intuition alone is not enough.

I started thinking about a marketing campaign as a funnel that can be broken down into a sequence of events, metrics, features, and decisions.

A user sees an ad. Stays through the first few seconds. Watches until the middle. Reaches the CTA. Clicks. Lands on the page. Spends time there. Makes a purchase. Brings revenue. Brings margin.

In this logic, advertising becomes a system of causes, effects, and testable hypotheses.

That is how the idea of the Evidence-Based AI Marketing Engine emerged: an AI engine that helps improve the marketing funnel based on data, evidence, and controlled experiments.

[Figure: the funnel as a probability chain. Ad Impression -> [p(click)] -> Ad Click -> [p(hook_pass)] -> Hook Passed (first 3s watched) -> [p(mid_retention)] -> Mid Retention (~50% watched) -> [p(end_reached)] -> End Reached (100% watched) -> [p(end_click)] -> End CTA Click -> [p(purchase/click)] -> Purchase -> [x orders] -> Expected AOV -> [x margin rate] -> Expected Variable Margin -> [- fixed costs] -> Expected CM]

Marketing Strategy as a Funnel

I start from the assumption that any marketing campaign can be represented as a chain of steps:

  • ad impression,

  • click,

  • passing the first few seconds,

  • retention until the middle,

  • watching until the end,

  • click on the final CTA,

  • transition toward purchase,

  • the purchase itself,

  • average order value,

  • variable margin,

  • final Contribution Margin.

When we look at advertising this way, the task changes. I need to understand where exactly the funnel is losing money, which features affect that step, and which change will produce the greatest impact at a reasonable testing cost.

I use Contribution Margin as the North Star metric, because it reflects the business outcome most directly.
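
To make the chain concrete, here is a minimal sketch of how expected CM could be computed from step probabilities. The function name, the step names, and all numbers are assumptions for illustration, not values from a real campaign.

```python
from math import prod

def expected_cm(impressions, step_probs, aov, margin_rate, fixed_costs):
    """Expected Contribution Margin for a funnel modeled as a probability chain."""
    # Chance that one impression survives every step and ends in a purchase.
    p_purchase = prod(step_probs.values())
    expected_orders = impressions * p_purchase
    expected_variable_margin = expected_orders * aov * margin_rate
    return expected_variable_margin - fixed_costs

# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
steps = {
    "click": 0.5,
    "hook_pass": 0.5,
    "mid_retention": 0.5,
    "end_reached": 0.5,
    "end_click": 0.5,
    "purchase_given_click": 0.32,
}
cm = expected_cm(100_000, steps, aov=30.0, margin_rate=0.4, fixed_costs=2_000.0)
print(round(cm, 2))
```

Because the steps multiply, a relative improvement at any single step carries through to the expected number of orders, which is why the weakest step is usually the first place to look.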

Two Key Tasks the System Must Solve

I divide the work of the AI marketing engine into two modes.

The first mode is new campaign generation. Here, the system needs to search for fresh ideas, look at the external context, analyze competitors, capture trends, and propose new funnels. In this mode, it is dangerous to stay locked inside historical data. If the system only sees the past, it will keep reproducing familiar solutions.

The second mode is optimization of an existing funnel. This task is more engineering-oriented. We already have a campaign, a funnel structure, a set of creatives, landing pages, audiences, and metrics. The task is to identify the bottleneck, determine which changes are allowed, estimate their cost, predict the expected effect, and understand how to test the hypothesis.

In this article, I focus specifically on the second mode: how to build a system that improves an existing marketing funnel through evidence-based hypotheses.

Implementing the Funnel Optimization Task

My approach is built around a self-learning cycle.

The system receives raw campaign data, cleans it, enriches it with features, detects patterns, forms hypotheses, launches tests, reads the results, and updates its memory. After each iteration, it understands better which decisions help, which ones hurt, and under which conditions the effect changes.

Inside this cycle, there are several key blocks.

Data Quality & Margin Truth Layer

The first layer is responsible for data quality and the correctness of margin calculation.

This is the foundation of the entire system. If data on costs, revenue, funnel events, and attribution differs across sources, AI will generate elegant hypotheses on top of a weak foundation.

The task of this layer is to bring the data to a state where the system can trust it. Here, the system checks missing values, duplicates, contradictions, time lags, and the correctness of links between impressions, clicks, orders, and margin.
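
As a rough sketch of what such checks might look like, the function below audits daily per-ad rows. The field names and the specific rules are assumptions; a real layer would also need attribution and time-lag checks that are not shown here.

```python
REQUIRED = ("ad_id", "date", "impressions", "clicks", "orders", "revenue", "spend")

def audit_rows(rows):
    """Return (row_index, issue) pairs for basic consistency problems."""
    issues = []
    seen_keys = set()
    for i, row in enumerate(rows):
        missing = [f for f in REQUIRED if row.get(f) is None]
        if missing:
            issues.append((i, "missing: " + ", ".join(missing)))
            continue  # the remaining checks need complete rows
        key = (row["ad_id"], row["date"])
        if key in seen_keys:
            issues.append((i, "duplicate ad_id/date"))
        seen_keys.add(key)
        if row["clicks"] > row["impressions"]:
            issues.append((i, "clicks exceed impressions"))
        if row["orders"] == 0 and row["revenue"] > 0:
            issues.append((i, "revenue without orders"))
    return issues

# Hypothetical rows for illustration.
good = {"ad_id": "a1", "date": "2024-05-01", "impressions": 1000,
        "clicks": 40, "orders": 2, "revenue": 60.0, "spend": 25.0}
rows = [
    good,
    dict(good),  # exact duplicate of the first row
    {**good, "ad_id": "a2", "impressions": 100, "clicks": 120, "orders": 0, "revenue": 0.0},
    {**good, "ad_id": "a3", "revenue": None},
]
issues = audit_rows(rows)
```

Rows that fail the audit would be held back or flagged before any hypothesis is generated from them.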

External Trend & Context Radar

The next layer analyzes the external environment: competitors’ ad campaigns, trends, audience context, new formats, visual patterns, news hooks, and changes in user behavior.

This block is needed so the system does not get trapped inside its own launch history. The external radar helps it understand which new ideas have appeared around the product, category, platform, or audience.


After the raw source data has been collected, the system needs to prepare a comprehensive set of derived features that it will later rely on when forming hypotheses.

Customer Voice Intelligence

This layer turns user feedback into features. Comments, emails, likes, reposts, negative reactions, questions under videos, repeated phrasing, doubts, and objections can all be used as material for analysis.

For example, a user may accidentally formulate a stronger hook in a comment than the internal company team. Or a cluster of comments may reveal that people are responding to an argument that is not present in the ad at all.

Customer Voice Intelligence should extract these signals and convert them into data that the system can work with later.

Experimentation & Causal Layer

This block finds the relationships in the data that later stages depend on. Statistical and causal-analysis algorithms can be applied here, and additional features can also be calculated.

Creative Intelligence Graph

This block looks at the creative as a structured multimodal object.

For the system, a video should be decomposed into understandable elements:

  • hook,

  • body,

  • CTA,

  • visual structure,

  • speaker style,

  • tone,

  • pace,

  • subtitles,

  • promises,

  • proof,

  • emotional arc,

  • format,

  • platform,

  • audience hypothesis,

  • linked landing page,

  • final business result.

It is important to me that the system looks at the campaign through the user’s eyes. It needs to see which moments create interest, trust, skepticism, urgency, relief, or the desire to try the product.

This is the level at which it becomes possible to say: “In this cluster of videos, the product appears too late, and the CTA sounds like a mechanical insertion.”

Post-Mortem Learning Agents

After each test, the system needs a layer that analyzes the result.

Did the hypothesis work or not? Was the effect higher or lower than expected? Did the metric grow because of the intended funnel step, or because of a side effect? Did the improvement in hook_pass_rate lead to growth in CM, or did it simply attract cheaper clicks without purchases?

Post-mortem agents record the conclusions and save them into the system’s memory. This way, every launch becomes part of accumulated experience.

For me, this is one of the most important elements. Without post-mortem, the system will endlessly generate new ideas. With post-mortem, it starts to learn.
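
A minimal sketch of the post-mortem questions above, assuming lifts are stored as relative fractions and the expected effect as a (low, high) range. The function name, the verdict strings, and the numbers are hypothetical.

```python
def post_mortem(expected_range, observed_lifts, target_metric, north_star="CM"):
    """Compare observed lifts with the expected range and record a verdict."""
    low, high = expected_range
    target_lift = observed_lifts[target_metric]
    cm_lift = observed_lifts[north_star]
    if target_lift < low:
        verdict = "target step moved less than expected"
    elif cm_lift <= 0:
        # The step improved, but the business outcome did not follow.
        verdict = "target step improved, CM did not: check click quality"
    elif target_lift > high:
        verdict = "worked, above the expected range"
    else:
        verdict = "worked as expected"
    return {
        "target_metric": target_metric,
        "target_lift": target_lift,
        "cm_lift": cm_lift,
        "verdict": verdict,
    }

# The system's memory is reduced to a plain list in this sketch.
memory = []
memory.append(
    post_mortem((0.08, 0.12), {"hook_pass_rate": 0.10, "CM": -0.01}, "hook_pass_rate")
)
```

Here hook_pass_rate landed inside the expected range, but CM fell, so the record points at click quality rather than counting the test as a win.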

Hypotheses Swarm

This is the core of the entire architecture.

Hypotheses Swarm receives cleaned data, features, analytics, similar cases, brand constraints, and results from previous launches. After that, it generates a package of hypotheses for improving the funnel.

Here, it is important to limit the role of LLM agents. Their task is to gather evidence, propose a specific change, explain the expected effect, identify the risk, and describe the test design.

In my logic, a good hypothesis looks like a testable action. For example:

“Move the product demonstration into the first 1.5 seconds and replace the opening with a problem-first hook. Expected effect: hook_pass_rate growth of 8–12%, end_click_rate growth of 2–4%, expected CM growth of 6–9%. Risk: an aggressive opening may reduce the perception of premium quality. Test: A/B, primary metric: CM per 1,000 impressions, guardrails: purchase CVR and refund rate.”

This kind of hypothesis can already be launched, discussed, calculated, and tested.
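
One simple way to enforce this is to reject any agent output that lacks the parts listed above. The sketch below assumes hypothetical field names that mirror those parts; it is not a fixed schema.

```python
# Hypothetical field names mirroring the parts of a testable hypothesis.
REQUIRED_FIELDS = ("evidence", "change", "expected_effect", "risk", "test_design")

def missing_fields(hypothesis):
    """Return required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not hypothesis.get(f)]

vague_idea = {"change": "Make the video more engaging"}
print(missing_fields(vague_idea))
```

A proposal like the vague idea above would be sent back to the agent instead of entering the test queue.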

Ad Generative Pipeline

This block is responsible for producing new creative variants. In the ideal version, it is a content factory made of agents that can create new versions of videos, texts, landing pages, subtitles, and visual elements based on selected hypotheses.

Architecturally, however, this block can be replaced by a manual process. The main requirement is that every new version of the campaign must be connected to a specific hypothesis and a clear set of changes. If the team changes 10 elements at once without recording the reason, it becomes difficult to understand what exactly influenced the result.

I should also note that nothing prevents the team from launching several campaign tests in parallel. The key point is that each test must be linked to testable hypotheses.

Multimodal Compliance & Quality Gate

Before launch, new materials must pass a quality and compliance check.

This is where the system checks claims made in the ad, visual elements, alignment with brand policy, legal restrictions, generation quality, and consistency between promises and the landing page.

For AI marketing, this is especially important. Generative models can sound convincing while adding claims that the brand is not ready to defend. That is why the compliance gate must be a mandatory layer.

Launch & Budget Experiment Governor

The final layer is responsible for launching experiments. It selects hypotheses, allocates budget, defines testing rules, monitors guardrail metrics, and stops variants that start to harm campaign economics.

A test launch should not be chaotic. Each hypothesis has an expected effect, risk, cost of change, test design, and stopping criteria.
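
As an illustration of stopping criteria, the sketch below flags a variant when a guardrail metric drops too far below the control. It assumes "higher is better" metrics and a fixed minimum sample; the names and thresholds are hypothetical, and a production governor would more likely rely on a proper statistical test.

```python
def breached_guardrails(variant, control, guardrails, min_impressions=5000):
    """Return guardrail metrics where the variant dropped too far below control."""
    if variant["impressions"] < min_impressions:
        return []  # not enough data yet to judge the variant
    breached = []
    for metric, max_relative_drop in guardrails.items():
        baseline = control[metric]
        if baseline > 0 and (baseline - variant[metric]) / baseline > max_relative_drop:
            breached.append(metric)
    return breached

# Hypothetical numbers for illustration.
control = {"impressions": 20000, "purchase_cvr": 0.020, "cm_per_1000": 12.0}
variant = {"impressions": 8000, "purchase_cvr": 0.015, "cm_per_1000": 11.5}
guardrails = {"purchase_cvr": 0.10, "cm_per_1000": 0.10}

stop_reasons = breached_guardrails(variant, control, guardrails)
```

A non-empty result would pause the variant and pass the reason to the post-mortem step.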

System Summary

As a result, we get the following cycle: based on the available data from the initial campaign launch, the core generates hypotheses for funnel optimization. Then, according to these hypotheses, forks of the marketing campaign are created, automatically launched, and tested.

Based on the test results, the data is collected again, the hypotheses are validated, and new hypotheses are formed from the new data. New campaign forks are then generated and tested as well.

The process continues until the target indicators are achieved or the system exhausts its testing resources.

How an Evidence-Based Hypothesis Is Born

Now to the most important part: the mechanism for generating hypotheses.


I start from the assumption that the system already has reliable enriched data. It knows the class of campaign, the structure of the funnel, the entities inside the campaign, the related metrics, and the North Star metric.

  1. We rank features by the degree of their impact on the North Star metric — in this case, Contribution Margin, or CM. After that, the system generally understands which features it will need to work with.
    An example JSON object showing how a list of features with their impact on the North Star metric may look:

    {
      "local_negative_drivers": [
        {
          "feature": "product_visible_before_2s=false",
          "model_effect_on_CM": "-7.4%",
          "global_rank": 3,
          "segment_rank": 1,
          "actionable": true
        },
        {
          "feature": "proof_present=false",
          "model_effect_on_CM": "-4.1%",
          "global_rank": 12,
          "segment_rank": 4,
          "actionable": true
        },
        {
          "feature": "caption_density=high",
          "model_effect_on_CM": "-2.8%",
          "global_rank": 25,
          "segment_rank": 7,
          "actionable": true
        }
      ]
    }


  2. We map the ranked features to each step of the funnel.
    Product visible before 2s affects hook_pass_rate. Proof timing can affect mid_retention and end_click. CTA framing affects end_click and the quality of the transition. Caption density can reduce retention in the middle of the video.
    This is how the system understands where exactly the funnel is losing efficiency.

  3. We identify bottlenecks: funnel steps where changing certain features should produce the largest improvement for the least effort. For this to work well, the system needs to know which features it can influence and which it cannot. I will return to this below.

  4. For the identified bottlenecks, the system creates Evidence Cards. These are entities that contain proposals for specific atomic actions, supported by an evidence base explaining why the action should help.
    Here is an example JSON object:

    {
      "pattern": "product_visible_before_2s",
      "current_value": false,
      "suggested_value": true,
      "evidence": {
        "similar_cases_count": 143,
        "median_CM_lift": "+8.2%",
        "median_hook_pass_lift": "+11.5%",
        "downstream_purchase_CVR_change": "+0.7%",
        "confidence": "medium-high"
      },
      "examples": [
        {
          "video_id": "hist_882",
          "difference": "same hook type, product shown at 1.2s",
          "CM_percentile": 84
        },
        {
          "video_id": "hist_217",
          "difference": "product appears only after 6s",
          "CM_percentile": 31
        }
      ]
    }


  5. For the identified Evidence Cards, the system prepares a list of actions that need to be performed on the current marketing campaign so the metrics change in line with the Evidence Card.

    In essence, this is a list of ideas for how to change the current funnel, what to add to it, which aspects to adjust, and so on, in order to get the desired result.

    If the problem is that the product appears too late, the system may suggest:

    • moving the existing product demonstration to the beginning,

    • adding a packshot overlay in the first 2 seconds,

    • reshooting the opening with the product,

    • rewriting the hook so the product becomes part of the first frame.

    This is what a JSON object with such proposed changes may look like:

    {
      "feature": "product_visible_before_2s",
      "possible_actions": [
        {
          "action_id": "move_product_demo_to_opening",
          "edit_type": "reorder_scene",
          "cost": "low",
          "risk": "low",
          "target_stage": "hook"
        },
        {
          "action_id": "add_packshot_overlay_first_2s",
          "edit_type": "overlay",
          "cost": "low",
          "risk": "medium",
          "target_stage": "hook"
        },
        {
          "action_id": "reshoot_opening_with_product",
          "edit_type": "reshoot",
          "cost": "high",
          "risk": "medium",
          "target_stage": "hook"
        }
      ]
    }


  6. All information produced during this preprocessing stage is passed to agents, which then form the final list of hypotheses.

    Here is an example of the object passed as input:

    {
      "current_video_summary": {
        "creative_type": "UGC educational ad",
        "product": "book",
        "audience": "cold Meta Reels audience",
        "duration": "24s",
        "narrative": "problem -> explanation -> product -> CTA"
      },
      "stage_diagnosis": {
        "main_bottleneck": "hook",
        "secondary_bottleneck": "end CTA",
        "guardrail": "do not reduce purchase CVR by attracting low-quality clicks"
      },
      "local_negative_drivers": [
        "product not visible before 2s",
        "proof appears too late",
        "CTA lacks offer framing",
        "caption density too high in middle"
      ],
      "historical_evidence_cards": [
        "...",
        "..."
      ],
      "allowed_actions": [
        "move existing product scene earlier",
        "rewrite hook text",
        "add social proof overlay",
        "add offer before CTA",
        "reduce middle section by 3 seconds"
      ],
      "brand_constraints": [
        "no exaggerated claims",
        "premium but direct tone",
        "no fake urgency"
      ],
      "output_schema": {
        "hypothesis": "string",
        "target_stage": "string",
        "edit_instruction": "string",
        "expected_metric_delta": "object",
        "expected_CM_delta": "string",
        "evidence": "array",
        "risk": "string",
        "test_design": "string"
      }
    }


  7. Based on this data, the agents form a list of hypotheses. Each hypothesis is an object like this:

    {
      "hypothesis": "Move the product demonstration into the first 1.5 seconds and replace the abstract opening with a problem-first hook",
      "target_stage": "hook_pass_rate",
      "expected_effect": {
        "hook_pass_rate": "+8–12%",
        "end_click_rate": "+2–4%",
        "expected_CM_delta": "+6–9%"
      },
      "evidence": [
        "Across 143 historically similar videos, having the product visible before 2 seconds correlated with higher CM",
        "In 38 contrastive pairs for similar products, a problem-first hook delivered better downstream click quality",
        "The current video is in the bottom 20% for hook_pass_rate compared to a similar cluster"
      ],
      "confidence": "medium-high",
      "risk": "It may reduce perceived premium quality if the opening becomes too aggressive",
      "test": "A/B test: current version vs. variant A; primary metric: CM per 1,000 impressions; guardrails: purchase CVR and refund rate"
    }
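
Looking back at step 3, bottleneck detection can be sketched as a comparison of the campaign's step rates with a benchmark from a similar cluster. This is only one possible approach; the function name, step names, and rates below are assumptions.

```python
def find_bottlenecks(step_rates, benchmark_rates, top_n=2):
    """Rank funnel steps by relative shortfall against a benchmark cluster."""
    gaps = []
    for step, rate in step_rates.items():
        bench = benchmark_rates[step]
        shortfall = (bench - rate) / bench if bench > 0 else 0.0
        gaps.append((step, shortfall))
    # Largest shortfall first; steps at or above benchmark are dropped.
    gaps.sort(key=lambda g: g[1], reverse=True)
    return [step for step, shortfall in gaps[:top_n] if shortfall > 0]

# Hypothetical step rates for the current video and for a similar cluster.
current = {"hook_pass": 0.30, "mid_retention": 0.50, "end_click": 0.04, "purchase": 0.03}
cluster = {"hook_pass": 0.45, "mid_retention": 0.52, "end_click": 0.05, "purchase": 0.03}

bottlenecks = find_bottlenecks(current, cluster)
```

With these numbers the hook is the main bottleneck and the end CTA the secondary one, which is the kind of diagnosis passed to the agents in step 6.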


The resulting list of hypotheses can be ranked by different criteria. Other models can act as scoring agents, but we can also score each hypothesis directly from its expected performance metrics, such as expected CM delta, confidence, and cost of change.

After scoring, the system selects the top_n hypotheses and passes them into production or to the manual team.
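
The article does not fix a scoring formula, so the following is only one possible sketch: the midpoint of the expected CM delta, weighted by confidence and divided by a cost penalty. All weights, field names, and example values are assumptions.

```python
# Hypothetical weights; in practice these would be calibrated on past tests.
CONFIDENCE_WEIGHT = {"low": 0.3, "medium": 0.5, "medium-high": 0.7, "high": 0.9}
COST_PENALTY = {"low": 1.0, "medium": 2.0, "high": 4.0}

def score(hypothesis):
    """Confidence-weighted expected CM lift per unit of cost."""
    low, high = hypothesis["expected_CM_delta"]
    midpoint = (low + high) / 2
    weight = CONFIDENCE_WEIGHT[hypothesis["confidence"]]
    return midpoint * weight / COST_PENALTY[hypothesis["cost"]]

def select_top_n(hypotheses, top_n):
    """Return the top_n hypotheses by score, best first."""
    return sorted(hypotheses, key=score, reverse=True)[:top_n]

candidates = [
    {"id": "move_product_demo_to_opening", "expected_CM_delta": (0.06, 0.09),
     "confidence": "medium-high", "cost": "low"},
    {"id": "reshoot_opening_with_product", "expected_CM_delta": (0.10, 0.14),
     "confidence": "medium", "cost": "high"},
    {"id": "add_offer_before_cta", "expected_CM_delta": (0.03, 0.05),
     "confidence": "high", "cost": "low"},
]
selected = [h["id"] for h in select_top_n(candidates, top_n=2)]
```

In this example the reshoot promises the largest lift but loses to two cheaper edits once cost and confidence are taken into account.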

As a result of this process, the output is a set of hypotheses that are already concrete actions to be applied to the existing marketing campaign, with an expected effect. In addition, each hypothesis is supported by an evidence base and a conclusion explaining why it should work.

But where does this evidence base come from, and which tools are used to form it? More on that below.

Where Does the Evidence Base Come From?

Evidence Cards should not appear out of thin air. I see several analytical layers that can feed this system.

Predictive Layer

With a sufficient volume of historical data, we can train simple predictive models to forecast values for each step of the funnel.

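For example, for the funnel:

    impression -> click -> hook pass -> mid retention -> end reached -> end CTA click -> purchase

We get a family of models, one per step, each conditioned on the campaign features:

    p(click), p(hook_pass), p(mid_retention), p(end_reached), p(end_click), p(purchase/click)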

These models help estimate how a change in a feature may affect the funnel. For example, what will happen to hook_pass_rate and expected CM if we shorten the opening, add proof earlier, or change CTA framing.
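
As a deliberately simple stand-in for a trained model, the sketch below estimates one step rate per feature value with additive smoothing. The function, field names, and counts are assumptions, and the result is a correlation, not a causal effect.

```python
from collections import defaultdict

def fit_step_rates(records, feature, step_in, step_out,
                   prior_rate=0.5, prior_weight=10.0):
    """Estimate p(step_out | step_in) per feature value with additive smoothing."""
    entered = defaultdict(float)
    passed = defaultdict(float)
    for r in records:
        entered[r[feature]] += r[step_in]
        passed[r[feature]] += r[step_out]
    # The prior pulls sparse groups toward prior_rate instead of 0 or 1.
    return {
        value: (passed[value] + prior_rate * prior_weight) / (entered[value] + prior_weight)
        for value in entered
    }

# Hypothetical per-video counts.
history = [
    {"product_visible_before_2s": True, "clicks": 500, "hook_passed": 230},
    {"product_visible_before_2s": True, "clicks": 490, "hook_passed": 215},
    {"product_visible_before_2s": False, "clicks": 600, "hook_passed": 180},
    {"product_visible_before_2s": False, "clicks": 390, "hook_passed": 115},
]
rates = fit_step_rates(history, "product_visible_before_2s", "clicks", "hook_passed")
```

The gap between the two estimated rates is the kind of signal that the next layer then has to test for causality.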

Causal / uplift Layer

Predictive models provide a forecast. Causal and uplift approaches help estimate the effect of an action more carefully.

We divide features in historical data into classes.

  1. Actionable features
    Features that the system can update and change:

    • hook style,

    • CTA,

    • pacing,

    • product shot,

    • subtitles,

    • offer framing.

  2. Context features
    Features the system cannot influence within the current marketing campaign:

    • platform,

    • placement,

    • product,

    • price.

  3. Confounders
    External features that can influence experiment results:

    • budget — if two experiments are run with different budgets, their results will also differ,

    • seasonality,

    • retargeting / warm audience,

    • external environment, for example Covid-19.

  4. Outcome features
    These features are campaign results. The system cannot set them directly; it can only influence them through actionable features:

    • clicks,

    • conversions,

    • revenue.
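
A minimal sketch of how this taxonomy could be encoded so that later analysis knows which columns to treat as interventions, which to adjust for, and which to predict. The enum, the mapping, and the role names are assumptions for illustration.

```python
from enum import Enum

class FeatureClass(Enum):
    ACTIONABLE = "actionable"
    CONTEXT = "context"
    CONFOUNDER = "confounder"
    OUTCOME = "outcome"

# Hypothetical registry covering a few of the features listed above.
FEATURE_CLASSES = {
    "hook_style": FeatureClass.ACTIONABLE,
    "cta": FeatureClass.ACTIONABLE,
    "pacing": FeatureClass.ACTIONABLE,
    "platform": FeatureClass.CONTEXT,
    "price": FeatureClass.CONTEXT,
    "budget": FeatureClass.CONFOUNDER,
    "seasonality": FeatureClass.CONFOUNDER,
    "clicks": FeatureClass.OUTCOME,
    "revenue": FeatureClass.OUTCOME,
}

def split_roles(columns):
    """Group columns into treatments, adjustment variables, and outcomes."""
    roles = {"treatments": [], "adjust_for": [], "outcomes": []}
    for name in columns:
        cls = FEATURE_CLASSES[name]  # unknown features fail loudly (KeyError)
        if cls is FeatureClass.ACTIONABLE:
            roles["treatments"].append(name)
        elif cls is FeatureClass.OUTCOME:
            roles["outcomes"].append(name)
        else:
            roles["adjust_for"].append(name)
    return roles

roles = split_roles(["hook_style", "platform", "budget", "revenue"])
```

Treating context features and confounders together as adjustment variables is a simplification; a fuller setup would model them separately.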

After that, we can give the system several heterogeneous algorithms for analyzing feature impact and identifying relationships between features. For example, the following tools can be used:

  • DoWhy — a Python library for causal inference that supports explicit modeling and testing of causal assumptions.

  • EconML — an open-source Python package developed by the ALICE team at Microsoft Research for estimating individualized causal responses.

  • SHAP — an approach for explaining model outputs using Shapley values.

These methods allow us to identify the influence of features and the relationships between them. This is critically important for the system when building Evidence Cards.

Retrieval Layer

The Retrieval Layer searches for similar cases in historical data.

Contrastive pairs are especially valuable: similar campaigns where one important difference led to different results. For example, two videos with a similar product, format, and audience, but in one video the product is shown at 1.2 seconds, while in the other it appears only after 6 seconds.

Such pairs provide material for Evidence Cards. They help the system see not only successful decisions, but also decisions that performed weaker under similar conditions.
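
A minimal sketch of contrastive-pair retrieval, assuming each historical video is stored as a flat record: two videos form a pair when they match on all context features and differ in exactly one actionable feature. The attribute values and the third video ID are hypothetical.

```python
from itertools import combinations

def contrastive_pairs(videos, context_keys, actionable_keys):
    """Find pairs with identical context and exactly one differing actionable feature."""
    pairs = []
    for a, b in combinations(videos, 2):
        if any(a[k] != b[k] for k in context_keys):
            continue  # different context, so the pair is not comparable
        diffs = [k for k in actionable_keys if a[k] != b[k]]
        if len(diffs) == 1:
            pairs.append((a["video_id"], b["video_id"], diffs[0]))
    return pairs

# Hypothetical records for illustration.
videos = [
    {"video_id": "hist_882", "platform": "reels", "product": "book",
     "hook_type": "problem_first", "product_visible_before_2s": True},
    {"video_id": "hist_217", "platform": "reels", "product": "book",
     "hook_type": "problem_first", "product_visible_before_2s": False},
    {"video_id": "other_001", "platform": "shorts", "product": "book",
     "hook_type": "problem_first", "product_visible_before_2s": False},
]
pairs = contrastive_pairs(
    videos,
    context_keys=["platform", "product"],
    actionable_keys=["hook_type", "product_visible_before_2s"],
)
```

Exact matching is the strictest possible rule; a real retrieval layer would probably use similarity thresholds or embeddings so that near-identical campaigns also qualify.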

Of course, each of these data layers is itself a fairly complex data analysis solution. But together, they give the system the flexibility it needs when forming hypotheses.

Why the System Needs LLM Agents at All

LLM agents are useful where analytics, creativity, and the language of action need to be connected.

A model can read Evidence Cards, assemble patterns, account for brand constraints, propose different editing options, formulate a test, and prepare a clear explanation for the marketer.

The agent operates inside the evidence-based loop and gives the system the flexibility it needs.

Where I See the Main Value of the Approach

I like this architecture because of its practical discipline. It forces us not to argue about creative taste, but to ask more useful questions:

  • which funnel step are we improving,

  • which feature do we want to change,

  • which Evidence Card supports this decision,

  • what effect do we expect,

  • what risk are we accepting,

  • how will we know that the hypothesis worked,

  • what should the system remember after the test.

This makes marketing closer to an engineering process. There is still room for creativity, but creativity becomes connected to data, experiments, and economics.

What Still Needs to Be Done

This approach still has open questions.

The first question is what full set of features is needed for different classes of campaigns. The feature set will differ for UGC videos, landing pages, email funnels, performance creatives, and product demos.

The second question is how deeply each architectural block should be designed. Data Quality, Creative Intelligence Graph, Retrieval Layer, and Experiment Governor can each become large systems in their own right.

The third question is how to adapt the method to different types of funnels. I describe the approach using marketing as an example, but the logic itself can be applied more broadly. Wherever there is a sequence of events, conversions, actions, historical data, and a measurable result, it is possible to build a similar evidence-based engine.

For me, the Evidence-Based AI Marketing Engine currently looks like a working framework. It is a way to bring AI, analytics, creativity, experiments, and a business metric into a single system. It is a mechanism that gradually learns to understand why a specific marketing funnel earns more or less money, and which actions are actually worth testing next.
