Marketing Attribution App Development

Build marketing attribution platforms with Scrums.com. Teams for attribution pipelines, identity resolution, and consent-aware routing. Deploy in 21 days.

Companies building attribution platforms face the hardest data engineering problem in martech: connecting ad spend to revenue across channels where each vendor claims credit, identities fragment across devices, and third-party cookies no longer close the loop. Scrums.com engineers dedicated attribution platform teams for growth-stage SaaS, adtech vendors, and FinTech companies that need to own their measurement stack rather than rent it from a vendor with conflicting incentives.

Attribution Data Pipeline Architecture

The foundation is a canonical event ingestion layer: a first-party server-side collection endpoint that accepts page views, product events, and conversion signals, normalises them to a versioned schema, and appends them to an immutable raw_events ledger in the data warehouse (BigQuery, Snowflake, or Redshift). Because the endpoint is served from the first-party domain, it is far less exposed to ad blockers and can set server-side cookies with a lifetime of up to 2 years, which is critical for accurate attribution in a cookieless environment.

Every event carries a session_id, anonymous_id, user_id (post-authentication), utm_* parameters, and a consent_categories array so that downstream routing respects GDPR/CCPA lawful basis per event type.
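As a minimal sketch of that event envelope, assuming a Python collection service: the class name RawEvent, the schema_version and received_at fields, and the normalise helper are hypothetical. Only session_id, anonymous_id, user_id, the utm_* parameters, and consent_categories come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RawEvent:
    """Hypothetical versioned envelope for one collected event."""
    event_name: str
    session_id: str
    anonymous_id: str
    consent_categories: tuple[str, ...]
    user_id: Optional[str] = None            # only known post-authentication
    utm: dict[str, str] = field(default_factory=dict)
    schema_version: int = 1                  # assumed versioning scheme
    received_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def normalise(payload: dict) -> RawEvent:
    """Map a loosely shaped collector payload onto the canonical schema."""
    for required in ("event_name", "session_id", "anonymous_id"):
        if not payload.get(required):
            raise ValueError(f"missing required field: {required}")
    # Keep only non-empty utm_* parameters.
    utm = {k: str(v) for k, v in payload.items() if k.startswith("utm_") and v}
    # Consent is captured per event, de-duplicated and ordered for stable storage.
    consent = tuple(sorted(set(payload.get("consent_categories", []))))
    return RawEvent(
        event_name=payload["event_name"],
        session_id=payload["session_id"],
        anonymous_id=payload["anonymous_id"],
        consent_categories=consent,
        user_id=payload.get("user_id"),
        utm=utm,
    )
```

Rejecting malformed payloads at the edge keeps the raw_events ledger append-only without needing later corrections.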

A streaming enrichment layer (Kafka + Python consumer) resolves channel metadata (campaign name, ad set, creative) by joining against the campaign registry. Enriched events are passed in parallel to a destination forwarder that feeds the Meta Conversions API, Google Ads Enhanced Conversions, and any server-side GTM tag in near real time, without duplicating the browser pixel. This architecture keeps paid channel measurement working after platform-enforced signal loss.
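The registry join at the heart of the enrichment step can be sketched as a pure function. This assumes the registry is keyed by utm_campaign and that unmatched events are kept with placeholder metadata; both choices, and the names enrich, UNKNOWN, and registry_hit, are illustrative. The Kafka consumer wiring is deliberately left out.

```python
from typing import Mapping

# Hypothetical registry shape: utm_campaign -> channel metadata.
CampaignRegistry = Mapping[str, Mapping[str, str]]

UNKNOWN = {"campaign_name": "unknown", "ad_set": "unknown", "creative": "unknown"}


def enrich(event: dict, registry: CampaignRegistry) -> dict:
    """Return a copy of the event with channel metadata attached."""
    key = event.get("utm_campaign")
    meta = registry.get(key, UNKNOWN) if key else UNKNOWN
    enriched = dict(event)                     # never mutate the raw event
    enriched["channel"] = dict(meta)
    enriched["registry_hit"] = meta is not UNKNOWN
    return enriched
```

Keeping the function free of I/O means the same logic can be unit-tested and replayed over historical events.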

Attribution models are implemented as dbt models that transform the raw event ledger into channel-credited conversion rows. Keeping attribution in the warehouse (never computed on live query paths) means model changes are zero-downtime: replace a dbt model, re-run the materialisation, and the updated credits are available to the BI layer without touching application code.

Identity Resolution and Stitching

Attribution accuracy depends entirely on the quality of the identity graph. A user who sees a LinkedIn ad on their work laptop, clicks a retargeting ad on mobile, and converts on desktop will appear as three anonymous users unless the system can stitch them into one journey.

The identity graph stores edges between identifiers (anonymous_id, email, E.164-formatted phone, device_id, hashed_email), each with a confidence_score (0.0 to 1.0) and a match_reason enum (deterministic_email, deterministic_phone, probabilistic_device_graph, cross_device_vendor). Edges below 0.85 confidence are marked review_required rather than merged automatically. Every merge decision is written to a merge_audit log so that erasure requests can cascade correctly and the identity graph can be unwound if a false match is detected.
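One way to picture the merge rule is a small union-find structure with the 0.85 threshold applied at edge insertion. This is a sketch, not the production design: the class and method names are hypothetical, and only confidence_score, match_reason, the threshold, and the merge_audit log come from the description above.

```python
from dataclasses import dataclass

AUTO_MERGE_THRESHOLD = 0.85  # edges below this are held for review


@dataclass(frozen=True)
class Edge:
    left: str
    right: str
    confidence_score: float
    match_reason: str


class IdentityGraph:
    def __init__(self) -> None:
        self._parent: dict[str, str] = {}
        self.merge_audit: list[dict] = []
        self.review_queue: list[Edge] = []

    def _find(self, node: str) -> str:
        """Return the cluster root for a node, registering it if unseen."""
        self._parent.setdefault(node, node)
        while self._parent[node] != node:
            self._parent[node] = self._parent[self._parent[node]]  # path halving
            node = self._parent[node]
        return node

    def add_edge(self, edge: Edge) -> bool:
        """Merge when confident enough; otherwise queue the edge for review."""
        if edge.confidence_score < AUTO_MERGE_THRESHOLD:
            self.review_queue.append(edge)
            return False
        root_a, root_b = self._find(edge.left), self._find(edge.right)
        if root_a != root_b:
            self._parent[root_b] = root_a
        self.merge_audit.append({"edge": edge, "merged_into": root_a})
        return True

    def same_person(self, x: str, y: str) -> bool:
        return self._find(x) == self._find(y)
```

Union-find cannot undo a merge in place, so unwinding a false match in this sketch would mean rebuilding the graph by replaying merge_audit without the offending edge.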

Deduplication uses a source_event_id + source_system composite key at ingestion: a pixel event and a CRM-recorded conversion for the same user action resolve to a single credited conversion rather than double-counting. The deduplication record is retained even after GDPR erasure (stripped of PII but preserving the dedup key) so that re-ingestion of the same event cannot create phantom conversions.
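A minimal sketch of the composite-key check, assuming the pixel and the CRM both report the same source_system and source_event_id for a given user action, and that neither value is itself PII. The dedup_key helper and DedupLedger class are hypothetical, and an in-memory set stands in for a warehouse table.

```python
import hashlib


def dedup_key(source_system: str, source_event_id: str) -> str:
    """Stable hash of the composite key, safe to retain after erasure."""
    # The unit separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    raw = f"{source_system}\x1f{source_event_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class DedupLedger:
    """In-memory stand-in for the retained deduplication records."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def admit(self, source_system: str, source_event_id: str) -> bool:
        """Return True the first time a key is seen, False on any repeat."""
        key = dedup_key(source_system, source_event_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Because only the hash is stored, a replayed event is still rejected after the original row has been stripped of PII.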

Cross-device identity enrichment from third-party vendors (LiveRamp, Tapad) is treated as probabilistic and given lower confidence scores than deterministic first-party signals. A device_sharing_flag is set on edges where multiple high-confidence identities are detected on a single device, preventing household members from contaminating each other's attribution paths. Scrums.com builds this identity resolution infrastructure through our mobile app development and data engineering service.

Attribution Models and Revenue Measurement

Most teams outgrow last-click attribution during their first budget review, but they do not always know what to replace it with. The attribution engine supports a family of models as composable dbt transformations.

Rules-based models (first-click, last-click, linear, time-decay, position-based) are computed for all teams as a baseline. They are computationally cheap, interpretable, and useful for detecting directional trends even when sample sizes are small.
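The five rules-based models differ only in how they weight the touches on a path. The sketch below shows those weightings in Python for illustration, although the platform described here implements them as dbt models. The 40/20/40 position-based split and the 7-day half-life are common defaults assumed here, not values from the source, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

Touch = tuple[str, datetime]  # (channel, timestamp), ordered oldest first


def touch_weights(touches: list[Touch], converted_at: datetime, model: str,
                  half_life_days: float = 7.0) -> list[float]:
    """Return one normalised weight per touch for the chosen model."""
    n = len(touches)
    if n == 0:
        return []
    if model == "first_click":
        raw = [1.0] + [0.0] * (n - 1)
    elif model == "last_click":
        raw = [0.0] * (n - 1) + [1.0]
    elif model == "linear":
        raw = [1.0] * n
    elif model == "time_decay":
        # Credit halves for every half_life_days between touch and conversion.
        raw = [0.5 ** ((converted_at - ts).total_seconds() / 86400 / half_life_days)
               for _, ts in touches]
    elif model == "position_based":
        # Assumed 40/20/40 split; short paths fall back to an even split.
        raw = [1.0] * n if n <= 2 else [0.4] + [0.2 / (n - 2)] * (n - 2) + [0.4]
    else:
        raise ValueError(f"unknown model: {model}")
    total = sum(raw)
    return [w / total for w in raw]


def channel_credit(touches: list[Touch], converted_at: datetime, model: str,
                   half_life_days: float = 7.0) -> dict[str, float]:
    """Aggregate per-touch weights into per-channel credit that sums to 1."""
    credit: dict[str, float] = defaultdict(float)
    weights = touch_weights(touches, converted_at, model, half_life_days)
    for (channel, _), weight in zip(touches, weights):
        credit[channel] += weight
    return dict(credit)
```

Multiplying each channel's credit by the conversion value gives the attributed revenue for that model.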

Data-driven attribution (Shapley value decomposition or a Markov chain transition model) requires minimum data thresholds before deployment: Shapley needs a minimum of 3,000 to 5,000 labelled conversions and a 6-month conversion window baseline; Markov needs roughly 1,000 conversion paths. Activating a model before these minimums are reached produces credibility-destroying artefacts. The system gates data-driven model activation on rolling window checks and falls back to rules-based models during ramp periods.
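The activation gate can be expressed as a small pure function. This sketch takes the lower bound of the Shapley range (3,000), approximates the 6-month baseline as 180 days, and assumes time_decay as the fallback; the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    shapley_min_conversions: int = 3000   # lower bound of the quoted range
    shapley_min_window_days: int = 180    # "6-month" baseline, approximated
    markov_min_paths: int = 1000


def active_model(preferred: str, conversions: int, paths: int, window_days: int,
                 thresholds: Thresholds = Thresholds(),
                 fallback: str = "time_decay") -> str:
    """Return the model to run, falling back while data is still ramping."""
    if preferred == "shapley":
        ready = (conversions >= thresholds.shapley_min_conversions
                 and window_days >= thresholds.shapley_min_window_days)
    elif preferred == "markov":
        ready = paths >= thresholds.markov_min_paths
    else:
        return preferred  # rules-based models have no minimum volume
    return preferred if ready else fallback
```

Evaluating this on a rolling window means a tenant whose volume drops also drops back to the rules-based baseline.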

Revenue attribution is never stored as a field on the campaign record. Instead, a materialised channel_revenue_attribution table in the warehouse holds one row per conversion event per model, with the attributed revenue amount (NUMERIC(19,4) to prevent floating-point rounding errors) and the model version. Finance teams access attribution data via BI tools against this table. Marketing teams and campaign platforms receive aggregated metrics via a thin API layer backed by Redis cache with tenant-keyed TTLs and event-driven invalidation on model refresh.
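To illustrate why fixed-point storage matters, the sketch below splits one conversion's revenue across channels with Python's decimal module, quantised to four places to match NUMERIC(19,4). Assigning the rounding residue to the largest share is an assumption made here so that rows always sum back to the original amount; split_revenue is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_EVEN

QUANT = Decimal("0.0001")  # four decimal places, matching NUMERIC(19,4)


def split_revenue(revenue: Decimal, weights: dict[str, Decimal]) -> dict[str, Decimal]:
    """Split revenue by channel weight without losing or creating money."""
    total_weight = sum(weights.values(), Decimal(0))
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    shares = {
        channel: (revenue * weight / total_weight).quantize(QUANT, rounding=ROUND_HALF_EVEN)
        for channel, weight in weights.items()
    }
    # Rounding can leave a tiny residue; give it to the largest share
    # so the per-channel rows reconcile exactly with the conversion value.
    residue = revenue.quantize(QUANT) - sum(shares.values(), Decimal(0))
    largest = max(shares, key=shares.get)
    shares[largest] += residue
    return shares
```

With three equal weights and revenue of 100, two channels receive 33.3333 and one receives 33.3334, so finance totals reconcile exactly.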

Holdout group support (withholding a percentage of the audience from all paid channels) is implemented at the audience segmentation layer. Holdout membership is recorded once, at assignment time, rather than continuously re-evaluated, so that incremental lift can be computed cleanly in the post-campaign analysis.
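A sketch of assign-once holdout membership, assuming a hash-based bucket per (campaign, user) pair. The hashing scheme and the HoldoutRegistry class are illustrative choices, not details from the source.

```python
import hashlib
from datetime import datetime, timezone


def in_holdout(user_key: str, campaign_id: str, holdout_pct: float) -> bool:
    """Deterministically bucket a user into [0, 1) and compare to the cut-off."""
    digest = hashlib.sha256(f"{campaign_id}:{user_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_pct


class HoldoutRegistry:
    """Stores membership at assignment time; later calls never re-evaluate."""

    def __init__(self) -> None:
        self._assignments: dict[tuple[str, str], dict] = {}

    def assign(self, user_key: str, campaign_id: str, holdout_pct: float) -> bool:
        key = (user_key, campaign_id)
        if key not in self._assignments:
            self._assignments[key] = {
                "holdout": in_holdout(user_key, campaign_id, holdout_pct),
                "assigned_at": datetime.now(timezone.utc),
            }
        return self._assignments[key]["holdout"]
```

Because the stored record wins over any recomputation, changing the holdout percentage mid-campaign does not move existing users between groups.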

Privacy-First Architecture and Consent Infrastructure

Attribution platforms handle personal data across multiple jurisdictions. Engineering decisions made early become expensive compliance liabilities later.

Consent is tracked as a per-event field, not a per-user flag. Each raw event carries a consent_categories array representing what the user had opted into at the moment the event was collected. A routing_policy configuration table maps (consent_category, destination) pairs to routing decisions (allow, block, anonymise). A GDPR-resident user who has consented to analytics but not advertising will therefore have their events routed to BigQuery but blocked from the Meta CAPI destination, with no application code change, because routing is driven by the policy table. Changes to consent logic are configuration deployments, not code deployments.
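The policy lookup can be sketched as follows. The example rows, the destination names, the default-deny behaviour for unmapped pairs, and the rule that the most permissive decision among consented categories wins are all assumptions made for illustration.

```python
from enum import Enum


class Decision(str, Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ANONYMISE = "anonymise"


# Hypothetical routing_policy rows: (consent_category, destination) -> decision.
ROUTING_POLICY: dict[tuple[str, str], Decision] = {
    ("analytics", "bigquery"): Decision.ALLOW,
    ("analytics", "meta_capi"): Decision.BLOCK,
    ("advertising", "meta_capi"): Decision.ALLOW,
    ("advertising", "bigquery"): Decision.ANONYMISE,
}


def route(consent_categories: list[str], destination: str,
          policy: dict[tuple[str, str], Decision] = ROUTING_POLICY) -> Decision:
    """Resolve one event's consent against the policy table for a destination."""
    # Unmapped pairs are treated as blocked (default deny).
    decisions = {policy.get((category, destination), Decision.BLOCK)
                 for category in consent_categories}
    if Decision.ALLOW in decisions:
        return Decision.ALLOW
    if Decision.ANONYMISE in decisions:
        return Decision.ANONYMISE
    return Decision.BLOCK
```

Changing a row in the table changes routing for every subsequent event, which is what makes consent changes a configuration deployment.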

GDPR erasure requests are handled via a data_subject_request workflow (Temporal orchestration): the DSR triggers a cascade that pseudonymises the user's PII across raw_events, identity_graph, and any enriched tables, writes a consent_block audit record, and notifies downstream destinations via their erasure APIs. The deduplication key is retained as a hash. A 30-day fulfilment SLA with a status field on the DSR record satisfies ROPA documentation requirements.
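The pseudonymisation step of that cascade might look like the sketch below. It works on in-memory rows, omits the Temporal orchestration and the downstream erasure API calls entirely, and assumes a salted SHA-256 hash as the pseudonymisation method; the PII column list, the run_dsr function, and the audit record fields are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

PII_FIELDS = ("email", "phone", "ip_address")  # hypothetical column list


def _pseudonym(value: str, salt: bytes) -> str:
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def run_dsr(user_id: str, tables: dict[str, list[dict]], salt: bytes) -> dict:
    """Pseudonymise one subject's rows in place and return an audit record."""
    touched: dict[str, int] = {}
    for table_name, rows in tables.items():
        count = 0
        for row in rows:
            if row.get("user_id") != user_id:
                continue
            for column in PII_FIELDS:
                if row.get(column) is not None:
                    row[column] = _pseudonym(str(row[column]), salt)
            row["user_id"] = _pseudonym(user_id, salt)
            count += 1  # dedup_key is deliberately left untouched
        touched[table_name] = count
    return {
        "record_type": "consent_block",
        "subject": _pseudonym(user_id, salt),
        "rows_pseudonymised": touched,
        "status": "completed",
        "completed_at": datetime.now(timezone.utc),
    }
```

Running each table as a separate workflow activity would let a failed step retry without repeating the ones that already succeeded.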

For CCPA, the do_not_sell flag on the identity record blocks enrichment from third-party data vendors and prevents hashed-email transmission to advertising platforms. The suppression record is append-only so that re-consent can be recorded accurately.
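An append-only suppression record can be modelled as a log in which the latest entry for an identity decides the current state. The class and method names below are hypothetical, and a list stands in for the underlying table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SuppressionEvent:
    identity_id: str
    do_not_sell: bool
    recorded_at: datetime


class SuppressionLog:
    """Entries are only ever appended, so opt-outs and re-consent keep their history."""

    def __init__(self) -> None:
        self._events: list[SuppressionEvent] = []

    def record(self, identity_id: str, do_not_sell: bool,
               at: Optional[datetime] = None) -> None:
        stamp = at or datetime.now(timezone.utc)
        self._events.append(SuppressionEvent(identity_id, do_not_sell, stamp))

    def do_not_sell(self, identity_id: str) -> bool:
        """The most recently appended entry for the identity wins."""
        for event in reversed(self._events):
            if event.identity_id == identity_id:
                return event.do_not_sell
        return False

    def may_send_hashed_email(self, identity_id: str) -> bool:
        return not self.do_not_sell(identity_id)

    def history(self, identity_id: str) -> list[SuppressionEvent]:
        return [e for e in self._events if e.identity_id == identity_id]
```

Keeping every entry means an audit can show exactly when a user opted out and when they re-consented.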

Server-side tagging via a self-hosted server-side GTM container (Cloud Run or Fargate) keeps the marketing data flow inside the company's data boundary. The container logs all outbound tag firings with timestamp and consent state, providing the audit trail required for demonstrating GDPR accountability to supervisory authorities.

Ready to build an attribution platform on infrastructure your team owns? Start a conversation or explore our dedicated team model.

Frequently Asked Questions

How do you handle attribution across channels that do not share conversion data (such as TV, out-of-home, or offline events)?

Offline and non-digital channels are ingested via manual upload or API (TV airings from a media agency export, point-of-sale conversion events via EDI/webhook) and assigned a synthetic channel_id in the campaign registry. Their events participate in the Markov chain model as additional nodes. Incremental lift studies (with matched holdout regions for geographic-based channels) provide a causal check on the Markov model's channel credit for offline spend. Results are stored in the same channel_revenue_attribution table so finance has a unified view.
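The causal check from a holdout study reduces to comparing conversion rates between exposed and holdout groups. The sketch below shows that arithmetic only; it ignores significance testing and region matching, and the function name and output keys are hypothetical.

```python
from typing import Optional


def incremental_lift(exposed_conversions: int, exposed_size: int,
                     holdout_conversions: int, holdout_size: int) -> dict:
    """Estimate the conversions caused by the channel from a holdout comparison."""
    if exposed_size <= 0 or holdout_size <= 0:
        raise ValueError("both groups must be non-empty")
    exposed_rate = exposed_conversions / exposed_size
    holdout_rate = holdout_conversions / holdout_size
    # Conversions the exposed group would not have produced without the channel.
    incremental = (exposed_rate - holdout_rate) * exposed_size
    relative: Optional[float] = (
        (exposed_rate - holdout_rate) / holdout_rate if holdout_rate > 0 else None
    )
    return {
        "exposed_rate": exposed_rate,
        "holdout_rate": holdout_rate,
        "incremental_conversions": incremental,
        "relative_lift": relative,
    }
```

If the Markov model credits a channel with far more conversions than the incremental estimate supports, that gap is the signal to investigate.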

What prevents the attribution platform from being manipulated by click fraud or invalid traffic?

The collection endpoint filters bot traffic using a combination of user-agent analysis, request rate limiting per IP range, and a honey-pot event type that legitimate browsers never fire. Suspicious events are tagged with an invalid_traffic_flag and sequestered in a quarantine partition of raw_events rather than deleted: this allows forensic review and model exclusion without permanent data loss. Click fraud signals from advertising platforms (Google Invalid Click report, Meta Suspicious Activity API) are ingested nightly and used to retroactively flag attributed conversions that originated from known-invalid sources. Affected attribution rows are versioned, not overwritten, so historical model outputs remain auditable.
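The three collection-time checks can be combined into one classifier that tags rather than drops events. The user-agent pattern, the honeypot event name, and the rate threshold below are illustrative placeholders, and the per-range request count is assumed to be supplied by an upstream counter.

```python
import re

BOT_UA = re.compile(r"bot|crawler|spider|headless", re.IGNORECASE)  # illustrative
HONEYPOT_EVENT = "__hp_probe"        # hypothetical event real browsers never fire
MAX_EVENTS_PER_WINDOW = 120          # hypothetical per-IP-range limit


def classify(event: dict, events_from_ip_range: int) -> dict:
    """Tag an event with invalid-traffic signals and a target partition."""
    reasons: list[str] = []
    if BOT_UA.search(event.get("user_agent", "")):
        reasons.append("bot_user_agent")
    if event.get("event_name") == HONEYPOT_EVENT:
        reasons.append("honeypot")
    if events_from_ip_range > MAX_EVENTS_PER_WINDOW:
        reasons.append("rate_limit")
    tagged = dict(event)             # the original event is left unchanged
    tagged["invalid_traffic_flag"] = bool(reasons)
    tagged["invalid_reasons"] = reasons
    # Suspicious events are quarantined, not deleted, to allow forensic review.
    tagged["partition"] = "quarantine" if reasons else "main"
    return tagged
```

Recording the reasons alongside the flag makes it possible to release events later if a rule turns out to be too aggressive.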

When does it make sense to invest in data-driven attribution versus sticking with rules-based models?

Rules-based models are the right choice until there is enough conversion volume for data-driven models to outperform them out-of-sample. Deploy last-touch for optimisation signals sent back to paid platforms (they operate on recency signals); deploy linear or time-decay for internal budget allocation reporting; deploy Markov only after accumulating roughly 1,000 conversion paths, and Shapley only after 3,000 to 5,000 qualified conversions, in both cases with consistent UTM hygiene. A model trained on noisy data will confidently mis-attribute spend. The cost of premature data-driven attribution is higher than the cost of staying rules-based one quarter longer.
