Marketing Analytics App Development

Build marketing analytics platforms with Scrums.com. Teams for event ingestion, attribution, segmentation, and warehouse architecture. Deploy in 21 days.

Companies building marketing analytics platforms are engineering data infrastructure, not dashboards. The engineering challenge is creating a system that ingests events reliably from web, mobile, server-side, and ad network sources; resolves cross-device identity for the same user across sessions and channels; applies attribution models consistently across marketing spend that spans dozens of channels; and delivers query results against billions of events within the response time budgets that product managers and marketers expect. The attribution layer is typically where the complexity lives: multi-touch attribution requires a complete event history for every converted user, matched across channels using a resolved identity graph, with deduplication logic that prevents double-counting the same conversion across model comparisons. Whether building an internal marketing analytics platform for a large SaaS or e-commerce company, or a standalone analytics product for agencies and marketers, the underlying data architecture determines whether the platform can answer new attribution questions without re-processing the full event history. Scrums.com builds dedicated engineering teams that ship production-grade marketing analytics infrastructure in weeks, not quarters.

Event Ingestion and Data Pipeline Architecture

The event ingestion layer receives events from three sources: client-side (JavaScript SDK, mobile SDK), server-side (application backend sending purchase events, subscription events), and ad network integrations (Google Ads conversion uploads, Meta CAPI, TikTok Events API). Server-side events are more reliable than client-side events (unaffected by ad blockers, ITP, and browser privacy restrictions), are easier to gate on recorded consent, and should be the authoritative source for conversion events. Client-side events are valuable for behavioural signals (page views, scroll depth, feature interactions) where server-side capture is impractical.

Each event must carry a minimum schema: event_id (UUID v4, client-generated for deduplication), event_type, timestamp (client), received_at (server), user_id (if authenticated), anonymous_id (always present, persistent across sessions), session_id, and a properties object. The event schema is enforced by a schema registry at ingestion: unknown event types or properties are rejected or quarantined in a dead-letter queue rather than accepted silently. Silent acceptance of malformed events is how marketing data warehouses accumulate months of inconsistently shaped data that cannot be queried reliably.
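A minimal sketch of ingestion-time validation, assuming a small in-memory registry. SCHEMA_REGISTRY, validate_event, and ingest are hypothetical names for illustration, not a specific schema registry product; user_id is treated as optional and received_at is assumed to be stamped by the server after validation.

```python
import uuid
from datetime import datetime

# Hypothetical registry: event_type -> allowed property names.
SCHEMA_REGISTRY = {
    "page_view": {"url", "referrer"},
    "purchase": {"order_id", "amount", "currency"},
}
REQUIRED_FIELDS = ("event_id", "event_type", "timestamp",
                   "anonymous_id", "session_id", "properties")


def validate_event(event: dict) -> list:
    """Return a list of problems; an empty list means the event is accepted."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in event]
    if problems:
        return problems
    try:
        if uuid.UUID(str(event["event_id"])).version != 4:
            problems.append("event_id is not a UUID v4")
    except ValueError:
        problems.append("event_id is not a valid UUID")
    try:
        datetime.fromisoformat(event["timestamp"])
    except (TypeError, ValueError):
        problems.append("timestamp is not ISO 8601")
    allowed = SCHEMA_REGISTRY.get(event["event_type"])
    if allowed is None:
        problems.append(f"unknown event_type: {event['event_type']}")
    elif not isinstance(event["properties"], dict):
        problems.append("properties is not an object")
    else:
        unknown = set(event["properties"]) - allowed
        if unknown:
            problems.append(f"unknown properties: {sorted(unknown)}")
    return problems


def ingest(event: dict, accepted: list, dead_letter: list) -> None:
    """Route an event to the accepted stream or quarantine it with its problems."""
    problems = validate_event(event)
    if problems:
        dead_letter.append({"event": event, "problems": problems})
    else:
        accepted.append(event)
```

Quarantined events keep the list of problems alongside the payload, so a malformed tag can be diagnosed and replayed rather than lost.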

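Deduplication uses event_id as the idempotency key. Events are written to a raw_events table with a UNIQUE constraint on event_id. Duplicate events (from SDK retries, network partitions, or double-firing tags) are silently dropped on constraint violation, not returned as errors. The constraint must exist in the final warehouse table, not just an upstream buffer, because events can arrive hours late on high-latency mobile connections, after a short-lived buffer has already forgotten the original event_id. Several columnar warehouses (BigQuery, Snowflake, and Redshift among them) treat uniqueness constraints as informational and do not enforce them; on those engines the same rule is applied with a MERGE or an anti-join on event_id at load time.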
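The drop-on-duplicate behaviour can be sketched with SQLite standing in for the warehouse. This is an illustration under that assumption: the reduced column set and the write_event helper are hypothetical, and INSERT OR IGNORE is SQLite syntax that other engines express differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_events (
        event_id    TEXT NOT NULL UNIQUE,   -- idempotency key
        event_type  TEXT NOT NULL,
        received_at TEXT NOT NULL
    )
""")


def write_event(conn, event_id, event_type, received_at) -> bool:
    """Insert one event; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO raw_events (event_id, event_type, received_at) "
        "VALUES (?, ?, ?)",
        (event_id, event_type, received_at),
    )
    # rowcount is 0 when the UNIQUE constraint caused the row to be ignored.
    return cur.rowcount == 1
```

A retry that arrives three hours after the original is still dropped, because the check runs against the table itself rather than a time-limited buffer.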

Identity resolution stitches anonymous_id to user_id on authentication events. An identity_graph table stores (anonymous_id, user_id, linked_at) pairs. All historical events for a given anonymous_id are retroactively attributed to the resolved user_id via a view join, not by mutating the raw_events table. A pre-login conversion is correctly attributed to the authenticated user's journey without rewriting historical records.
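A sketch of the view join, again using SQLite as a stand-in. The view name resolved_events is hypothetical, and the sketch assumes each anonymous_id links to at most one user_id; shared devices would need a tie-break rule (for example, the earliest linked_at) to avoid duplicated rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (
        event_id     TEXT PRIMARY KEY,
        anonymous_id TEXT NOT NULL,
        user_id      TEXT,              -- NULL for pre-login events
        event_type   TEXT NOT NULL,
        timestamp    TEXT NOT NULL
    );
    CREATE TABLE identity_graph (
        anonymous_id TEXT NOT NULL,
        user_id      TEXT NOT NULL,
        linked_at    TEXT NOT NULL
    );
    -- Resolution happens at read time; raw_events is never updated.
    CREATE VIEW resolved_events AS
    SELECT e.event_id, e.event_type, e.timestamp, e.anonymous_id,
           COALESCE(e.user_id, g.user_id) AS resolved_user_id
    FROM raw_events e
    LEFT JOIN identity_graph g ON g.anonymous_id = e.anonymous_id;
""")

# A purchase recorded before login carries only an anonymous_id.
conn.execute(
    "INSERT INTO raw_events VALUES "
    "('e1', 'anon-1', NULL, 'purchase', '2024-05-01T10:00:00')"
)
# The later authentication event adds a link to the identity graph.
conn.execute(
    "INSERT INTO identity_graph VALUES ('anon-1', 'user-42', '2024-05-01T10:05:00')"
)
```

Querying resolved_events now returns user-42 for event e1, while the stored row in raw_events still has a NULL user_id.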

Multi-Touch Attribution and Channel Analytics

The attribution engine is a projection query over the identity-resolved event history. For each conversion event, the engine looks back a configurable attribution_window (stored in attribution_config: default 30 days for paid, 7 days for organic) and retrieves all touchpoints for that user_id within the window. Touchpoints are assigned credit according to the attribution_model (also in attribution_config): last_click (100% to the last non-direct touchpoint), first_click, linear (equal share), time_decay (exponential decay toward conversion), or data-driven (custom weights trained on conversion data).
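The rule-based models can be sketched as one credit function. This is an illustration, not the engine itself: assign_credit and the half_life_days parameter are hypothetical, a single window_days replaces the per-channel-type windows, the channel name "direct" is an assumed label, and the data-driven model is omitted because it depends on trained weights.

```python
from datetime import datetime, timedelta


def assign_credit(touchpoints, conversion_at, model="last_click",
                  window_days=30, half_life_days=7.0):
    """touchpoints: list of (channel, datetime). Returns {channel: credit}.

    Credits across all channels sum to 1.0 when any touchpoint is in the window.
    """
    start = conversion_at - timedelta(days=window_days)
    in_window = sorted(
        (tp for tp in touchpoints if start <= tp[1] <= conversion_at),
        key=lambda tp: tp[1],
    )
    n = len(in_window)
    if n == 0:
        return {}
    if model == "first_click":
        weights = [1.0] + [0.0] * (n - 1)
    elif model == "last_click":
        # Last non-direct touchpoint; fall back to the last one if all are direct.
        idx = max((i for i, tp in enumerate(in_window) if tp[0] != "direct"),
                  default=n - 1)
        weights = [1.0 if i == idx else 0.0 for i in range(n)]
    elif model == "linear":
        weights = [1.0 / n] * n
    elif model == "time_decay":
        # Weight halves for every half_life_days of distance from the conversion.
        raw = [0.5 ** ((conversion_at - ts).total_seconds() / 86400 / half_life_days)
               for _, ts in in_window]
        total = sum(raw)
        weights = [r / total for r in raw]
    else:
        raise ValueError(f"unsupported attribution_model: {model}")
    credit = {}
    for (channel, _), weight in zip(in_window, weights):
        if weight > 0:
            credit[channel] = credit.get(channel, 0.0) + weight
    return credit
```

Because the function only reads touchpoints and returns weights, running it under two models for the same conversion is a comparison of two projections, not two copies of the data.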

The attribution_config table is the single point of change for model parameters: switching from last_click to linear does not require a pipeline rewrite; it requires updating the model parameter and re-running the projection. Store attribution_model as a string enum; store model_params as a JSON column. The engine reads config at projection time, which enables historical re-attribution: "what would the channel mix look like under linear attribution for the past 90 days?" is a query, not a data migration.
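A sketch of the config table and the read-at-projection-time pattern, with SQLite as a stand-in. The single-row layout (config_id = 1), the load_config and switch_model helpers, and the keys inside model_params are assumptions for illustration.

```python
import json
import sqlite3

ALLOWED_MODELS = {"last_click", "first_click", "linear", "time_decay", "data_driven"}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attribution_config (
        config_id         INTEGER PRIMARY KEY CHECK (config_id = 1),
        attribution_model TEXT NOT NULL,
        model_params      TEXT NOT NULL      -- JSON document
    )
""")
conn.execute(
    "INSERT INTO attribution_config VALUES (1, 'last_click', ?)",
    (json.dumps({"window_days": {"paid": 30, "organic": 7}}),),
)


def load_config(conn):
    """Read the active model and its parameters; called at projection time."""
    model, params = conn.execute(
        "SELECT attribution_model, model_params FROM attribution_config "
        "WHERE config_id = 1"
    ).fetchone()
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown attribution_model: {model}")
    return model, json.loads(params)


def switch_model(conn, model, extra_params=None):
    """Change the model by updating the config row; no pipeline code changes."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown attribution_model: {model}")
    _, params = load_config(conn)
    params.update(extra_params or {})
    conn.execute(
        "UPDATE attribution_config SET attribution_model = ?, model_params = ? "
        "WHERE config_id = 1",
        (model, json.dumps(params)),
    )
```

After switch_model(conn, "time_decay", {"half_life_days": 7}), the next projection run picks up the new model and parameters from load_config.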

Channel cost ingestion: ad spend data from Google Ads, Meta, LinkedIn, and TikTok is pulled via their APIs on a daily schedule (or hourly for high-spend campaigns). Cost records land in a channel_costs table: date, channel, campaign_id, impressions, clicks, spend (DECIMAL(12,4)), currency. A daily FX normalisation job converts all costs to the reporting currency using an fx_rates append-only table. The channel_performance materialised view joins channel_costs with attributed conversions to compute CPC, CPL, CAC, and ROAS per channel and campaign, refreshed nightly.
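The four ratios reduce to simple arithmetic once spend is in the reporting currency. A minimal sketch, assuming the hypothetical helper channel_metrics and Decimal inputs to match the DECIMAL spend column; fx_rate is the multiplier from source currency to reporting currency.

```python
from decimal import Decimal


def channel_metrics(spend, fx_rate, clicks, leads, customers, revenue):
    """Compute cost-efficiency ratios for one channel or campaign.

    spend, fx_rate, and revenue are Decimal; counts are integers.
    A ratio is None when its denominator is zero.
    """
    cost = spend * fx_rate  # spend normalised to the reporting currency

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "cost": cost,
        "cpc": ratio(cost, clicks),       # cost per click
        "cpl": ratio(cost, leads),        # cost per lead
        "cac": ratio(cost, customers),    # customer acquisition cost
        "roas": ratio(revenue, cost),     # return on ad spend
    }
```

For example, 1,000 units of spend at an FX rate of 1.25 with 500 clicks, 50 leads, 10 customers, and 5,000 of attributed revenue gives a CPC of 2.5, CPL of 25, CAC of 125, and ROAS of 4.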

Cross-channel conversion deduplication: a conversion is attributed to exactly one user_id in the conversion_events table. If the same user converts across multiple devices, the identity graph resolves them to a single entity before attribution runs. Ad network self-reported conversions are tracked in a separate ad_network_reported_conversions table and reconciled against the platform's own attribution counts: the delta measures over- or under-reporting by each network.
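The reconciliation delta can be sketched as a keyed comparison. The reconcile helper and the (date, channel, campaign_id) key shape are assumptions for illustration; in practice this would be a join between the two tables.

```python
def reconcile(platform_counts, network_counts):
    """Compare platform-attributed and network-reported conversions.

    Both inputs map (date, channel, campaign_id) -> conversion count.
    A positive delta means the network reports more than the platform attributes.
    """
    rows = []
    for key in sorted(set(platform_counts) | set(network_counts)):
        ours = platform_counts.get(key, 0)
        theirs = network_counts.get(key, 0)
        rows.append({
            "key": key,
            "platform": ours,
            "network": theirs,
            "delta": theirs - ours,
            # Relative over-reporting; undefined when the platform counted none.
            "delta_pct": (theirs - ours) / ours if ours else None,
        })
    return rows
```

Keys present on only one side still appear in the output, which surfaces campaigns that one system tracks and the other does not.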

Scrums.com delivers these analytics platforms through dedicated teams via our mobile app development service.

Audience Segmentation and Customer Lifetime Value

Audience segments are defined as queries, not static lists. A segment_definitions table stores: segment_name, definition_query (SQL or a structured filter DSL), refresh_schedule. Segment membership is materialised by a scheduled job that executes the definition_query and writes results to segment_memberships (user_id, segment_id, added_at, removed_at). Membership changes are append-only: a removed user gets a removed_at timestamp, not a deleted row. This preserves historical membership for cohort analysis.
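A sketch of the membership refresh, assuming the definition_query has already been executed and returned the current set of user_ids. The refresh_segment helper is hypothetical and SQLite stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE segment_memberships (
        user_id    TEXT NOT NULL,
        segment_id TEXT NOT NULL,
        added_at   TEXT NOT NULL,
        removed_at TEXT              -- NULL while the membership is active
    )
""")


def refresh_segment(conn, segment_id, current_user_ids, now):
    """Reconcile stored membership with the latest query result; never delete rows."""
    active = {
        row[0] for row in conn.execute(
            "SELECT user_id FROM segment_memberships "
            "WHERE segment_id = ? AND removed_at IS NULL",
            (segment_id,),
        )
    }
    current = set(current_user_ids)
    # New members (including users who rejoin) get a fresh row.
    conn.executemany(
        "INSERT INTO segment_memberships (user_id, segment_id, added_at) "
        "VALUES (?, ?, ?)",
        [(user_id, segment_id, now) for user_id in sorted(current - active)],
    )
    # Users who left are closed out with a timestamp rather than removed.
    conn.executemany(
        "UPDATE segment_memberships SET removed_at = ? "
        "WHERE segment_id = ? AND user_id = ? AND removed_at IS NULL",
        [(now, segment_id, user_id) for user_id in sorted(active - current)],
    )
```

Membership at any past date can then be reconstructed by filtering on added_at and removed_at.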

CLV computation has two components: historical CLV (sum of margin from all completed orders for that user, updated by a trigger on each order event) and predicted CLV (a projection model trained on historical purchase sequences). Predicted CLV uses a Pareto/NBD or BG/NBD model for non-contractual businesses (e-commerce, marketplaces) or a renewal probability model for subscription businesses. Model inputs (recency, frequency, monetary value, age) are pre-computed nightly from order_events and stored in a customer_rfm_snapshot table. CLV scores are stored in customer_clv_scores with a calculated_at timestamp; stale scores are flagged in the freshness monitor.
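The nightly model inputs can be sketched as a per-user aggregation. The rfm_snapshot helper is hypothetical, and the definitions follow one common convention for BG/NBD-style inputs (frequency as repeat purchases, recency as time between first and last order); treat both the convention and the simple per-order average for monetary value as assumptions.

```python
from datetime import date


def rfm_snapshot(orders, as_of):
    """Compute model inputs per user from completed orders.

    orders: list of (user_id, order_date, margin). as_of: snapshot date.
    """
    by_user = {}
    for user_id, order_date, margin in orders:
        by_user.setdefault(user_id, []).append((order_date, margin))

    snapshot = {}
    for user_id, rows in by_user.items():
        dates = [order_date for order_date, _ in rows]
        first, last = min(dates), max(dates)
        total_margin = sum(margin for _, margin in rows)
        snapshot[user_id] = {
            "frequency": len(rows) - 1,            # repeat orders after the first
            "recency_days": (last - first).days,   # first order to latest order
            "age_days": (as_of - first).days,      # first order to snapshot date
            "monetary_value": total_margin / len(rows),
            "historical_clv": total_margin,        # sum of margin to date
        }
    return snapshot
```

The historical_clv field is the observed component; the other four fields are what a predictive model would consume.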

Cohort analysis requires that every user record carries a cohort_date (the date of first conversion or first event, depending on the analysis type). This field must be set at user creation and never mutated. Cohort retention reports compute retention_rate(cohort_date, period_n) as a projection over conversion_events, never as pre-aggregated counts, because the definition of active or converted changes frequently in early-stage products.
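A sketch of retention_rate(cohort_date, period_n) computed directly from events. The input shapes and the fixed period_days length are assumptions; the point is that "retained" is derived at query time, so changing its definition only changes this function.

```python
from datetime import date


def retention_rate(cohort_dates, conversions, cohort_date, period_n, period_days=30):
    """Share of a cohort with at least one conversion in period_n.

    cohort_dates: {user_id: cohort_date}. conversions: list of (user_id, date).
    Period 0 covers the first period_days days starting on the cohort date.
    Returns None for an empty cohort.
    """
    cohort = {user for user, d in cohort_dates.items() if d == cohort_date}
    if not cohort:
        return None
    retained = {
        user for user, d in conversions
        if user in cohort and (d - cohort_date).days // period_days == period_n
    }
    return len(retained) / len(cohort)
```

With a cohort of three users where one converts 35 days after the cohort date, retention for period 1 is one third.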

Behavioural segmentation for ad network audience sync: segment_memberships is the source for lookalike and retargeting audience uploads. A sync job exports hashed user identifiers to Google Customer Match, Meta Custom Audiences, and LinkedIn Matched Audiences on a configurable schedule. Export records land in audience_sync_log: segment_id, network, export_id, exported_at, user_count, and status (pending, accepted, rejected). Rejected exports retry with exponential backoff. Dedicated engineering teams from Scrums.com build these segmentation and CLV architectures from the warehouse schema to the ad network sync.
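Two pieces of the sync job lend themselves to a short sketch: hashing identifiers before export and spacing out retries. The helper names are hypothetical, and the normalisation shown (trim, lowercase, SHA-256) is an assumption; each network documents its own exact normalisation rules, which should take precedence.

```python
import hashlib


def hash_identifier(email: str) -> str:
    """Normalise an email address and return its SHA-256 hex digest."""
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def backoff_delays(max_retries=5, base_seconds=60, cap_seconds=3600):
    """Delay before each retry: doubles per attempt, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** attempt)
            for attempt in range(max_retries)]
```

Normalising before hashing matters because the networks match on the digest: "Jane@Example.com " and "jane@example.com" must produce the same value to be recognised as the same user.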

Real-Time Reporting Infrastructure and Data Freshness

Marketing analytics reports fall into two latency tiers: operational (sub-second: live campaign CTR, real-time conversion count) and strategic (minutes to hours acceptable: CAC by channel last quarter). The same database cannot serve both tiers without degrading one. The correct architecture: ClickHouse or Apache Druid for operational queries (columnar storage, sub-second aggregation over billions of events); dbt + BigQuery/Snowflake/Redshift for strategic queries (SQL-first transformations, scheduled materialised views, cost-optimised for batch).

Dashboard query optimisation: every dashboard widget that aggregates over a large table must query a pre-aggregated summary table, not the raw event table. Summary tables are computed by a dbt model scheduled at the minimum acceptable staleness for that metric (hourly for campaign-level CTR, daily for cohort retention). The summary table is the data contract between data engineering and the front end; the front end never queries raw_events directly. This decouples dashboard performance from raw data volume growth.
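The pre-aggregation step can be sketched with SQLite standing in for a scheduled dbt model. The table campaign_ctr_hourly, the refresh_summary helper, and the full-rebuild strategy are assumptions for illustration; a production model would more likely rebuild incrementally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (
        event_id    TEXT PRIMARY KEY,
        campaign_id TEXT NOT NULL,
        event_type  TEXT NOT NULL,
        timestamp   TEXT NOT NULL          -- ISO 8601, e.g. 2024-05-01T10:15:00
    );
    CREATE TABLE campaign_ctr_hourly (
        campaign_id     TEXT NOT NULL,
        hour            TEXT NOT NULL,     -- e.g. 2024-05-01T10
        impressions     INTEGER NOT NULL,
        clicks          INTEGER NOT NULL,
        last_updated_at TEXT NOT NULL,
        PRIMARY KEY (campaign_id, hour)
    );
""")


def refresh_summary(conn, now):
    """Rebuild the hourly summary from raw_events (full refresh for simplicity)."""
    conn.execute("DELETE FROM campaign_ctr_hourly")
    conn.execute(
        """
        INSERT INTO campaign_ctr_hourly
        SELECT campaign_id,
               substr(timestamp, 1, 13) AS hour,
               SUM(event_type = 'impression'),
               SUM(event_type = 'click'),
               ?
        FROM raw_events
        WHERE event_type IN ('impression', 'click')
        GROUP BY campaign_id, hour
        """,
        (now,),
    )
```

The dashboard reads impressions, clicks, and last_updated_at from campaign_ctr_hourly only, so its query cost depends on the number of campaign-hours rather than the number of raw events.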

Data freshness SLAs: a data_freshness_monitor table stores for each critical table: expected_lag_minutes, alert_threshold_minutes, last_successful_update_at. A scheduled job checks every 15 minutes and fires an alert (PagerDuty, Slack) when the time since last_successful_update_at exceeds alert_threshold_minutes; expected_lag_minutes records the normal delay, so the gap between the two values is the tolerated slack. Dashboard widgets display a data-as-of timestamp derived from the freshness monitor; users see the age of the data they are viewing, not an implicit assumption that it is current.
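The scheduled check can be sketched as a comparison of observed lag against the alert threshold. The stale_tables helper and the table_name key are assumptions for illustration; the alerting call itself is left out.

```python
from datetime import datetime, timedelta


def stale_tables(monitor_rows, now):
    """Return (table_name, lag_minutes) for every table past its alert threshold.

    monitor_rows: list of dicts with table_name, expected_lag_minutes,
    alert_threshold_minutes, and last_successful_update_at (datetime).
    """
    alerts = []
    for row in monitor_rows:
        lag_minutes = (now - row["last_successful_update_at"]).total_seconds() / 60
        if lag_minutes > row["alert_threshold_minutes"]:
            alerts.append((row["table_name"], round(lag_minutes)))
    return alerts
```

The same last_successful_update_at value that drives the alert can feed the data-as-of label in the dashboard, so users and on-call engineers see one source of truth for freshness.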

Marketing data warehouse governance: a data_catalog stores every dbt model's column definitions, business descriptions, and owner. dbt's documentation output generates lineage at the model level; column-level lineage generally needs dbt Cloud or a separate lineage tool, and dashboards appear in the graph only when they are declared as dbt exposures. When a column name changes in a source table, the lineage graph identifies every downstream dbt model and declared dashboard that will break, enabling proactive communication before breakage occurs. Get a dedicated team implementing this analytics infrastructure end to end; start a conversation with Scrums.com.

Frequently Asked Questions

How do we handle duplicate events from client SDK retries?

Use client-generated UUID v4 event_ids as the idempotency key. Write events to the raw_events table with a UNIQUE constraint on event_id: duplicate events are silently dropped on constraint violation, not returned as errors. The constraint must be in the final warehouse table, not just an upstream buffer, because mobile events can arrive hours out of order.

What is the correct pattern for retroactively attributing conversions to an authenticated user?

Store anonymous_id-to-user_id mappings in an identity_graph table with linked_at timestamp. All historical events for a given anonymous_id are attributed to the resolved user_id via a view join over the raw_events table, never by mutating historical event records. This means a pre-login conversion is correctly attributed to the authenticated user's journey without rewriting the event log.

How do we enable switching attribution models without a pipeline rewrite?

Store the attribution_model parameter and model_params JSON in an attribution_config table. The attribution engine reads config at projection time. Changing from last_click to linear requires updating the config row and re-running the attribution projection over the lookback window: no pipeline code changes required.

How do we prevent raw event table queries from degrading dashboard performance at scale?

Pre-aggregate all dashboard metrics into summary tables via scheduled dbt models. Dashboard widgets query the summary table, never raw_events. The summary table is the data contract: it is refreshed on a defined schedule (hourly, daily) and exposes a last_updated_at column that feeds the data-as-of indicator in the UI.

How do we reconcile our platform's conversion counts against ad network self-reported numbers?

Maintain two separate tables: platform-attributed conversions (from your own event log and attribution engine) and ad_network_reported_conversions (pulled from each network's own reporting, for example Google Ads and Meta). A reconciliation report joins both by date, channel, and campaign and computes the delta. The delta is your measure of each network's over- or under-reporting, informing how much to trust their self-reported ROAS figures.

Build your Marketing Analytics app with Scrums.com

DEDICATED TEAMS · OPERATED DELIVERY · FIRST SPRINT IN 21 DAYS