Question 1

How do you architect WebSocket presence at scale without database polling?

Accepted Answer

The correct approach is an in-memory presence store in Redis. Each connected client sends a lightweight heartbeat message to the gateway server every 15-30 seconds. The gateway server batches heartbeats and writes them to a Redis hash or sorted set keyed by workspace_id, with user_id as the field or member and the last heartbeat timestamp as the value or score. Presence readers compute online/offline status by checking whether the stored timestamp is newer than the current time minus the TTL. The gateway server also writes an offline entry when a WebSocket connection closes normally, which signals offline status immediately instead of waiting for the TTL to expire; connections that drop without a clean close are still caught by the TTL check. At scale, the key optimisation is batching: a gateway server handling 10,000 connections should write one Redis pipeline call per interval containing all batched heartbeats, not one call per connection.
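
The batching and TTL check described above can be sketched roughly as follows. This is a minimal illustration, not a production gateway: a plain dict stands in for Redis, flush() stands in for the single pipeline call per interval, and the names HeartbeatBatcher, is_online, and the 45-second TTL are assumptions made for the example. The offline signal on disconnect is modelled here by removing the entry.

```python
import time
from collections import defaultdict

# Assumed TTL: a little longer than one missed 30-second heartbeat.
PRESENCE_TTL_SECONDS = 45


class HeartbeatBatcher:
    """Buffers heartbeats in memory and writes them to the store in one batch."""

    def __init__(self, store):
        # store maps workspace_id -> {user_id: last_heartbeat_timestamp}.
        # It is a stand-in for one Redis hash or sorted set per workspace.
        self.store = store
        self.pending = defaultdict(dict)

    def record(self, workspace_id, user_id, now=None):
        """Buffer a heartbeat; repeated heartbeats overwrite the older one."""
        timestamp = time.time() if now is None else now
        self.pending[workspace_id][user_id] = timestamp

    def flush(self):
        """Write every buffered heartbeat at once and return how many were written."""
        written = 0
        for workspace_id, beats in self.pending.items():
            self.store.setdefault(workspace_id, {}).update(beats)
            written += len(beats)
        self.pending.clear()
        return written

    def mark_offline(self, workspace_id, user_id):
        """Called on a clean WebSocket close so readers see offline immediately."""
        self.pending.get(workspace_id, {}).pop(user_id, None)
        self.store.get(workspace_id, {}).pop(user_id, None)


def is_online(store, workspace_id, user_id, now, ttl=PRESENCE_TTL_SECONDS):
    """A user is online if the last heartbeat is newer than now minus the TTL."""
    last_seen = store.get(workspace_id, {}).get(user_id)
    return last_seen is not None and last_seen >= now - ttl
```

In a real deployment the gateway would call flush() on a timer and issue the writes through one Redis pipeline, so the number of round trips depends on the flush interval rather than on the connection count.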

Question 2

Should we use OT or CRDTs for document co-editing?

Accepted Answer

For most SaaS collaboration tools, CRDTs are the better choice. Operational Transformation has been implemented correctly in production by only a small number of teams (Google, Apache Wave), and the edge cases in concurrent operation transformation are subtle and difficult to test exhaustively. CRDT libraries such as Yjs and Automerge provide mathematical guarantees of convergence and are mature, actively maintained open-source implementations. The trade-off is document state size: CRDT documents retain tombstones for deleted content, so very long-lived collaborative documents grow over time. The standard mitigation is periodic document compaction (producing a new CRDT snapshot that discards unreachable tombstones), run as a background job when the document has no active editing sessions.
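
To make the tombstone trade-off concrete, here is a toy two-phase set, one of the simplest CRDTs. It is only an illustration of convergence and compaction and is not how Yjs or Automerge represent documents internally; the class and method names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TwoPhaseSet:
    """Toy CRDT: items are added to one set and deletions are kept as tombstones."""

    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)  # tombstones for deleted items

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        # Only items this replica has seen can be tombstoned.
        if item in self.added:
            self.removed.add(item)

    def value(self):
        """Visible content: everything added that has not been tombstoned."""
        return self.added - self.removed

    def merge(self, other):
        """Set union is commutative, so replicas converge in any merge order."""
        return TwoPhaseSet(self.added | other.added, self.removed | other.removed)

    def size(self):
        """Stored entries, including tombstones that are no longer visible."""
        return len(self.added) + len(self.removed)

    def compact(self):
        """Snapshot with tombstones dropped. Only safe when no other replica
        can still send operations that refer to the discarded entries."""
        return TwoPhaseSet(set(self.value()), set())
```

The safety condition on compact() is the reason the compaction job is scheduled for periods with no active editing sessions: a replica that is still editing could otherwise merge state that reintroduces deleted content.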

Question 3

How do you implement SCIM 2.0 so enterprise IT teams can automate user lifecycle management?

Accepted Answer

A compliant SCIM 2.0 endpoint requires: a /Users endpoint supporting GET (list with filter), GET by ID, POST (create), PUT (full replace), and PATCH (partial update with the RFC 7644 operations array); a /Groups endpoint for group-to-role mapping; a /ServiceProviderConfig endpoint declaring your supported SCIM features; and Bearer token authentication (a long-lived token generated per enterprise customer, rather than a full OAuth flow). The most common implementation errors are: (1) not handling the PATCH operations array correctly: each operation has an op (add/replace/remove), usually a path, and usually a value, and the server must process the operations in order; (2) not returning the correct resource location in the Location header on POST; (3) deprovisioning users with a DELETE rather than a PATCH setting active=false, which destroys content history and breaks audit trails. Test against Okta's SCIM tester tool before claiming enterprise readiness.
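
The in-order PATCH handling described in point (1) might look roughly like the sketch below. It is deliberately partial: it handles only simple dotted attribute paths such as name.givenName, treats add and replace identically, and does not implement filter paths or multi-valued attribute semantics from RFC 7644. The helper names apply_patch, _set_path, and _remove_path are assumptions made for the example.

```python
import copy

PATCH_OP_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def _set_path(resource, path, value):
    """Set a simple dotted path such as 'name.givenName'."""
    *parents, leaf = path.split(".")
    node = resource
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def _remove_path(resource, path):
    """Remove a simple dotted path; a missing attribute is ignored."""
    *parents, leaf = path.split(".")
    node = resource
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return
    node.pop(leaf, None)


def apply_patch(resource, patch_body):
    """Apply SCIM PatchOp operations in order and return the updated copy."""
    if PATCH_OP_SCHEMA not in patch_body.get("schemas", []):
        raise ValueError("not a SCIM PatchOp message")
    updated = copy.deepcopy(resource)  # never mutate the stored resource in place
    for operation in patch_body.get("Operations", []):
        # Some identity providers send "Replace" or "Add", so compare case-insensitively.
        op = str(operation.get("op", "")).lower()
        path = operation.get("path")
        value = operation.get("value")
        if op in ("add", "replace"):
            if path is None:
                # Without a path, the value is an object of attributes to merge.
                if not isinstance(value, dict):
                    raise ValueError("value must be an object when path is omitted")
                updated.update(value)
            else:
                _set_path(updated, path, value)
        elif op == "remove":
            if path is None:
                raise ValueError("remove requires a path")
            _remove_path(updated, path)
        else:
            raise ValueError(f"unsupported op: {op!r}")
    return updated
```

A deprovisioning request then becomes a single replace operation that sets active to false, which leaves the user record and its history in place.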

Question 4

How should the notification routing system decide between real-time push, mobile push, and email digest?

Accepted Answer

The routing decision is a function of two inputs: the recipient's current presence state and their notification preference for the event type. Presence state is read from the Redis presence store at notification dispatch time (not at event creation time, to reflect the recipient's current state). If the recipient is online (heartbeat within TTL), dispatch via WebSocket push and mark the notification as delivered. If the recipient is offline and has a mobile device registered, dispatch via APNs/FCM push and mark as push-delivered. If neither condition is met, add the notification to the user's digest queue for the next digest window. Each notification must have a unique idempotency key (notification_id) checked before dispatch to prevent duplicate delivery if the routing job re-processes an event.
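
A minimal sketch of this routing rule, under stated assumptions: presence and device registration are passed in as booleans rather than read from Redis or a device table, the per-event-type preference check is omitted because the answer does not specify how it is applied, and an in-memory dict stands in for a durable idempotency store. The names Channel, choose_channel, and NotificationRouter are hypothetical.

```python
from enum import Enum


class Channel(Enum):
    WEBSOCKET = "websocket"
    MOBILE_PUSH = "mobile_push"
    DIGEST = "digest"


def choose_channel(is_online, has_mobile_device):
    """Pick the delivery channel from state read at dispatch time."""
    if is_online:
        return Channel.WEBSOCKET
    if has_mobile_device:
        return Channel.MOBILE_PUSH
    return Channel.DIGEST


class NotificationRouter:
    """Routes each notification at most once, keyed by notification_id."""

    def __init__(self):
        # notification_id -> Channel; a stand-in for a durable store.
        self.dispatched = {}

    def route(self, notification_id, is_online, has_mobile_device):
        """Return the chosen channel, or None if this id was already dispatched."""
        if notification_id in self.dispatched:
            return None
        channel = choose_channel(is_online, has_mobile_device)
        self.dispatched[notification_id] = channel
        return channel
```

If the routing job re-processes the same event, the second route() call returns None and nothing is sent, even if the recipient's presence has changed in the meantime.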

Question 5

What does SOC 2 Type II compliance require from a remote work SaaS platform's engineering team?

Accepted Answer

SOC 2 Type II requires evidence that controls operated effectively over a continuous observation period (typically 6-12 months), not just that they exist at a point in time. The controls most relevant to a collaboration platform are: access control (role-based access with quarterly access reviews, MFA enforcement for administrative access, off-boarding process with documented evidence), change management (all production changes deployed via a reviewed and approved process: git PR with approval, CI/CD pipeline logs as evidence), availability (uptime monitoring with SLA measurement, incident post-mortems for downtime events), and confidentiality (encryption at rest and in transit, key management documentation, data classification policy). The engineering team's contribution is the audit trail: every deployment, access grant, configuration change, and security incident must produce a timestamped log entry that the auditor can review as evidence.
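
As a rough illustration of what a timestamped audit entry might contain, the sketch below emits one JSON record per event. The field names (event_type, actor, target, details) and the function name audit_entry are assumptions for the example; SOC 2 does not prescribe a log format, and the auditor's evidence requests determine what is actually needed.

```python
import json
from datetime import datetime, timezone


def audit_entry(event_type, actor, target, details=None, now=None):
    """Build one JSON audit record with a UTC timestamp."""
    timestamp = now if now is not None else datetime.now(timezone.utc)
    record = {
        "timestamp": timestamp.isoformat(),
        "event_type": event_type,  # e.g. "deployment" or "access_grant"
        "actor": actor,            # who performed or approved the action
        "target": target,          # what was changed
        "details": details or {},
    }
    # Sorted keys keep entries stable and easy to compare across runs.
    return json.dumps(record, sort_keys=True)
```

Writing one such line per deployment, access grant, configuration change, and incident to append-only storage gives the auditor a record that covers the whole observation period.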

Remote Work App Development

Real-Time Collaboration Infrastructure

Async Communication and Notification Routing

Enterprise Identity, SSO, and Multi-Tenancy

Analytics, Data Residency, and Compliance

