
Telemetry Cloud Integration Plan

Last updated: 2026-03-27 13:25 ET

Purpose

Define the implementation plan for sending BionicLoop runtime/device/alert/UI telemetry to the cloud endpoint in /Users/jcostik/BionicScout, with reliable delivery and timeline-quality reconstruction when a subject has a problem or gets stuck.

The primary backend handoff contract is maintained in:

  • Docs/Planning/TelemetryCloudContractForBionicScout.md

Related development-support planning for CloudWatch-friendly integration log review:

  • Docs/Planning/IntegrationLogReviewPlan.md

Phase 1 Lock Status

Phase 1 (contract lock and schema freeze) is complete for telemetry schema_version = 1.0.0. The locked baseline includes:

  • required envelope + correlation fields
  • event-family payload minimums
  • canonical critical UI IDs and reason-code vocabulary
  • execution-time semantics for loop-step events (step_executed_at over envelope created_at for step timing)

Current Baseline (Confirmed)

App (BionicLoop)

  • Basic authentication/session flow is in place.
  • Local telemetry exists today:
    • loop step records (LoopStepRecord) including algorithm input/output snapshots
    • CGM chart history
    • normalized alert lifecycle state (AppAlertCenter)
    • cloud-log upload path (app.log.batch) with UTC entry timestamps

Cloud (BionicScout)

  • Endpoint: POST /v1/telemetry
  • Scope gate: bionicscout.dev.api/telemetry.ingest (JWT scope)
  • Current required envelope fields:
    • event_type
    • schema_version
    • subject_id
    • created_at
    • payload (object)
  • Current behavior:
    • validates envelope shape
    • returns 202 Accepted with an ingest_id
    • does not yet enforce per-event schema validation/persistence

Implemented Baseline (Current App)

Implemented in BionicLoop (Workstream J current baseline):

  • Shared telemetry envelope reporter with required ingest fields and correlation metadata:
    • event_type, schema_version, subject_id, created_at, payload
    • event_id, session_id, app_version, build_number, app_env
  • Authenticated ingest transport reused from AuthenticatedAPIClient (POST /v1/telemetry).
  • Non-blocking emit behavior (telemetry failures never block loop safety execution).
  • Persistent telemetry outbox with:
    • pending/inflight/failedPermanent states
    • sequence ordering
    • retry/backoff on transient failures
    • permanent-failure classification for non-429 4xx
    • queue-cap drop policy and dropped-event telemetry
  • Runtime source wiring:
    • loop.session.armed
    • loop.session.reset
    • loop.step.executed
    • loop.step.skipped
    • loop.command.requested
    • loop.command.applied
    • loop.command.blocked
    • command telemetry guardrails:
      • loop.command.blocked is emitted only when a recommendation existed but could not be applied
      • cadence-only skips without a recommendation do not emit command-block events
      • loop-command payloads preserve command_outcome (applied, blocked, uncertain) so ambiguous pump command outcomes are not flattened in cloud processing
  • App/auth source wiring:
    • app.lifecycle.launched
    • app.lifecycle.foregrounded
    • app.lifecycle.backgrounded
    • auth.session.authenticated
    • auth.session.signed_out
    • auth.session.restore_failed
    • lifecycle payload now includes timezone + clock-check fields:
      • device_timezone_id, device_utc_offset_seconds
      • clock_check_result, clock_check_skew_seconds, clock_check_rtt_ms, clock_check_at_utc
    • UTC check trigger policy:
      • launch (app.lifecycle.launched)
      • foreground only when the last successful check is older than 24 hours
      • timezone/significant-time-change notifications emit app.lifecycle.foregrounded with reason = timezone_or_time_changed
    • drift warning policy:
      • the app emits actionable alert ALERT-APP-CLOCK-SKEW when abs(skew_seconds) > 600
      • the warning is rate-limited to once per 24 hours
      • network-unavailable UTC checks do not raise user-facing warnings
  • CGM source wiring:
    • cgm.reading.processed
    • cgm.reading.masked
    • cgm.connection.changed
    • cgm.state.changed
    • payload guardrail:
      • processed/masked payloads now emit typed reliability/value/timestamp fields plus source_state
      • processed payload includes trend fields (trend, trend_type) for Home parity
  • Pump source wiring:
    • pump.connection.changed
    • pump.status.refreshed
    • pump.command.result
    • pump.pod.lifecycle
    • event schema guardrail:
      • pump.command.result is reserved for the pump delivery/result payload shape and is not reused for loop-command-block payloads
      • pump.command.result emission is change-driven (step/requested/delivered/state delta) to suppress unchanged refresh duplicates
      • pump.pod.lifecycle now includes status snapshot fields (has_active_pod, has_established_session, delivery_state, reservoir_level_u)
  • Alert lifecycle source wiring:
    • alert.issued
    • alert.retracted
    • alert.acknowledged
    • alert.notification.scheduled
    • alert.notification.cleared
    • alert.notification.tapped
    • payload parity:
      • alert lifecycle events include title and recommended_action in addition to message/severity/dedupe fields
      • alert.notification.cleared emits only on an actual active-alert removal transition (idempotent retract calls on already-cleared keys do not emit)
      • OS-notification clear requests run on every retract attempt (idempotent clear path) to cover async schedule/retract races even when no active alert remains
      • source emitters suppress no-op retract fanout where possible (the pump-expiration planner and CGM sync now issue retract actions only for currently active dedupe keys)
      • Home condition sync suppresses no-op clear calls when the condition is already clear and no active alert/debounce task exists
  • Critical UI source wiring (partial baseline):
    • ui.critical.tap
    • ui.critical.submit
    • ui.critical.cancel
    • ui.critical.blocked
    • ui.critical.state_viewed
    • currently instrumented flows:
      • Home Let's Eat tap and blocked reasons
      • meal modal viewed/submit/cancel
      • Home manual BG tap (element_id: home.bg_button)
      • BG modal viewed/cancel and runtime submit-block reasons
    • submit outcome semantics:
      • manual BG submit is runtime-authoritative only
      • meal submit uses a correlated lifecycle: submitted on user confirm, then runtime-emitted accepted / success / blocked / uncertain / resolved outcomes keyed by the meal flow_id
  • Cloud log upload baseline wiring:
    • app.log.batch
    • local threshold + remote override policy model
    • integration-test session logging baseline:
      • persisted test_run_id session state
      • persisted last-session summary so the prior run ID remains copyable after stop/expiry
      • session-specific upload threshold override
      • per-entry session metadata on app.log.batch
      • explicit integration_test_session_started|stopped|expired marker batches
      • DEBUG Home settings controls to start/stop sessions and copy either the active or most recent run ID
      • reviewer workflow currently depends on the tester-provided run ID and UTC window because CloudWatch fanout still needs end-to-end metadata preservation verification
    • Algo2015 native diagnostics (BP_LOG_<timestamp>.txt) are now tailed and emitted through app.log.batch with:
      • persistent cursor checkpoint (fileName + offset) for relaunch resume
      • prefix/summary parsing (A/B/C/D/I/G/~I/~G/AD/P/PA/MB/S, STEP=, *OUT:)
      • severity mapping and step_hint extraction into metadata
      • upload-cap enforcement with an explicit drop-summary line to bound volume
    • next development-support slice:
      • add operator affordances such as scenario presets/share flow and canned CloudWatch retrieval guidance
  • Auth/session guardrail:
    • recovered/refreshed tokens are validated for telemetry ingest scope before persistence, avoiding re-persisting unusable scoped-out token sets
  • CGM state guardrail:
    • cgm.state.changed emission derives has_sensor from the latest callback/refresh state to avoid stale transition telemetry
    • cgm.connection.changed is emitted after refresh so status_text aligns with the current lifecycle state
  • Structured algorithm inspection migration:
    • algorithm.session.snapshot and algorithm.step.snapshot are now the app-side structured inspection source of truth during migration.
    • Local BP_*.txt / BP_LOG_*.txt remain debug/equivalence artifacts only.
    • Current structured BP coverage:
      • full BP matrix row (bp)
      • BP_LOG families A, I, G, ~I, ~G, AD/G24h, PA, P, B, C, D, S
    • Remaining app-side open fields stay explicitly open:
      • algorithm_build_id
      • pump_id
    • Implementation mirror: Docs/Planning/AlgorithmInspectionStructuredTelemetryPlan.md

Implementation references:

  • /Users/jcostik/BionicLoop/BionicLoop/App/AuthSessionNetworking.swift (CloudTelemetryReporter)
  • /Users/jcostik/BionicLoop/BionicLoop/Runtime/LoopRuntimeEngine.swift (loop/app/alert/pump emit calls)
  • /Users/jcostik/BionicLoop/BionicLoop/Features/CGM/G7ViewModel.swift (CGM telemetry emit calls)
  • /Users/jcostik/BionicLoop/BionicLoop/Integrations/Pump/PumpStatusObserver.swift (pump telemetry emit calls)
  • /Users/jcostik/BionicLoop/BionicLoop/Integrations/Pump/AppPumpManagerDelegate.swift (pod lifecycle telemetry emit calls)
  • /Users/jcostik/BionicLoop/BionicLoopTests/BionicLoopInfrastructureTests.swift (envelope/outbox/retry/policy tests)

Known limitations (next slices):

  • ui.critical.* canonical control instrumentation (J3) is complete app-side; remaining work is backend validation/query shaping.
  • Backend schema enforcement, idempotent persistence, DLQ, and replay remain open (J5).
  • Algo2015 native diagnostics are app-side complete (J9); remaining work is backend-side validation/query shaping under J5.

Telemetry Objectives

  • Capture all safety-relevant app, runtime, CGM, pump, and alert lifecycle events.
  • Capture critical user interactions (and blocked actions) to reconstruct incident stories.
  • Deliver events reliably with offline tolerance and deterministic retry behavior.
  • Preserve privacy while keeping enough context for clinical/engineering support.
  • Keep app telemetry transport independent from safe local loop execution.

Canonical Envelope (App -> Cloud)

Required now (cloud-enforced):

  • event_type: String
  • schema_version: String
  • subject_id: String
  • created_at: ISO-8601 UTC
  • payload: Object

Required app-side contract (implemented):

  • event_id: UUID
  • session_id: UUID (app session correlation)
  • app_version: String
  • build_number: String
  • app_env: String (dev|staging|prod)
  • auth_user_sub: String (Cognito subject; never email; UNSET only in unauthenticated continuity mode)

Optional but strongly recommended:

  • flow_id: UUID? (multi-step UX flows like meal announce/BG entry/setup)
  • device_time_offset_sec: Int?
  • ingest_priority: String (normal|high)
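As an illustrative sketch of the envelope contract above, the following assembles a complete schema_version 1.0.0 envelope. The make_envelope helper and the literal version/build values are assumptions for the example, not the app's CloudTelemetryReporter implementation:

```python
import uuid
from datetime import datetime, timezone

# Cloud-enforced required fields, per the contract above.
REQUIRED_CLOUD_FIELDS = {"event_type", "schema_version", "subject_id", "created_at", "payload"}

def make_envelope(event_type, subject_id, payload, session_id, app_env="dev"):
    """Build a v1.0.0 envelope with app-side correlation metadata (illustrative helper)."""
    return {
        # Cloud-enforced required fields
        "event_type": event_type,
        "schema_version": "1.0.0",
        "subject_id": subject_id,
        "created_at": datetime.now(timezone.utc).isoformat(),  # ISO-8601 UTC
        "payload": payload,
        # App-side contract fields
        "event_id": str(uuid.uuid4()),      # generated once at enqueue time
        "session_id": session_id,
        "app_version": "1.0",               # placeholder values for illustration
        "build_number": "100",
        "app_env": app_env,
    }

envelope = make_envelope(
    "loop.step.executed",
    subject_id="subj-123",
    payload={"expected_step": 42, "executed_step": 42, "wake_cause": "timer"},
    session_id=str(uuid.uuid4()),
)
assert REQUIRED_CLOUD_FIELDS <= envelope.keys()
```

The optional fields (flow_id, device_time_offset_sec, ingest_priority) would be merged in the same way when present.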

Event Taxonomy (Phase 1 Locked Set)

schema_version starts at 1.0.0 for all events in this phase.

1) App/Auth Lifecycle

  • app.lifecycle.launched
  • app.lifecycle.foregrounded
  • app.lifecycle.backgrounded
  • auth.session.authenticated
  • auth.session.signed_out
  • auth.session.restore_failed

2) Loop Runtime and Algorithm

  • loop.session.armed
  • loop.session.reset
  • loop.step.executed
  • loop.step.skipped
  • loop.command.requested
  • loop.command.applied
  • loop.command.blocked

Required payload fields for loop.step.*:

  • expected_step, executed_step, wake_cause, skip_reason, recommendation_applied
  • algorithm_input_snapshot (sanitized)
  • algorithm_output_snapshot

Implemented skip reasons now include mealSlotConflict for competing-trigger meal submits that lose the observed slot before coordinator acceptance; cloud handling should preserve it as an explicit blocked meal outcome rather than collapsing it into a cadence-only skip.

3) CGM

  • cgm.reading.processed
  • cgm.reading.masked
  • cgm.connection.changed
  • cgm.state.changed

Required payload fields: reading_timestamp, reliable, value_mgdl?, trend?, mask_reason?, source_state

4) Pump

  • pump.connection.changed
  • pump.status.refreshed
  • pump.command.result
  • pump.pod.lifecycle

Required payload fields: delivery_state, reservoir_units, insulin_delivered_total, bolus_not_delivered, pod_active, error_code?

5) Alerts

  • alert.issued
  • alert.retracted
  • alert.acknowledged
  • alert.notification.scheduled
  • alert.notification.cleared
  • alert.notification.tapped

Required payload fields: alert_code, source, severity, dedupe_key, requires_acknowledge, ack_state, message

6) Critical UI Interaction Telemetry (Story Reconstruction)

Primary events:

  • ui.critical.tap
  • ui.critical.submit
  • ui.critical.cancel
  • ui.critical.blocked
  • ui.critical.state_viewed

Required payload fields:

  • screen_id
  • element_id
  • action
  • result (success|blocked|failed|cancelled)
  • reason (required for blocked/failed)
  • flow_id (required for multi-step flows)
  • linked_step?
  • linked_alert_code?

Critical elements (minimum):

  • home.start_algo_button
  • home.reset_algo_button
  • home.lets_eat_button
  • meal.modal.meal_type_selector
  • meal.modal.carb_relative_selector
  • meal.modal.deliver_slider
  • meal.modal.cancel_button
  • home.bg_button
  • bg.modal.value_picker
  • bg.modal.submit_button
  • bg.modal.cancel_button
  • alerts.banner.acknowledge_button
  • settings.open_cgm_settings
  • settings.open_pump_settings
  • cgm.setup.continue_button
  • cgm.setup.cancel_button
  • pump.setup.continue_button
  • pump.setup.cancel_button
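As a hedged illustration of the payload rules above, the sketch below checks the required-field and conditional-reason constraints for a ui.critical.* payload. The validate_ui_payload helper and the no_active_pod reason code are hypothetical (the reason-code vocabulary is maintained in the contract document, not here):

```python
def validate_ui_payload(payload):
    """Return missing required fields for a ui.critical.* payload (sketch, not the backend validator)."""
    missing = {"screen_id", "element_id", "action", "result"} - payload.keys()
    # reason is required only for blocked/failed results, per the contract above.
    if payload.get("result") in ("blocked", "failed") and "reason" not in payload:
        missing.add("reason")
    return sorted(missing)

blocked_tap = {
    "screen_id": "home",
    "element_id": "home.lets_eat_button",  # canonical critical element ID
    "action": "tap",
    "result": "blocked",
    "reason": "no_active_pod",             # illustrative reason code, not canonical vocabulary
}
assert validate_ui_payload(blocked_tap) == []
```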

7) Telemetry System Health (self-observability)

  • telemetry.outbox.enqueued
  • telemetry.flush.started
  • telemetry.flush.succeeded
  • telemetry.flush.failed
  • telemetry.event.dropped

Required payload fields: queue depth, batch size, retry count, error class/code, dropped_count, oldest_pending_age_sec

CloudWatch App Logging Plan (Severity-Filtered Upload)

Goals

  • Default cloud log upload to high-signal errors only.
  • Allow local settings control to increase verbosity (debug, info, warning, error).
  • Ensure threshold behavior is inclusive: the selected level and all higher severities are uploaded.
  • Add long-term remote override so support teams can increase subject logging without asking the subject to change settings.

Level model

  • debug (lowest)
  • info
  • warning
  • error (highest)

Default upload threshold: error.

Effective threshold resolution (current precedence)

  1. Active remote override (if present and not expired)
  2. Local user-selected threshold in settings
  3. App default (error)
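The precedence and inclusive-filter rules above can be sketched as follows. This is a minimal Python sketch; the dictionary shape used for the remote override (upload_level, expires_at) is an assumption drawn from the override plan below, not a settled backend contract:

```python
from datetime import datetime, timedelta, timezone

LEVELS = ["debug", "info", "warning", "error"]  # ascending severity

def effective_threshold(remote_override, local_setting, now=None):
    """Resolve the upload threshold: unexpired remote override > local setting > default 'error'."""
    now = now or datetime.now(timezone.utc)
    if remote_override and remote_override["expires_at"] > now:
        return remote_override["upload_level"]
    if local_setting in LEVELS:
        return local_setting
    return "error"  # app default, also the fallback for invalid persisted values

def should_upload(entry_level, threshold):
    """Inclusive filter: the selected level and all higher severities upload."""
    return LEVELS.index(entry_level) >= LEVELS.index(threshold)
```

For example, an expired override falls straight back to the local setting, and a warning threshold uploads both warning and error entries.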

App-side design

  • Add a structured app logger sink that emits canonical log DTOs:
    • timestamp
    • level
    • subsystem
    • category
    • message_template
    • metadata (allowlisted)
    • session_id
    • flow_id?
    • subject_id
  • Keep full local logging behavior unchanged; apply filtering only for cloud-upload stream.
  • Batch logs for upload as telemetry events (event_type: app.log.batch).
  • Enforce client-side guardrails:
    • per-batch entry cap
    • max payload size
    • redaction/allowlist before enqueue
    • rate limits for low-level logs (debug/info) to control cost/noise

Settings UX status

  • Implemented (debug builds): settings control Min Upload Level in Home settings.
  • Options: Error (default), Warning, Info, Debug.
  • Persisted via CloudLogUploadPolicy.localThresholdKey.
  • Filtering behavior is inclusive (selected level and higher severities upload).
  • Unit coverage currently includes default/invalid fallback, persisted threshold behavior, and remote-override precedence/expiry.

Remote override plan (long-term)

  • Add a backend config endpoint delivering subject-scoped logging policy:
    • upload_level
    • expires_at
    • reason_code
    • issued_by
  • App refreshes policy:
    • at launch
    • on a periodic foreground interval
    • via silent push-triggered refresh (future)
  • Override is temporary by design (TTL required) and auto-reverts to local/default when expired.
  • All override applications/reversions emit telemetry audit events.

CloudWatch ingestion path (planned)

  • BionicScout receives app.log.batch via /v1/telemetry.
  • Ingest validates log payload schema and writes structured entries to dedicated CloudWatch Log Group(s), partitioned by environment.
  • Add subscription/retention policies:
    • short retention for debug/info
    • longer retention for warning/error
  • Keep idempotency keying to prevent duplicate ingestion during retry.

Safety/privacy constraints

  • No credentials/tokens/raw PHI in log payloads.
  • Strict metadata allowlist for cloud-uploaded logs.
  • Redaction tests required for sensitive fields.
  • Logging transport failures must not affect loop safety behavior.

Algo2015 Native C++ Log Stream -> Cloud Plan

Current state (observed in source)

  • Algo2015 emits verbose textual logs through two channels on Apple builds:
    • cout lines (USE_COUT enabled in Algorithm_2015_10_13.cpp)
    • a session file BP_LOG_<timestamp>.txt in the app Documents directory via RecordConsole()
  • The native stream includes compact record prefixes and diagnostics, for example:
    • A/B/C/D/I/G/~I/~G/AD/P/PA/S records in LogFile
    • console summary lines like STEP=... and *OUT: ...
  • File writes are flushed during step execution; stream is append-only by session.
  • Separate algorithm data dumps (BP_<timestamp>.txt, matrix helpers) are not appropriate for routine cloud upload.

Design goals

  • Route Algo2015 native diagnostics to cloud without changing algorithm dosing behavior.
  • Avoid scraping Xcode/device console; use deterministic in-app capture.
  • Keep default upload volume low (error threshold), with optional controlled escalation.
  • Preserve enough trace detail to correlate with step telemetry for incident reconstruction.

AL1. Near-term (low-risk): file-tail capture from BP_LOG_*.txt

  • Implement an app-side tailer that:
    • discovers the current BP_LOG_*.txt for the active algorithm session
    • persists a byte-offset checkpoint per log file
    • reads only appended lines after each step / periodic flush
  • Convert appended lines into structured log DTO entries with:
    • subsystem = algo2015.native
    • category = bp_log
    • raw_line
    • record_prefix (if present)
    • step_hint (parsed from the line where available)
  • Enqueue through shared cloud log pipeline (app.log.batch).
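The offset-checkpoint tailing loop can be sketched as follows. This is a minimal illustration under stated assumptions (a JSON checkpoint file holding file_name + offset); the production tailer lives in the app's Swift pipeline, not in this code:

```python
import json
import os

def tail_new_lines(path, checkpoint_path):
    """Return only the complete lines appended since the persisted byte offset.

    A new file name resets the offset, which covers a fresh algorithm session
    creating a new BP_LOG_<timestamp>.txt. Partial trailing lines are left for
    the next pass so flush-in-progress writes are never split.
    """
    state = {"file_name": None, "offset": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    if state["file_name"] != os.path.basename(path):
        state = {"file_name": os.path.basename(path), "offset": 0}

    with open(path, "rb") as f:
        f.seek(state["offset"])
        data = f.read()

    lines = data.split(b"\n")
    complete, remainder = lines[:-1], lines[-1]
    state["offset"] += len(data) - len(remainder)  # advance only past complete lines
    with open(checkpoint_path, "w") as f:
        json.dump(state, f)
    return [line.decode("utf-8", errors="replace") for line in complete]
```

Persisting the checkpoint after every read gives relaunch resume and duplicate suppression for free, at the cost of one small write per flush.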

AL2. Parsing and normalization

  • Add lightweight parser for known prefix records (A/B/C/D/...) and key-value summaries (STEP=, *OUT:).
  • Populate parsed metadata fields for queryability:
    • algo_step, cgm_value, bg_value, ireq, idel, etc., when parseable
  • Preserve raw line alongside parsed map for forensic continuity.
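A lightweight parser along these lines might look like the sketch below. The key=value pattern and the metadata key names are assumptions about the native format; the raw line is always preserved, so a parsing miss loses nothing:

```python
import re

# Record prefixes observed in the native stream (per the plan above).
KNOWN_PREFIXES = {"A", "B", "C", "D", "I", "G", "~I", "~G", "AD", "P", "PA", "MB", "S"}

def parse_bp_line(raw_line):
    """Best-effort parse of a native BP_LOG line into queryable metadata.

    Recognizes leading record prefixes and key=value summaries such as 'STEP=12';
    the raw line is kept verbatim for forensic continuity.
    """
    entry = {"raw_line": raw_line, "record_prefix": None, "metadata": {}}
    token = raw_line.split(None, 1)[0] if raw_line.split() else ""
    if token in KNOWN_PREFIXES:
        entry["record_prefix"] = token
    for key, value in re.findall(r"(\w+)=([-\w.]+)", raw_line):
        entry["metadata"][key.lower()] = value
    if "step" in entry["metadata"]:
        entry["metadata"]["step_hint"] = entry["metadata"]["step"]
    return entry
```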

AL3. Severity mapping

  • Map native lines to cloud log levels:
    • error: lines containing hard failures (e.g., FAILED WRITE, fatal execution exits)
    • warning: warning/degraded-path markers (WARNING, NO-GO, algorithm non-execution notices)
    • info/debug: routine per-step trace lines (A/B/C/D, STEP=, *OUT)
  • Apply the existing threshold model:
    • default upload: error
    • include lower levels only when the local/remote threshold allows
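The mapping above reduces to a small classifier. The marker tuples below are illustrative subsets of the real marker lists, and routing A/B/C/D records to debug (versus info for other trace lines) is one possible reading of the info/debug split:

```python
ERROR_MARKERS = ("FAILED WRITE",)       # hard-failure markers (illustrative subset)
WARNING_MARKERS = ("WARNING", "NO-GO")  # degraded-path markers (illustrative subset)

def map_native_severity(raw_line):
    """Map one native BP_LOG line onto the cloud log level model (sketch)."""
    upper = raw_line.upper()
    if any(marker in upper for marker in ERROR_MARKERS):
        return "error"
    if any(marker in upper for marker in WARNING_MARKERS):
        return "warning"
    # Routine per-step trace records stay low-signal.
    prefix = raw_line.split(None, 1)[0] if raw_line.split() else ""
    return "debug" if prefix in ("A", "B", "C", "D") else "info"
```

The resulting level then feeds the shared should-upload threshold check, so native lines get the same error-by-default behavior as app logs.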

AL4. Correlation and volume controls

  • Attach correlation IDs to each uploaded native line:
    • session_id, flow_id?, subject_id, event_id
    • algo_step where parsed
  • Add guardrails:
    • a per-step cap on uploaded native lines
    • batching and rate limits for info/debug
    • drop-summary telemetry when limits are exceeded

AL5. Long-term optimization (optional)

  • Replace/augment file-tail capture with bridge callback sink for direct line streaming from C++.
  • Keep file-tail path as fallback for compatibility.
  • Use callback path to reduce file I/O and simplify session/file-discovery logic.

Reliability Model (App)

Delivery guarantees

  • Target: at-least-once delivery with idempotent replay.
  • Event identity: event_id generated once at enqueue time.
  • Ordering: preserve local event order using monotonic sequence (sequence_no).

Outbox design

  • Persistent outbox states:
    • pending
    • inflight
    • acked
    • failed_permanent
  • Outbox survives app relaunch.
  • Safety-critical events (alert.*, loop.command.blocked, severe device faults) use priority flush.
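The outbox states imply a small state machine; the transition table below is a sketch of one reasonable reading (inflight back to pending models a transient failure scheduled for retry), not the app's exact implementation:

```python
# Legal moves between persistent outbox states (sketch).
VALID_TRANSITIONS = {
    "pending": {"inflight"},
    "inflight": {"acked", "pending", "failed_permanent"},  # pending = transient retry
    "acked": set(),             # terminal: delivered
    "failed_permanent": set(),  # terminal: dropped with telemetry.event.dropped summary
}

def transition(state, new_state):
    """Enforce the outbox state machine; illegal moves raise rather than corrupt state."""
    if new_state not in VALID_TRANSITIONS[state]:
        raise ValueError(f"illegal outbox transition {state} -> {new_state}")
    return new_state
```

Keeping terminal states closed makes replay after relaunch safe: an acked event can never be re-marked pending and re-sent by a stale worker.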

Flush triggers

  • app foreground
  • enqueue when network reachable
  • periodic timer while active
  • background task windows

Retry policy

  • Exponential backoff + jitter for transient failures.
  • 4xx schema/auth failures -> mark permanent, emit telemetry.event.dropped summary event.
  • No retry path may block local loop operation.
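The classification and backoff rules above can be sketched as follows; the base/cap values are illustrative assumptions, not tuned constants from the app:

```python
import random

def classify_ingest_failure(status_code):
    """Non-429 4xx is permanent (schema/auth); 429, 5xx, and network errors stay transient."""
    if 400 <= status_code < 500 and status_code != 429:
        return "permanent"  # mark failed_permanent, emit telemetry.event.dropped summary
    return "transient"      # eligible for retry with backoff

def backoff_delay_seconds(attempt, base=2.0, cap=300.0):
    """Exponential backoff with full jitter; the cap bounds worst-case drain latency."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (uniform over the whole window) spreads retries from many devices that lost connectivity at the same moment, rather than synchronizing them into waves.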

Backpressure and retention

  • Queue cap with an explicit drop policy:
    • never drop active safety-critical alerts before lower-priority UX events
    • emit deterministic drop summaries

Cloud Evolution Plan (BionicScout)

C1. Contract hardening

  • Add per-event schema registry keyed by event_type + schema_version.
  • Reject unknown versions with explicit error details.

C2. Durable ingest

  • Persist accepted events with idempotency key (subject_id, event_id).
  • Add replay-safe writes and DLQ workflows.
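Idempotent persistence keyed by (subject_id, event_id) reduces to a conditional write; the dict below stands in for durable storage, so this is a behavioral sketch, not the BionicScout implementation:

```python
def ingest_event(store, event):
    """Replay-safe write: duplicates are acknowledged without re-writing.

    Under at-least-once delivery the app may resend an event whose ack was
    lost; keying on (subject_id, event_id) makes the retry a harmless no-op.
    """
    key = (event["subject_id"], event["event_id"])
    if key in store:
        return "duplicate"  # already persisted; still return success to the client
    store[key] = event
    return "persisted"
```

In a real store this would be a conditional/transactional put on the composite key, with conflicting-payload duplicates routed to DLQ triage rather than silently accepted.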

C3. Routing and analytics

  • Route to durable store and query path for timeline reconstruction.
  • Keep retention/lifecycle classed by data criticality.

C4. Incident timeline support

  • Query by subject_id, session_id, flow_id, time window.
  • Build “what happened” timeline from device state + loop action + user action + alert lifecycle.

Phased Implementation Checklist

T1. Contract and schema freeze

  • [x] Freeze event-type catalog and payload contracts.
  • [x] Publish JSON examples for each event family.
  • [x] Align app/cloud enum constants and naming rules.

T2. App telemetry SDK boundary

  • [x] Define a single app telemetry emitter interface and DTO model.
  • [x] Define source adapters for runtime, CGM, pump, alerts, and UI features.
  • [x] Add field-level redaction/allowlist rules.

T3. Outbox and transport reliability

  • [x] Implement persistent outbox with sequence/idempotency metadata.
  • [x] Implement flush worker (batch, retry, backoff, priority path).
  • [x] Add transport health metrics and drop accounting events.

T4. Source instrumentation

  • [x] Wire loop/runtime/algorithm events.
  • [x] Wire CGM and pump state/command events.
  • [x] Wire full alert lifecycle events.
  • [x] Wire critical UI interactions for flow storytelling.

T5. Cloud-side contract enforcement

  • [x] Publish strict J5 backend contract packet (validation order, canonical error codes, idempotency conflict semantics, contract-test matrix) in TelemetryCloudContractForBionicScout.md.
  • [ ] Implement per-event schema validation in ingest service.
  • [ ] Implement idempotent durable persistence.
  • [ ] Add DLQ triage/replay tooling.

T6. Story reconstruction workflows

  • [ ] Define incident timeline query API/report format.
  • [x] Add support playbook examples for “subject stuck” scenarios.
  • [ ] Validate timeline completeness against known failure drills.

Minimum “subject stuck” timeline completeness criteria (app + backend contract):

  • At least one app.lifecycle.* event in the episode window with session_id and app_env.
  • The full ui.critical.* chain for the blocked workflow (tap -> optional state_viewed -> submit/cancel/blocked) with a stable flow_id.
  • Corresponding loop/pump/cgm state transitions (loop.step.*, pump.status.refreshed, cgm.connection.changed, cgm.state.changed) within the same time window.
  • Alert lifecycle continuity for involved alerts (alert.issued and alert.retracted or alert.acknowledged) with a stable dedupe_key.
  • If a command was blocked, a machine reason (reason) and related context (linked_alert_code when applicable).
  • If a step executed, timeline consumers must prioritize payload.step_executed_at over envelope created_at for algorithm-time reconstruction.
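The locked event-time semantics amount to a one-line selection rule for timeline consumers; the helper name below is illustrative:

```python
def reconstruction_time(envelope):
    """Prefer payload.step_executed_at over envelope created_at, per the Phase 1 lock.

    created_at records when the envelope was built/enqueued, which can lag the
    algorithm step (offline outbox, delayed flush); step_executed_at is the
    algorithm-time anchor when present.
    """
    return envelope.get("payload", {}).get("step_executed_at") or envelope["created_at"]
```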

T7. Verification and evidence

  • [ ] Add unit tests for event encoding/redaction/outbox state machine.
  • [ ] Add integration tests for auth scope and ingest responses.
  • [ ] Add end-to-end replay tests for incident storytelling completeness.
  • [ ] Publish STR-CLOUD-* evidence artifacts for traceability.

T8. CloudWatch severity-filtered logging rollout

  • [x] Implement app log DTO schema and cloud-upload logger sink.
  • [x] Implement threshold filter logic (selected + above) with default error.
  • [x] Add settings UI for local threshold selection and persistence.
  • [ ] Add remote override policy fetch/apply/expiry behavior with audit telemetry.
  • [ ] Add cloud validation and CloudWatch write path for app.log.batch.
  • [ ] Add tests for threshold logic, remote override precedence, TTL expiry, and redaction.

T9. Algo2015 native log routing

  • [x] Implement BP_LOG tailer with persistent offsets and session-file discovery.
  • [x] Parse and normalize native record prefixes/key-value summaries into structured metadata.
  • [x] Map native lines to severity levels and apply threshold filtering defaults.
  • [x] Correlate native log uploads with step/session IDs and existing loop telemetry.
  • [x] Add tests for parser correctness, offset resume behavior, and duplicate suppression on relaunch.
  • [ ] Add load/volume tests to ensure native trace uploads do not starve higher-priority telemetry.

Verification Strategy (Planned)

  • Deterministic fixture-driven tests for event payload generation.
  • Fault-injection tests for network loss, auth expiry, and 4xx/5xx ingest responses.
  • A replay test proving that a single timeline can answer:
    • what devices reported
    • what the loop executed/skipped/applied/blocked
    • what alerts were shown/acknowledged/cleared
    • what user actions occurred on critical controls

Open Decisions Requiring Team Input

  • Final retention policy for event classes and timeline history depth.
  • Priority/severity matrix for notification fan-out.
  • Dashboard audiences and role-scoped query views for first release.
  • Exact cloud persistence/query stack sequence in BionicScout (minimal now vs full architecture rollout).
  • Remote override policy governance:
    • who can issue/approve an override
    • maximum allowed override duration
    • whether debug requires additional approval in production
  • Whether to keep long-term native capture on file-tail path or invest in bridge callback streaming in phase 2.

Related References

  • /Users/jcostik/BionicLoop/Docs/Planning/ExecutionPlan.md (Workstream J)
  • /Users/jcostik/BionicLoop/Docs/Planning/DevChangePlan.md (implementation touchpoints)
  • /Users/jcostik/BionicScout/Docs/Planning/CloudBackendRequirements.md
  • /Users/jcostik/BionicScout/Docs/Planning/CloudBackendExecutionStatus-2026-02-19-TelemetryIngest.md