Telemetry Cloud Integration Plan
Last updated: 2026-03-27 13:25 ET
Purpose
Define the implementation plan for sending BionicLoop runtime/device/alert/UI telemetry to the cloud endpoint in /Users/jcostik/BionicScout, with reliable delivery and timeline-quality reconstruction when a subject encounters a problem or gets stuck.
Primary backend handoff contract is maintained in:
- Docs/Planning/TelemetryCloudContractForBionicScout.md
Related development-support planning for CloudWatch-friendly integration log review:
- Docs/Planning/IntegrationLogReviewPlan.md
Phase 1 Lock Status
Phase 1 (contract lock and schema freeze) is complete for telemetry schema_version = 1.0.0.
Locked baseline includes:
- required envelope + correlation fields
- event-family payload minimums
- canonical critical UI IDs and reason-code vocabulary
- execution-time semantics for loop-step events (step_executed_at over envelope created_at for step timing)
Current Baseline (Confirmed)
App (BionicLoop)
- Basic authentication/session flow is in place.
- Local telemetry exists today:
- loop step records (LoopStepRecord) including algorithm input/output snapshots
- CGM chart history
- normalized alert lifecycle state (AppAlertCenter)
- cloud-log upload path (app.log.batch) with UTC entry timestamps
Cloud (BionicScout)
- Endpoint: POST /v1/telemetry
- Scope gate: bionicscout.dev.api/telemetry.ingest (JWT scope)
- Current required envelope fields: event_type, schema_version, subject_id, created_at, payload (object)
- Current behavior:
- validates envelope shape
- returns 202 Accepted with ingest_id
- does not yet enforce per-event schema validation/persistence
Implemented Baseline (Current App)
Implemented in BionicLoop (Workstream J current baseline):
- Shared telemetry envelope reporter with required ingest fields and correlation metadata:
- event_type, schema_version, subject_id, created_at, payload
- event_id, session_id, app_version, build_number, app_env
- Authenticated ingest transport reused from AuthenticatedAPIClient (POST /v1/telemetry).
- Non-blocking emit behavior (telemetry failures never block loop safety execution).
- Persistent telemetry outbox with:
- pending/inflight/failedPermanent states
- sequence ordering
- retry/backoff on transient failures
- permanent-failure classification for non-429 4xx
- queue-cap drop policy and dropped-event telemetry
- Runtime source wiring:
- loop.session.armed
- loop.session.reset
- loop.step.executed
- loop.step.skipped
- loop.command.requested
- loop.command.applied
- loop.command.blocked
- command telemetry guardrails:
- loop.command.blocked only when a recommendation existed but could not be applied
- cadence-only skips without recommendation do not emit command-block events
- loop-command payloads preserve command_outcome (applied, blocked, uncertain) so ambiguous pump command outcomes are not flattened in cloud processing
- App/auth source wiring:
- app.lifecycle.launched
- app.lifecycle.foregrounded
- app.lifecycle.backgrounded
- auth.session.authenticated
- auth.session.signed_out
- auth.session.restore_failed
- lifecycle payload now includes timezone + clock-check fields:
- device_timezone_id, device_utc_offset_seconds
- clock_check_result, clock_check_skew_seconds, clock_check_rtt_ms, clock_check_at_utc
- UTC check trigger policy:
- launch (app.lifecycle.launched)
- foreground only when last successful check is older than 24 hours
- timezone/significant-time-change notifications emit app.lifecycle.foregrounded with reason = timezone_or_time_changed
- drift warning policy:
- app emits actionable alert ALERT-APP-CLOCK-SKEW when abs(skew_seconds) > 600
- warning is rate-limited to once per 24 hours
- network/unavailable UTC checks do not raise user-facing warnings
- CGM source wiring:
- cgm.reading.processed
- cgm.reading.masked
- cgm.connection.changed
- cgm.state.changed
- payload guardrail:
- processed/masked payloads now emit typed reliability/value/timestamp fields plus source_state
- processed payload includes trend fields (trend, trend_type) for Home parity
- Pump source wiring:
- pump.connection.changed
- pump.status.refreshed
- pump.command.result
- pump.pod.lifecycle
- event schema guardrail:
- pump.command.result is reserved for pump delivery/result payload shape and is not reused for loop-command-block payloads
- pump.command.result emission is change-driven (step/requested/delivered/state delta) to suppress unchanged refresh duplicates
- pump.pod.lifecycle now includes status snapshot fields (has_active_pod, has_established_session, delivery_state, reservoir_level_u)
- Alert lifecycle source wiring:
- alert.issued
- alert.retracted
- alert.acknowledged
- alert.notification.scheduled
- alert.notification.cleared
- alert.notification.tapped
- payload parity:
- alert lifecycle events include title and recommended_action in addition to message/severity/dedupe fields
- alert.notification.cleared emits only on an actual active-alert removal transition (idempotent retract calls on already-cleared keys do not emit)
- OS-notification clear requests run on every retract attempt (idempotent clear path) to cover async schedule/retract races even when no active alert remains
- source emitters suppress no-op retract fanout where possible (pump-expiration planner and CGM sync now issue retract actions only for currently active dedupe keys)
- Home condition sync suppresses no-op clear calls when condition is already clear and no active alert/debounce task exists
- Critical UI source wiring (partial baseline):
- ui.critical.tap
- ui.critical.submit
- ui.critical.cancel
- ui.critical.blocked
- ui.critical.state_viewed
- currently instrumented flows:
- Home Let's Eat tap and blocked reasons
- meal modal viewed/submit/cancel
- Home manual BG tap (element_id: home.bg_button)
- BG modal viewed/cancel and runtime submit-block reasons
- submit outcome semantics:
- manual BG submit is runtime-authoritative only
- meal submit uses a correlated lifecycle: submitted on user confirm, then runtime-emitted accepted / success / blocked / uncertain / resolved outcomes keyed by the meal flow_id
- Cloud log upload baseline wiring:
- app.log.batch
- local threshold + remote override policy model
- integration-test session logging baseline:
- persisted test_run_id session state
- persisted last-session summary so the prior run ID remains copyable after stop/expiry
- session-specific upload threshold override
- per-entry session metadata on app.log.batch
- explicit integration_test_session_started|stopped|expired marker batches
- DEBUG Home settings controls to start/stop sessions and copy either the active or most recent run ID
- reviewer workflow currently depends on the tester-provided run ID and UTC window because CloudWatch fanout still needs end-to-end metadata preservation verification
- Algo2015 native diagnostics (BP_LOG_<timestamp>.txt) are now tailed and emitted through app.log.batch with:
- persistent cursor checkpoint (fileName + offset) for relaunch resume
- prefix/summary parsing (A/B/C/D/I/G/~I/~G/AD/P/PA/MB/S, STEP=, *OUT:)
- severity mapping and step_hint extraction into metadata
- upload-cap enforcement with explicit drop-summary line to bound volume
- next development-support slice:
- add operator affordances such as scenario presets/share flow and canned CloudWatch retrieval guidance
- Auth/session guardrail:
- recovered/refreshed tokens are validated for telemetry ingest scope before persistence, avoiding re-persisting unusable scoped-out token sets
- CGM state guardrail:
- cgm.state.changed emission derives has_sensor from the latest callback/refresh state to avoid stale transition telemetry
- cgm.connection.changed is emitted after refresh so status_text aligns with current lifecycle state
- Structured algorithm inspection migration:
- algorithm.session.snapshot and algorithm.step.snapshot are now the app-side structured inspection source of truth during migration.
- Local BP_*.txt / BP_LOG_*.txt remain debug/equivalence artifacts only.
- Current structured BP coverage:
- full BP matrix row (bp)
- BP_LOG families A, I, G, ~I, ~G, AD/G24h, PA, P, B, C, D, S
- Remaining app-side open fields stay explicitly open:
- algorithm_build_id
- pump_id
- Implementation mirror: Docs/Planning/AlgorithmInspectionStructuredTelemetryPlan.md
Implementation references:
- /Users/jcostik/BionicLoop/BionicLoop/App/AuthSessionNetworking.swift (CloudTelemetryReporter)
- /Users/jcostik/BionicLoop/BionicLoop/Runtime/LoopRuntimeEngine.swift (loop/app/alert/pump emit calls)
- /Users/jcostik/BionicLoop/BionicLoop/Features/CGM/G7ViewModel.swift (CGM telemetry emit calls)
- /Users/jcostik/BionicLoop/BionicLoop/Integrations/Pump/PumpStatusObserver.swift (pump telemetry emit calls)
- /Users/jcostik/BionicLoop/BionicLoop/Integrations/Pump/AppPumpManagerDelegate.swift (pod lifecycle telemetry emit calls)
- /Users/jcostik/BionicLoop/BionicLoopTests/BionicLoopInfrastructureTests.swift (envelope/outbox/retry/policy tests)
Known limitations (next slices):
- ui.critical.* canonical control instrumentation (J3) is complete app-side; remaining work is backend validation/query shaping.
- Backend schema enforcement/idempotent persistence/DLQ/replay remain open (J5).
- Algo2015 native diagnostics are app-side complete (J9); remaining work is backend-side validation/query shaping under J5.
Telemetry Objectives
- Capture all safety-relevant app, runtime, CGM, pump, and alert lifecycle events.
- Capture critical user interactions (and blocked actions) to reconstruct incident stories.
- Deliver events reliably with offline tolerance and deterministic retry behavior.
- Preserve privacy while keeping enough context for clinical/engineering support.
- Keep app telemetry transport independent from safe local loop execution.
Canonical Envelope (App -> Cloud)
Required now (cloud-enforced):
- event_type: String
- schema_version: String
- subject_id: String
- created_at: ISO-8601 UTC
- payload: Object
Required app-side contract (implemented):
- event_id: UUID
- session_id: UUID (app session correlation)
- app_version: String
- build_number: String
- app_env: String (dev|staging|prod)
- auth_user_sub: String (Cognito subject; never email, UNSET only in unauthenticated continuity mode)
Optional but strongly recommended:
- flow_id: UUID? (multi-step UX flows like meal announce/BG entry/setup)
- device_time_offset_sec: Int?
- ingest_priority: String (normal|high)
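To make the envelope contract concrete, here is a minimal, illustrative sketch of assembling and checking an envelope from the field lists above. It is language-agnostic pseudocode in Python, not the app's CloudTelemetryReporter API; the function names and defaults are assumptions.

```python
import uuid
from datetime import datetime, timezone

# Fields the cloud currently enforces on ingest.
REQUIRED_FIELDS = ("event_type", "schema_version", "subject_id", "created_at", "payload")

def make_envelope(event_type, subject_id, payload, *, session_id, app_version,
                  build_number, app_env, schema_version="1.0.0"):
    """Build a telemetry envelope combining cloud-enforced and app-side contract fields."""
    return {
        # Cloud-enforced fields
        "event_type": event_type,
        "schema_version": schema_version,
        "subject_id": subject_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
        # App-side contract fields (implemented)
        "event_id": str(uuid.uuid4()),
        "session_id": session_id,
        "app_version": app_version,
        "build_number": build_number,
        "app_env": app_env,
    }

def missing_required(envelope):
    """Return the cloud-enforced fields absent from an envelope."""
    return [f for f in REQUIRED_FIELDS if f not in envelope]
```

Note that event_id is generated once at envelope creation so retried uploads replay the same identity, matching the at-least-once delivery model described later.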
Event Taxonomy (Phase 1 Locked Set)
schema_version starts at 1.0.0 for all events in this phase.
1) App/Auth Lifecycle
- app.lifecycle.launched
- app.lifecycle.foregrounded
- app.lifecycle.backgrounded
- auth.session.authenticated
- auth.session.signed_out
- auth.session.restore_failed
2) Loop Runtime and Algorithm
- loop.session.armed
- loop.session.reset
- loop.step.executed
- loop.step.skipped
- loop.command.requested
- loop.command.applied
- loop.command.blocked
Required payload fields for loop.step.*:
- expected_step, executed_step, wake_cause, skip_reason, recommendation_applied
- implemented skip reasons now include mealSlotConflict for competing-trigger meal submits that lose the observed slot before coordinator acceptance; cloud handling should preserve it as an explicit blocked meal outcome rather than collapsing it into a cadence-only skip.
- algorithm_input_snapshot (sanitized)
- algorithm_output_snapshot
3) CGM
- cgm.reading.processed
- cgm.reading.masked
- cgm.connection.changed
- cgm.state.changed
Required payload fields:
- reading_timestamp, reliable, value_mgdl?, trend?, mask_reason?, source_state
4) Pump
- pump.connection.changed
- pump.status.refreshed
- pump.command.result
- pump.pod.lifecycle
Required payload fields:
- delivery_state, reservoir_units, insulin_delivered_total, bolus_not_delivered, pod_active, error_code?
5) Alerts
- alert.issued
- alert.retracted
- alert.acknowledged
- alert.notification.scheduled
- alert.notification.cleared
- alert.notification.tapped
Required payload fields:
- alert_code, source, severity, dedupe_key, requires_acknowledge, ack_state, message
6) Critical UI Interaction Telemetry (Story Reconstruction)
Primary events:
- ui.critical.tap
- ui.critical.submit
- ui.critical.cancel
- ui.critical.blocked
- ui.critical.state_viewed
Required payload fields:
- screen_id
- element_id
- action
- result (success|blocked|failed|cancelled)
- reason (required for blocked/failed)
- flow_id (required for multi-step flows)
- linked_step?
- linked_alert_code?
Critical elements (minimum):
- home.start_algo_button
- home.reset_algo_button
- home.lets_eat_button
- meal.modal.meal_type_selector
- meal.modal.carb_relative_selector
- meal.modal.deliver_slider
- meal.modal.cancel_button
- home.bg_button
- bg.modal.value_picker
- bg.modal.submit_button
- bg.modal.cancel_button
- alerts.banner.acknowledge_button
- settings.open_cgm_settings
- settings.open_pump_settings
- cgm.setup.continue_button
- cgm.setup.cancel_button
- pump.setup.continue_button
- pump.setup.cancel_button
7) Telemetry System Health (self-observability)
- telemetry.outbox.enqueued
- telemetry.flush.started
- telemetry.flush.succeeded
- telemetry.flush.failed
- telemetry.event.dropped
Required payload fields:
- queue depth, batch size, retry count, error class/code, dropped_count, oldest_pending_age_sec
CloudWatch App Logging Plan (Severity-Filtered Upload)
Goals
- Default cloud log upload to high-signal errors only.
- Allow local settings control to increase verbosity (debug, info, warning, error).
- Ensure selected threshold behavior is inclusive: the selected level and higher severities are uploaded.
- Add long-term remote override so support teams can increase subject logging without asking the subject to change settings.
Level model
- debug (lowest)
- info
- warning
- error (highest)
Default upload threshold: error.
Effective threshold resolution (current precedence)
- Active remote override (if present and not expired)
- Local user-selected threshold in settings
- App default (error)
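The precedence and inclusive-filtering rules above can be sketched as follows. This is an illustrative model, not CloudLogUploadPolicy's actual API; the function names are hypothetical.

```python
from datetime import datetime

LEVELS = {"debug": 0, "info": 1, "warning": 2, "error": 3}
DEFAULT_THRESHOLD = "error"

def effective_threshold(remote_override, local_threshold, now):
    """Resolve the upload threshold: remote override -> local setting -> app default."""
    if remote_override is not None:
        level, expires_at = remote_override
        if expires_at > now:          # override only applies while unexpired
            return level
    if local_threshold in LEVELS:     # unknown/invalid values fall back to default
        return local_threshold
    return DEFAULT_THRESHOLD

def should_upload(entry_level, threshold):
    """Inclusive filter: the selected level and all higher severities upload."""
    return LEVELS[entry_level] >= LEVELS[threshold]
```

An expired override auto-reverts to the local/default threshold, matching the TTL behavior described in the remote-override plan below.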
App-side design
- Add a structured app logger sink that emits canonical log DTOs:
- timestamp, level, subsystem, category, message_template, metadata (allowlisted), session_id, flow_id?, subject_id
- Keep full local logging behavior unchanged; apply filtering only to the cloud-upload stream.
- Batch logs for upload as telemetry events (event_type: app.log.batch).
- Enforce client-side guardrails:
- per-batch entry cap
- max payload size
- redaction/allowlist before enqueue
- rate limits for low-level logs (debug/info) to control cost/noise
Settings UX status
- Implemented (debug builds): settings control "Min Upload Level" in Home settings.
- Options: Error (default), Warning, Info, Debug.
- Persisted via CloudLogUploadPolicy.localThresholdKey.
- Filtering behavior is inclusive (selected level and higher severities upload).
- Unit coverage currently includes default/invalid fallback, persisted threshold behavior, and remote-override precedence/expiry.
Remote override plan (long-term)
- Add backend config endpoint delivering subject-scoped logging policy:
- upload_level, expires_at, reason_code, issued_by
- App refreshes policy:
- at launch
- periodic foreground interval
- on silent push-triggered refresh (future)
- Override is temporary by design (TTL required) and auto-reverts to local/default when expired.
- All override applications/reversions emit telemetry audit events.
CloudWatch ingestion path (planned)
- BionicScout receives app.log.batch via /v1/telemetry.
- Ingest validates the log payload schema and writes structured entries to dedicated CloudWatch Log Group(s), partitioned by environment.
- Add subscription/retention policy:
- short retention for debug/info
- longer retention for warning/error
- Keep idempotency keying to prevent duplicate ingestion during retry.
Safety/privacy constraints
- No credentials/tokens/raw PHI in log payloads.
- Strict metadata allowlist for cloud-uploaded logs.
- Redaction tests required for sensitive fields.
- Logging transport failures must not affect loop safety behavior.
Algo2015 Native C++ Log Stream -> Cloud Plan
Current state (observed in source)
- Algo2015 emits verbose textual logs in two channels on Apple builds:
- cout lines (USE_COUT enabled in Algorithm_2015_10_13.cpp)
- session file BP_LOG_<timestamp>.txt in the app Documents directory via RecordConsole()
- The native stream includes compact record prefixes and diagnostics, for example:
- A/B/C/D/I/G/~I/~G/AD/P/PA/S records in LogFile
- console summary lines like STEP=... and *OUT: ...
- File writes are flushed during step execution; the stream is append-only per session.
- Separate algorithm data dumps (BP_<timestamp>.txt, matrix helpers) are not appropriate for routine cloud upload.
Design goals
- Route Algo2015 native diagnostics to cloud without changing algorithm dosing behavior.
- Avoid scraping Xcode/device console; use deterministic in-app capture.
- Keep default upload volume low (error threshold), with optional controlled escalation.
- Preserve enough trace detail to correlate with step telemetry for incident reconstruction.
Recommended capture strategy (phased)
AL1. Near-term (low-risk): file-tail capture from BP_LOG_*.txt
- Implement an app-side tailer that:
- discovers the current BP_LOG_*.txt for the active algorithm session
- persists a byte-offset checkpoint per log file
- reads only appended lines after each step / periodic flush
- Convert appended lines into structured log DTO entries with:
- subsystem = algo2015.native, category = bp_log
- raw_line, record_prefix (if present), step_hint (parsed from the line where available)
- Enqueue through the shared cloud log pipeline (app.log.batch).
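The byte-offset tail in AL1 can be sketched as below. This is an illustrative model over an in-memory stream; the real tailer reads BP_LOG_*.txt from the Documents directory and persists its checkpoint across relaunches.

```python
import io

def tail_appended_lines(stream, checkpoint_offset):
    """Return (new_lines, new_offset), reading only bytes appended since the checkpoint."""
    stream.seek(checkpoint_offset)
    data = stream.read()
    new_offset = checkpoint_offset + len(data)
    lines = data.split(b"\n")
    # Hold back a trailing partial line until the writer flushes its newline,
    # so the next pass re-reads it from the rolled-back offset.
    if lines and lines[-1] != b"":
        new_offset -= len(lines[-1])
    complete = [ln.decode("utf-8") for ln in lines[:-1]]
    return complete, new_offset
```

Persisting (fileName, new_offset) after each pass is what makes relaunch resume and duplicate suppression deterministic.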
AL2. Parsing and normalization
- Add a lightweight parser for known prefix records (A/B/C/D/...) and key-value summaries (STEP=, *OUT:).
- Populate parsed metadata fields for queryability:
- algo_step, cgm_value, bg_value, ireq, idel, etc., when parseable
- Preserve the raw line alongside the parsed map for forensic continuity.
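A minimal sketch of the AL2 normalization step: extract a record prefix and any key=value summary fields while always preserving the raw line. The prefix set mirrors the lists above, but treating the prefix as the line's first whitespace-delimited token, and the key=value shapes, are assumptions about the native format.

```python
import re

# Known record prefixes from the BP_LOG families listed above.
PREFIXES = {"A", "B", "C", "D", "I", "G", "~I", "~G", "AD", "P", "PA", "MB", "S"}

def parse_native_line(raw_line):
    """Normalize one native log line into a DTO-style dict, keeping the raw line."""
    entry = {"raw_line": raw_line, "metadata": {}}
    token = raw_line.split(" ", 1)[0]
    if token in PREFIXES:               # exact token match, so "~I" never collides with "I"
        entry["record_prefix"] = token
    for key, value in re.findall(r"(\w+)=(\S+)", raw_line):
        entry["metadata"][key.lower()] = value
    return entry
```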
AL3. Severity mapping
- Map native lines to cloud log levels:
- error: lines containing hard failures (e.g., FAILED WRITE, fatal execution exits)
- warning: warning/degraded-path markers (WARNING, NO-GO, algorithm non-execution notices)
- info/debug: routine per-step trace lines (A/B/C/D, STEP=, *OUT)
- Apply the existing threshold model:
- default upload: error
- include lower levels only when the local/remote threshold allows.
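The AL3 mapping can be sketched as a simple classifier. The marker strings come from the bullets above; treating them as substring/prefix checks is an assumption about how the native lines are shaped.

```python
def native_line_level(raw_line):
    """Map a native Algo2015 log line to a cloud log level per the AL3 rules."""
    if "FAILED WRITE" in raw_line:          # hard failure markers -> error
        return "error"
    if "WARNING" in raw_line or "NO-GO" in raw_line:  # degraded-path markers
        return "warning"
    if raw_line.startswith("STEP=") or raw_line.startswith("*OUT"):
        return "info"                        # per-step summary lines
    return "debug"                           # routine trace records (A/B/C/D, ...)
```

With the default error threshold, only the first branch uploads; the rest surface only under an escalated local or remote threshold.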
AL4. Correlation and volume controls
- Attach correlation IDs to each uploaded native line:
- session_id, flow_id?, subject_id, event_id
- algo_step where parsed
- Add guardrails:
- per-step max native lines uploaded
- batching and rate limits for info/debug
- drop-summary telemetry when limits are exceeded
AL5. Long-term optimization (optional)
- Replace/augment file-tail capture with bridge callback sink for direct line streaming from C++.
- Keep file-tail path as fallback for compatibility.
- Use callback path to reduce file I/O and simplify session/file-discovery logic.
Reliability Model (App)
Delivery guarantees
- Target: at-least-once delivery with idempotent replay.
- Event identity: event_id generated once at enqueue time.
- Ordering: preserve local event order using a monotonic sequence (sequence_no).
Outbox design
- Persistent outbox states: pending, inflight, acked, failed_permanent
- Outbox survives app relaunch.
- Safety-critical events (alert.*, loop.command.blocked, severe device faults) use priority flush.
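The outbox states imply a small state machine; the transition set below is inferred from this section's retry semantics (inflight events return to pending on transient failure) and is a sketch, not the implementation's exact model.

```python
# Allowed outbox transitions; terminal states have no outgoing edges.
TRANSITIONS = {
    "pending": {"inflight"},
    "inflight": {"acked", "pending", "failed_permanent"},  # pending = transient retry
    "acked": set(),
    "failed_permanent": set(),
}

def transition(state, new_state):
    """Apply a transition, rejecting anything outside the allowed set."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal outbox transition {state} -> {new_state}")
    return new_state
```

Making terminal states strict is what keeps relaunch-time recovery deterministic: a persisted acked or failed_permanent record can never re-enter the flush path.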
Flush triggers
- app foreground
- enqueue when network reachable
- periodic timer while active
- background task windows
Retry policy
- Exponential backoff + jitter for transient failures.
- 4xx schema/auth failures -> mark permanent, emit a telemetry.event.dropped summary event.
- No retry path may block local loop operation.
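The failure classification and backoff policy above can be sketched as follows; the base delay and cap are illustrative assumptions, and the non-429 4xx rule matches the permanent-failure classification described in the implemented baseline.

```python
import random

def is_permanent_failure(status_code):
    """Non-429 4xx responses are permanent; 429 and 5xx are transient and retryable."""
    return 400 <= status_code < 500 and status_code != 429

def backoff_seconds(attempt, base=2.0, cap=300.0, rng=random.random):
    """Exponential backoff with jitter: delay doubles per attempt, capped, then jittered."""
    delay = min(cap, base * (2 ** attempt))
    return delay * (0.5 + 0.5 * rng())  # jitter keeps retries in [0.5x, 1.0x] of the delay
```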
Backpressure and retention
- Queue cap with explicit drop policy:
- never drop active safety-critical alerts before lower-priority UX events.
- emit deterministic drop summaries.
Cloud Evolution Plan (BionicScout)
C1. Contract hardening
- Add a per-event schema registry keyed by event_type + schema_version.
- Reject unknown versions with explicit error details.
C2. Durable ingest
- Persist accepted events with an idempotency key (subject_id, event_id).
- Add replay-safe writes and DLQ workflows.
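A sketch of idempotent ingest keyed by (subject_id, event_id), as in C2. The in-memory dict stands in for the durable backend store; the class name is illustrative.

```python
class IdempotentStore:
    """Duplicate-suppressing event store keyed by (subject_id, event_id)."""

    def __init__(self):
        self._events = {}

    def persist(self, envelope):
        """Return True if stored, False if an at-least-once replay was ignored."""
        key = (envelope["subject_id"], envelope["event_id"])
        if key in self._events:
            return False
        self._events[key] = envelope
        return True
```

This is what makes the app's at-least-once delivery safe: a retried upload of the same envelope is acknowledged without creating a second timeline entry.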
C3. Routing and analytics
- Route to durable store and query path for timeline reconstruction.
- Keep retention/lifecycle classed by data criticality.
C4. Incident timeline support
- Query by subject_id, session_id, flow_id, and time window.
- Build “what happened” timeline from device state + loop action + user action + alert lifecycle.
Phased Implementation Checklist
T1. Contract and schema freeze
- [x] Freeze event-type catalog and payload contracts.
- [x] Publish JSON examples for each event family.
- [x] Align app/cloud enum constants and naming rules.
T2. App telemetry SDK boundary
- [x] Define a single app telemetry emitter interface and DTO model.
- [x] Define source adapters for runtime, CGM, pump, alerts, and UI features.
- [x] Add field-level redaction/allowlist rules.
T3. Outbox and transport reliability
- [x] Implement persistent outbox with sequence/idempotency metadata.
- [x] Implement flush worker (batch, retry, backoff, priority path).
- [x] Add transport health metrics and drop accounting events.
T4. Source instrumentation
- [x] Wire loop/runtime/algorithm events.
- [x] Wire CGM and pump state/command events.
- [x] Wire full alert lifecycle events.
- [x] Wire critical UI interactions for flow storytelling.
T5. Cloud-side contract enforcement
- [x] Publish strict J5 backend contract packet (validation order, canonical error codes, idempotency conflict semantics, contract-test matrix) in TelemetryCloudContractForBionicScout.md.
- [ ] Implement per-event schema validation in the ingest service.
- [ ] Implement idempotent durable persistence.
- [ ] Add DLQ triage/replay tooling.
T6. Story reconstruction workflows
- [ ] Define incident timeline query API/report format.
- [x] Add support playbook examples for “subject stuck” scenarios.
- [ ] Validate timeline completeness against known failure drills.
Minimum “subject stuck” timeline completeness criteria (app + backend contract):
- At least one app.lifecycle.* event in the episode window with session_id and app_env.
- Full ui.critical.* chain for the blocked workflow (tap -> optional state_viewed -> submit/cancel/blocked) with stable flow_id.
- Corresponding loop/pump/cgm state transitions (loop.step.*, pump.status.refreshed, cgm.connection.changed, cgm.state.changed) within the same time window.
- Alert lifecycle continuity for involved alerts (alert.issued and alert.retracted or alert.acknowledged) with stable dedupe_key.
- If a command was blocked, include a machine reason (reason) and related context (linked_alert_code when applicable).
- If a step executed, timeline consumers must prioritize payload.step_executed_at over envelope created_at for algorithm-time reconstruction.
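The execution-time rule in the last criterion can be sketched as a timestamp selector for timeline consumers; the function name is illustrative.

```python
def timeline_timestamp(event):
    """Prefer payload.step_executed_at over envelope created_at for loop-step events."""
    if event.get("event_type", "").startswith("loop.step."):
        executed_at = event.get("payload", {}).get("step_executed_at")
        if executed_at is not None:
            return executed_at
    return event["created_at"]
```

This matters because the envelope created_at reflects enqueue time, which can lag the algorithm's actual execution time when events flush from the outbox after a delay.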
T7. Verification and evidence
- [ ] Add unit tests for event encoding/redaction/outbox state machine.
- [ ] Add integration tests for auth scope and ingest responses.
- [ ] Add end-to-end replay tests for incident storytelling completeness.
- [ ] Publish STR-CLOUD-* evidence artifacts for traceability.
T8. CloudWatch severity-filtered logging rollout
- [x] Implement app log DTO schema and cloud-upload logger sink.
- [x] Implement threshold filter logic (selected + above) with default error.
- [x] Add settings UI for local threshold selection and persistence.
- [ ] Add remote override policy fetch/apply/expiry behavior with audit telemetry.
- [ ] Add cloud validation and a CloudWatch write path for app.log.batch.
- [ ] Add tests for threshold logic, remote override precedence, TTL expiry, and redaction.
T9. Algo2015 native log routing
- [x] Implement BP_LOG tailer with persistent offsets and session-file discovery.
- [x] Parse and normalize native record prefixes/key-value summaries into structured metadata.
- [x] Map native lines to severity levels and apply threshold filtering defaults.
- [x] Correlate native log uploads with step/session IDs and existing loop telemetry.
- [x] Add tests for parser correctness, offset resume behavior, and duplicate suppression on relaunch.
- [ ] Add load/volume tests to ensure native trace uploads do not starve higher-priority telemetry.
Verification Strategy (Planned)
- Deterministic fixture-driven tests for event payload generation.
- Fault-injection tests for network loss, auth expiry, and 4xx/5xx ingest responses.
- Replay test that proves a single timeline can answer:
- what devices reported
- what loop executed/skipped/applied/blocked
- what alerts were shown/acknowledged/cleared
- what user actions occurred on critical controls
Open Decisions Requiring Team Input
- Final retention policy for event classes and timeline history depth.
- Priority/severity matrix for notification fan-out.
- Dashboard audiences and role-scoped query views for first release.
- Exact cloud persistence/query stack sequence in BionicScout (minimal now vs. full architecture rollout).
- Remote override policy governance:
- who can issue/approve override
- max allowed override duration
- whether debug requires additional approval in production
- Whether to keep long-term native capture on the file-tail path or invest in bridge-callback streaming in phase 2.
Related Documents
- /Users/jcostik/BionicLoop/Docs/Planning/ExecutionPlan.md (Workstream J)
- /Users/jcostik/BionicLoop/Docs/Planning/DevChangePlan.md (implementation touchpoints)
- /Users/jcostik/BionicScout/Docs/Planning/CloudBackendRequirements.md
- /Users/jcostik/BionicScout/Docs/Planning/CloudBackendExecutionStatus-2026-02-19-TelemetryIngest.md