Telemetry Cloud Integration Plan
Last updated: 2026-03-27 13:25 ET
Purpose
Define the implementation plan for sending BionicLoop runtime/device/alert/UI telemetry to the cloud endpoint in /Users/jcostik/BionicScout, with reliable delivery and timeline-quality reconstruction when a subject encounters a problem or gets stuck.
Primary backend handoff contract is maintained in:
- Docs/Planning/TelemetryCloudContractForBionicScout.md
Related development-support planning for CloudWatch-friendly integration log review:
- Docs/Planning/IntegrationLogReviewPlan.md
Phase 1 Lock Status
Phase 1 (contract lock and schema freeze) is complete for telemetry schema_version = 1.0.0.
Locked baseline includes:
- required envelope + correlation fields
- event-family payload minimums
- canonical critical UI IDs and reason-code vocabulary
- execution-time semantics for loop-step events (step_executed_at over envelope created_at for step timing)
Current Baseline (Confirmed)
App (BionicLoop)
- Basic authentication/session flow is in place.
- Local telemetry exists today:
- loop step records (LoopStepRecord) including algorithm input/output snapshots
- CGM chart history
- normalized alert lifecycle state (AppAlertCenter)
- cloud-log upload path (app.log.batch) with UTC entry timestamps
Cloud (BionicScout)
- Endpoint: POST /v1/telemetry
- Scope gate: bionicscout.dev.api/telemetry.ingest (JWT scope)
- Current required envelope fields: event_type, schema_version, subject_id, created_at, payload (object)
- Current behavior:
- validates envelope shape
- returns 202 Accepted with ingest_id
- does not yet enforce per-event schema validation/persistence
Implemented Baseline (Current App)
Implemented in BionicLoop (Workstream J current baseline):
- Shared telemetry envelope reporter with required ingest fields and correlation metadata:
- event_type, schema_version, subject_id, created_at, payload
- event_id, session_id, app_version, build_number, app_env
- Authenticated ingest transport reused from AuthenticatedAPIClient (POST /v1/telemetry).
- Non-blocking emit behavior (telemetry failures never block loop safety execution).
- Persistent telemetry outbox with:
- pending/inflight/failedPermanent states
- sequence ordering
- retry/backoff on transient failures
- permanent-failure classification for non-429 4xx
- queue-cap drop policy and dropped-event telemetry
- Runtime source wiring:
- loop.session.armed
- loop.session.reset
- loop.step.executed
- loop.step.skipped
- loop.command.requested
- loop.command.applied
- loop.command.blocked
- command telemetry guardrails:
- loop.command.blocked only when a recommendation existed but could not be applied
- cadence-only skips without recommendation do not emit command-block events
- loop-command payloads preserve command_outcome (applied, blocked, uncertain) so ambiguous pump command outcomes are not flattened in cloud processing
- App/auth source wiring:
- app.lifecycle.launched
- app.lifecycle.foregrounded
- app.lifecycle.backgrounded
- auth.session.authenticated
- auth.session.signed_out
- auth.session.restore_failed
- lifecycle payload now includes timezone + clock-check fields:
- device_timezone_id, device_utc_offset_seconds
- clock_check_result, clock_check_skew_seconds, clock_check_rtt_ms, clock_check_at_utc
- UTC check trigger policy:
- launch (app.lifecycle.launched)
- foreground only when last successful check is older than 24 hours
- timezone/significant-time-change notifications emit app.lifecycle.foregrounded with reason = timezone_or_time_changed
- drift warning policy:
- app emits actionable alert ALERT-APP-CLOCK-SKEW when abs(skew_seconds) > 600
- warning is rate-limited to once per 24 hours
- network/unavailable UTC checks do not raise user-facing warnings
- CGM source wiring:
- cgm.reading.processed
- cgm.reading.masked
- cgm.connection.changed
- cgm.state.changed
- payload guardrail:
- processed/masked payloads now emit typed reliability/value/timestamp fields plus source_state
- processed payload includes trend fields (trend, trend_type) for Home parity
- Pump source wiring:
- pump.connection.changed
- pump.status.refreshed
- pump.command.result
- pump.pod.lifecycle
- event schema guardrail:
- pump.command.result is reserved for pump delivery/result payload shape and is not reused for loop-command-block payloads
- pump.command.result emission is change-driven (step/requested/delivered/state delta) to suppress unchanged refresh duplicates
- pump.pod.lifecycle now includes status snapshot fields (has_active_pod, has_established_session, delivery_state, reservoir_level_u)
- Alert lifecycle source wiring:
- alert.issued
- alert.retracted
- alert.acknowledged
- alert.notification.scheduled
- alert.notification.cleared
- alert.notification.tapped
- payload parity:
- alert lifecycle events include title and recommended_action in addition to message/severity/dedupe fields
- alert.notification.cleared emits only on an actual active-alert removal transition (idempotent retract calls on already-cleared keys do not emit)
- OS-notification clear requests run on every retract attempt (idempotent clear path) to cover async schedule/retract races even when no active alert remains
- source emitters suppress no-op retract fanout where possible (pump-expiration planner and CGM sync now issue retract actions only for currently active dedupe keys)
- Home condition sync suppresses no-op clear calls when condition is already clear and no active alert/debounce task exists
- Critical UI source wiring (partial baseline):
- ui.critical.tap
- ui.critical.submit
- ui.critical.cancel
- ui.critical.blocked
- ui.critical.state_viewed
- currently instrumented flows:
- Home Let's Eat tap and blocked reasons
- meal modal viewed/submit/cancel
- Home manual BG tap (element_id: home.bg_button)
- BG modal viewed/cancel and runtime submit-block reasons
- submit outcome semantics:
- manual BG submit is runtime-authoritative only
- meal submit uses a correlated lifecycle: submitted on user confirm, then runtime-emitted accepted / success / blocked / uncertain / resolved outcomes keyed by the meal flow_id
- Cloud log upload baseline wiring:
- app.log.batch
- local threshold + remote override policy model
- integration-test session logging baseline:
- persisted test_run_id session state
- persisted last-session summary so the prior run ID remains copyable after stop/expiry
- session-specific upload threshold override
- per-entry session metadata on app.log.batch
- explicit integration_test_session_started|stopped|expired marker batches
- DEBUG Home settings controls to start/stop sessions and copy either the active or most recent run ID
- reviewer workflow currently depends on the tester-provided run ID and UTC window because CloudWatch fanout still needs end-to-end metadata preservation verification
- Algo2015 native diagnostics (BP_LOG_<timestamp>.txt) are now tailed and emitted through app.log.batch with:
- persistent cursor checkpoint (fileName + offset) for relaunch resume
- prefix/summary parsing (A/B/C/D/I/G/~I/~G/AD/P/PA/MB/S, STEP=, *OUT:)
- severity mapping and step_hint extraction into metadata
- upload-cap enforcement with explicit drop-summary line to bound volume
- next development-support slice:
- add operator affordances such as scenario presets/share flow and canned CloudWatch retrieval guidance
- Auth/session guardrail:
- recovered/refreshed tokens are validated for telemetry ingest scope before persistence, avoiding re-persisting unusable scoped-out token sets
- CGM state guardrail:
- cgm.state.changed emission derives has_sensor from the latest callback/refresh state to avoid stale transition telemetry
- cgm.connection.changed is emitted after refresh so status_text aligns with current lifecycle state
- Structured algorithm inspection migration:
- algorithm.session.snapshot and algorithm.step.snapshot are now the app-side structured inspection source of truth during migration.
- Local BP_*.txt / BP_LOG_*.txt remain debug/equivalence artifacts only.
- Current structured BP coverage:
- full BP matrix row (bp)
- BP_LOG families A, I, G, ~I, ~G, AD/G24h, PA, P, B, C, D, S
- Remaining app-side open fields stay explicitly open:
- algorithm_build_id
- pump_id
- Implementation mirror: Docs/Planning/AlgorithmInspectionStructuredTelemetryPlan.md
Implementation references:
- /Users/jcostik/BionicLoop/BionicLoop/App/AuthSessionNetworking.swift (CloudTelemetryReporter)
- /Users/jcostik/BionicLoop/BionicLoop/Runtime/LoopRuntimeEngine.swift (loop/app/alert/pump emit calls)
- /Users/jcostik/BionicLoop/BionicLoop/Features/CGM/G7ViewModel.swift (CGM telemetry emit calls)
- /Users/jcostik/BionicLoop/BionicLoop/Integrations/Pump/PumpStatusObserver.swift (pump telemetry emit calls)
- /Users/jcostik/BionicLoop/BionicLoop/Integrations/Pump/AppPumpManagerDelegate.swift (pod lifecycle telemetry emit calls)
- /Users/jcostik/BionicLoop/BionicLoopTests/BionicLoopInfrastructureTests.swift (envelope/outbox/retry/policy tests)
Known limitations (next slices):
- ui.critical.* canonical control instrumentation (J3) is complete app-side; remaining work is backend validation/query shaping.
- Backend schema enforcement/idempotent persistence/DLQ/replay remain open (J5).
- Algo2015 native diagnostics are app-side complete (J9); remaining work is backend-side validation/query shaping under J5.
Telemetry Objectives
- Capture all safety-relevant app, runtime, CGM, pump, and alert lifecycle events.
- Capture critical user interactions (and blocked actions) to reconstruct incident stories.
- Deliver events reliably with offline tolerance and deterministic retry behavior.
- Preserve privacy while keeping enough context for clinical/engineering support.
- Keep app telemetry transport independent from safe local loop execution.
Canonical Envelope (App -> Cloud)
Required now (cloud-enforced):
- event_type: String
- schema_version: String
- subject_id: String
- created_at: ISO-8601 UTC
- payload: Object
Required app-side contract (implemented):
- event_id: UUID
- session_id: UUID (app session correlation)
- app_version: String
- build_number: String
- app_env: String (dev|staging|prod)
- auth_user_sub: String (Cognito subject; never email, UNSET only in unauthenticated continuity mode)
Optional but strongly recommended:
- flow_id: UUID? (multi-step UX flows like meal announce/BG entry/setup)
- device_time_offset_sec: Int?
- ingest_priority: String (normal|high)
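To make the envelope contract concrete, here is a minimal, illustrative sketch of assembling and checking an envelope from the field lists above. It is language-agnostic pseudocode in Python, not the app's CloudTelemetryReporter API; the function names and defaults are assumptions.

```python
import uuid
from datetime import datetime, timezone

# Fields the cloud currently enforces on ingest.
REQUIRED_FIELDS = ("event_type", "schema_version", "subject_id", "created_at", "payload")

def make_envelope(event_type, subject_id, payload, *, session_id, app_version,
                  build_number, app_env, schema_version="1.0.0"):
    """Build a telemetry envelope combining cloud-enforced and app-side contract fields."""
    return {
        # Cloud-enforced fields
        "event_type": event_type,
        "schema_version": schema_version,
        "subject_id": subject_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
        # App-side contract fields (implemented)
        "event_id": str(uuid.uuid4()),
        "session_id": session_id,
        "app_version": app_version,
        "build_number": build_number,
        "app_env": app_env,
    }

def missing_required(envelope):
    """Return the cloud-enforced fields absent from an envelope."""
    return [f for f in REQUIRED_FIELDS if f not in envelope]
```

Note that event_id is generated once at envelope creation so retried uploads replay the same identity, matching the at-least-once delivery model described later.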
Event Taxonomy (Phase 1 Locked Set)
schema_version starts at 1.0.0 for all events in this phase.
1) App/Auth Lifecycle
- app.lifecycle.launched
- app.lifecycle.foregrounded
- app.lifecycle.backgrounded
- auth.session.authenticated
- auth.session.signed_out
- auth.session.restore_failed
2) Loop Runtime and Algorithm
- loop.session.armed
- loop.session.reset
- loop.step.executed
- loop.step.skipped
- loop.command.requested
- loop.command.applied
- loop.command.blocked
Required payload fields for loop.step.*:
- expected_step, executed_step, wake_cause, skip_reason, recommendation_applied
- implemented skip reasons now include mealSlotConflict for competing-trigger meal submits that lose the observed slot before coordinator acceptance; cloud handling should preserve it as an explicit blocked meal outcome rather than collapsing it into a cadence-only skip.
- algorithm_input_snapshot (sanitized)
- algorithm_output_snapshot
3) CGM
- cgm.reading.processed
- cgm.reading.masked
- cgm.connection.changed
- cgm.state.changed
Required payload fields:
- reading_timestamp, reliable, value_mgdl?, trend?, mask_reason?, source_state
4) Pump
- pump.connection.changed
- pump.status.refreshed
- pump.command.result
- pump.pod.lifecycle
Required payload fields:
- delivery_state, reservoir_units, insulin_delivered_total, bolus_not_delivered, pod_active, error_code?
5) Alerts
- alert.issued
- alert.retracted
- alert.acknowledged
- alert.notification.scheduled
- alert.notification.cleared
- alert.notification.tapped
Required payload fields:
- alert_code, source, severity, dedupe_key, requires_acknowledge, ack_state, message
6) Critical UI Interaction Telemetry (Story Reconstruction)
Primary events:
- ui.critical.tap
- ui.critical.submit
- ui.critical.cancel
- ui.critical.blocked
- ui.critical.state_viewed
Required payload fields:
- screen_id
- element_id
- action
- result (success|blocked|failed|cancelled)
- reason (required for blocked/failed)
- flow_id (required for multi-step flows)
- linked_step?
- linked_alert_code?
Critical elements (minimum):
- home.start_algo_button
- home.reset_algo_button
- home.lets_eat_button
- meal.modal.meal_type_selector
- meal.modal.carb_relative_selector
- meal.modal.deliver_slider
- meal.modal.cancel_button
- home.bg_button
- bg.modal.value_picker
- bg.modal.submit_button
- bg.modal.cancel_button
- alerts.banner.acknowledge_button
- settings.open_cgm_settings
- settings.open_pump_settings
- cgm.setup.continue_button
- cgm.setup.cancel_button
- pump.setup.continue_button
- pump.setup.cancel_button
7) Telemetry System Health (self-observability)
- telemetry.outbox.enqueued
- telemetry.flush.started
- telemetry.flush.succeeded
- telemetry.flush.failed
- telemetry.event.dropped
Required payload fields:
- queue depth, batch size, retry count, error class/code, dropped_count, oldest_pending_age_sec
CloudWatch App Logging Plan (Severity-Filtered Upload)
Goals
- Default cloud log upload to high-signal errors only.
- Allow local settings control to increase verbosity (debug, info, warning, error).
- Ensure selected threshold behavior is inclusive: the selected level and higher severities are uploaded.
- Add long-term remote override so support teams can increase subject logging without asking the subject to change settings.
Level model
- debug (lowest)
- info
- warning
- error (highest)
Default upload threshold: error.
Effective threshold resolution (current precedence)
- Active remote override (if present and not expired)
- Local user-selected threshold in settings
- App default (error)
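The precedence and inclusive-filtering rules above can be sketched as follows. This is an illustrative model, not CloudLogUploadPolicy's actual API; the function names are hypothetical.

```python
from datetime import datetime

LEVELS = {"debug": 0, "info": 1, "warning": 2, "error": 3}
DEFAULT_THRESHOLD = "error"

def effective_threshold(remote_override, local_threshold, now):
    """Resolve the upload threshold: remote override -> local setting -> app default."""
    if remote_override is not None:
        level, expires_at = remote_override
        if expires_at > now:          # override only applies while unexpired
            return level
    if local_threshold in LEVELS:     # unknown/invalid values fall back to default
        return local_threshold
    return DEFAULT_THRESHOLD

def should_upload(entry_level, threshold):
    """Inclusive filter: the selected level and all higher severities upload."""
    return LEVELS[entry_level] >= LEVELS[threshold]
```

An expired override auto-reverts to the local/default threshold, matching the TTL behavior described in the remote-override plan below.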
App-side design
- Add a structured app logger sink that emits canonical log DTOs:
- timestamp, level, subsystem, category, message_template, metadata (allowlisted), session_id, flow_id?, subject_id
- Keep full local logging behavior unchanged; apply filtering only to the cloud-upload stream.
- Batch logs for upload as telemetry events (event_type: app.log.batch).
- Enforce client-side guardrails:
- per-batch entry cap
- max payload size
- redaction/allowlist before enqueue
- rate limits for low-level logs (debug/info) to control cost/noise
Settings UX status
- Implemented (debug builds): settings control "Min Upload Level" in Home settings.
- Options: Error (default), Warning, Info, Debug.
- Persisted via CloudLogUploadPolicy.localThresholdKey.
- Filtering behavior is inclusive (selected level and higher severities upload).
- Unit coverage currently includes default/invalid fallback, persisted threshold behavior, and remote-override precedence/expiry.
Remote override plan (long-term)
- Add backend config endpoint delivering subject-scoped logging policy:
- upload_level, expires_at, reason_code, issued_by
- App refreshes policy:
- at launch
- periodic foreground interval
- on silent push-triggered refresh (future)
- Override is temporary by design (TTL required) and auto-reverts to local/default when expired.
- All override applications/reversions emit telemetry audit events.
CloudWatch ingestion path (planned)
- BionicScout receives app.log.batch via /v1/telemetry.
- Ingest validates the log payload schema and writes structured entries to dedicated CloudWatch Log Group(s), partitioned by environment.
- Add subscription/retention policy:
- short retention for debug/info
- longer retention for warning/error
- Keep idempotency keying to prevent duplicate ingestion during retry.
Safety/privacy constraints
- No credentials/tokens/raw PHI in log payloads.
- Strict metadata allowlist for cloud-uploaded logs.
- Redaction tests required for sensitive fields.
- Logging transport failures must not affect loop safety behavior.
Algo2015 Native C++ Log Stream -> Cloud Plan
Current state (observed in source)
- Algo2015 emits verbose textual logs in two channels on Apple builds:
- cout lines (USE_COUT enabled in Algorithm_2015_10_13.cpp)
- session file BP_LOG_<timestamp>.txt in the app Documents directory via RecordConsole()
- The native stream includes compact record prefixes and diagnostics, for example:
- A/B/C/D/I/G/~I/~G/AD/P/PA/S records in LogFile
- console summary lines like STEP=... and *OUT: ...
- File writes are flushed during step execution; the stream is append-only per session.
- Separate algorithm data dumps (BP_<timestamp>.txt, matrix helpers) are not appropriate for routine cloud upload.
Design goals
- Route Algo2015 native diagnostics to cloud without changing algorithm dosing behavior.
- Avoid scraping Xcode/device console; use deterministic in-app capture.
- Keep default upload volume low (error threshold), with optional controlled escalation.
- Preserve enough trace detail to correlate with step telemetry for incident reconstruction.
Recommended capture strategy (phased)
AL1. Near-term (low-risk): file-tail capture from BP_LOG_*.txt
- Implement an app-side tailer that:
- discovers the current BP_LOG_*.txt for the active algorithm session
- persists a byte-offset checkpoint per log file
- reads only appended lines after each step / periodic flush
- Convert appended lines into structured log DTO entries with:
- subsystem = algo2015.native, category = bp_log
- raw_line, record_prefix (if present), step_hint (parsed from the line where available)
- Enqueue through the shared cloud log pipeline (app.log.batch).
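The byte-offset tail in AL1 can be sketched as below. This is an illustrative model over an in-memory stream; the real tailer reads BP_LOG_*.txt from the Documents directory and persists its checkpoint across relaunches.

```python
import io

def tail_appended_lines(stream, checkpoint_offset):
    """Return (new_lines, new_offset), reading only bytes appended since the checkpoint."""
    stream.seek(checkpoint_offset)
    data = stream.read()
    new_offset = checkpoint_offset + len(data)
    lines = data.split(b"\n")
    # Hold back a trailing partial line until the writer flushes its newline,
    # so the next pass re-reads it from the rolled-back offset.
    if lines and lines[-1] != b"":
        new_offset -= len(lines[-1])
    complete = [ln.decode("utf-8") for ln in lines[:-1]]
    return complete, new_offset
```

Persisting (fileName, new_offset) after each pass is what makes relaunch resume and duplicate suppression deterministic.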
AL2. Parsing and normalization
- Add a lightweight parser for known prefix records (A/B/C/D/...) and key-value summaries (STEP=, *OUT:).
- Populate parsed metadata fields for queryability:
- algo_step, cgm_value, bg_value, ireq, idel, etc., when parseable
- Preserve the raw line alongside the parsed map for forensic continuity.
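A minimal sketch of the AL2 normalization step: extract a record prefix and any key=value summary fields while always preserving the raw line. The prefix set mirrors the lists above, but treating the prefix as the line's first whitespace-delimited token, and the key=value shapes, are assumptions about the native format.

```python
import re

# Known record prefixes from the BP_LOG families listed above.
PREFIXES = {"A", "B", "C", "D", "I", "G", "~I", "~G", "AD", "P", "PA", "MB", "S"}

def parse_native_line(raw_line):
    """Normalize one native log line into a DTO-style dict, keeping the raw line."""
    entry = {"raw_line": raw_line, "metadata": {}}
    token = raw_line.split(" ", 1)[0]
    if token in PREFIXES:               # exact token match, so "~I" never collides with "I"
        entry["record_prefix"] = token
    for key, value in re.findall(r"(\w+)=(\S+)", raw_line):
        entry["metadata"][key.lower()] = value
    return entry
```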
AL3. Severity mapping
- Map native lines to cloud log levels:
- error: lines containing hard failures (e.g., FAILED WRITE, fatal execution exits)
- warning: warning/degraded-path markers (WARNING, NO-GO, algorithm non-execution notices)
- info/debug: routine per-step trace lines (A/B/C/D, STEP=, *OUT)
- Apply the existing threshold model:
- default upload: error
- include lower levels only when the local/remote threshold allows.
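The AL3 mapping can be sketched as a simple classifier. The marker strings come from the bullets above; treating them as substring/prefix checks is an assumption about how the native lines are shaped.

```python
def native_line_level(raw_line):
    """Map a native Algo2015 log line to a cloud log level per the AL3 rules."""
    if "FAILED WRITE" in raw_line:          # hard failure markers -> error
        return "error"
    if "WARNING" in raw_line or "NO-GO" in raw_line:  # degraded-path markers
        return "warning"
    if raw_line.startswith("STEP=") or raw_line.startswith("*OUT"):
        return "info"                        # per-step summary lines
    return "debug"                           # routine trace records (A/B/C/D, ...)
```

With the default error threshold, only the first branch uploads; the rest surface only under an escalated local or remote threshold.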
AL4. Correlation and volume controls
- Attach correlation IDs to each uploaded native line:
- session_id, flow_id?, subject_id, event_id
- algo_step where parsed
- Add guardrails:
- per-step max native lines uploaded
- batching and rate limits for info/debug
- drop-summary telemetry when limits are exceeded
AL5. Long-term optimization (optional)
- Replace/augment file-tail capture with bridge callback sink for direct line streaming from C++.
- Keep file-tail path as fallback for compatibility.
- Use callback path to reduce file I/O and simplify session/file-discovery logic.
Reliability Model (App)
Delivery guarantees
- Target: at-least-once delivery with idempotent replay.
- Event identity: event_id generated once at enqueue time.
- Ordering: preserve local event order using a monotonic sequence (sequence_no).
Outbox design
- Persistent outbox states: pending, inflight, acked, failed_permanent
- Outbox survives app relaunch.
- Safety-critical events (alert.*, loop.command.blocked, severe device faults) use priority flush.
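The outbox states imply a small state machine; the transition set below is inferred from this section's retry semantics (inflight events return to pending on transient failure) and is a sketch, not the implementation's exact model.

```python
# Allowed outbox transitions; terminal states have no outgoing edges.
TRANSITIONS = {
    "pending": {"inflight"},
    "inflight": {"acked", "pending", "failed_permanent"},  # pending = transient retry
    "acked": set(),
    "failed_permanent": set(),
}

def transition(state, new_state):
    """Apply a transition, rejecting anything outside the allowed set."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal outbox transition {state} -> {new_state}")
    return new_state
```

Making terminal states strict is what keeps relaunch-time recovery deterministic: a persisted acked or failed_permanent record can never re-enter the flush path.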
Flush triggers
- app foreground
- enqueue when network reachable
- periodic timer while active
- background task windows
Retry policy
- Exponential backoff + jitter for transient failures.
- 4xx schema/auth failures -> mark permanent, emit a telemetry.event.dropped summary event.
- No retry path may block local loop operation.
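The failure classification and backoff policy above can be sketched as follows; the base delay and cap are illustrative assumptions, and the non-429 4xx rule matches the permanent-failure classification described in the implemented baseline.

```python
import random

def is_permanent_failure(status_code):
    """Non-429 4xx responses are permanent; 429 and 5xx are transient and retryable."""
    return 400 <= status_code < 500 and status_code != 429

def backoff_seconds(attempt, base=2.0, cap=300.0, rng=random.random):
    """Exponential backoff with jitter: delay doubles per attempt, capped, then jittered."""
    delay = min(cap, base * (2 ** attempt))
    return delay * (0.5 + 0.5 * rng())  # jitter keeps retries in [0.5x, 1.0x] of the delay
```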
Backpressure and retention
- Queue cap with explicit drop policy:
- never drop active safety-critical alerts before lower-priority UX events.
- emit deterministic drop summaries.
Cloud Evolution Plan (BionicScout)
C1. Contract hardening
- Add a per-event schema registry keyed by event_type + schema_version.
- Reject unknown versions with explicit error details.
C2. Durable ingest
- Persist accepted events with an idempotency key (subject_id, event_id).
- Add replay-safe writes and DLQ workflows.
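A sketch of idempotent ingest keyed by (subject_id, event_id), as in C2. The in-memory dict stands in for the durable backend store; the class name is illustrative.

```python
class IdempotentStore:
    """Duplicate-suppressing event store keyed by (subject_id, event_id)."""

    def __init__(self):
        self._events = {}

    def persist(self, envelope):
        """Return True if stored, False if an at-least-once replay was ignored."""
        key = (envelope["subject_id"], envelope["event_id"])
        if key in self._events:
            return False
        self._events[key] = envelope
        return True
```

This is what makes the app's at-least-once delivery safe: a retried upload of the same envelope is acknowledged without creating a second timeline entry.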
C3. Routing and analytics
- Route to durable store and query path for timeline reconstruction.
- Keep retention/lifecycle classed by data criticality.
C4. Incident timeline support
- Query by subject_id, session_id, flow_id, and time window.
- Build “what happened” timeline from device state + loop action + user action + alert lifecycle.
Phased Implementation Checklist
T1. Contract and schema freeze
- [x] Freeze event-type catalog and payload contracts.
- [x] Publish JSON examples for each event family.
- [x] Align app/cloud enum constants and naming rules.
T2. App telemetry SDK boundary
- [x] Define a single app telemetry emitter interface and DTO model.
- [x] Define source adapters for runtime, CGM, pump, alerts, and UI features.
- [x] Add field-level redaction/allowlist rules.
T3. Outbox and transport reliability
- [x] Implement persistent outbox with sequence/idempotency metadata.
- [x] Implement flush worker (batch, retry, backoff, priority path).
- [x] Add transport health metrics and drop accounting events.
T4. Source instrumentation
- [x] Wire loop/runtime/algorithm events.
- [x] Wire CGM and pump state/command events.
- [x] Wire full alert lifecycle events.
- [x] Wire critical UI interactions for flow storytelling.
T5. Cloud-side contract enforcement
- [x] Publish strict J5 backend contract packet (validation order, canonical error codes, idempotency conflict semantics, contract-test matrix) in TelemetryCloudContractForBionicScout.md.
- [ ] Implement per-event schema validation in the ingest service.
- [ ] Implement idempotent durable persistence.
- [ ] Add DLQ triage/replay tooling.
T6. Story reconstruction workflows
- [ ] Define incident timeline query API/report format.
- [x] Add support playbook examples for “subject stuck” scenarios.
- [ ] Validate timeline completeness against known failure drills.
Minimum “subject stuck” timeline completeness criteria (app + backend contract):
- At least one app.lifecycle.* event in the episode window with session_id and app_env.
- Full ui.critical.* chain for the blocked workflow (tap -> optional state_viewed -> submit/cancel/blocked) with stable flow_id.
- Corresponding loop/pump/cgm state transitions (loop.step.*, pump.status.refreshed, cgm.connection.changed, cgm.state.changed) within the same time window.
- Alert lifecycle continuity for involved alerts (alert.issued and alert.retracted or alert.acknowledged) with stable dedupe_key.
- If a command was blocked, include a machine reason (reason) and related context (linked_alert_code when applicable).
- If a step executed, timeline consumers must prioritize payload.step_executed_at over envelope created_at for algorithm-time reconstruction.
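The execution-time rule in the last criterion can be sketched as a timestamp selector for timeline consumers; the function name is illustrative.

```python
def timeline_timestamp(event):
    """Prefer payload.step_executed_at over envelope created_at for loop-step events."""
    if event.get("event_type", "").startswith("loop.step."):
        executed_at = event.get("payload", {}).get("step_executed_at")
        if executed_at is not None:
            return executed_at
    return event["created_at"]
```

This matters because the envelope created_at reflects enqueue time, which can lag the algorithm's actual execution time when events flush from the outbox after a delay.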
T7. Verification and evidence
- [ ] Add unit tests for event encoding/redaction/outbox state machine.
- [ ] Add integration tests for auth scope and ingest responses.
- [ ] Add end-to-end replay tests for incident storytelling completeness.
- [ ] Publish STR-CLOUD-* evidence artifacts for traceability.
T8. CloudWatch severity-filtered logging rollout
- [x] Implement app log DTO schema and cloud-upload logger sink.
- [x] Implement threshold filter logic (selected + above) with default error.
- [x] Add settings UI for local threshold selection and persistence.
- [ ] Add remote override policy fetch/apply/expiry behavior with audit telemetry.
- [ ] Add cloud validation and a CloudWatch write path for app.log.batch.
- [ ] Add tests for threshold logic, remote override precedence, TTL expiry, and redaction.
T9. Algo2015 native log routing
- [x] Implement BP_LOG tailer with persistent offsets and session-file discovery.
- [x] Parse and normalize native record prefixes/key-value summaries into structured metadata.
- [x] Map native lines to severity levels and apply threshold filtering defaults.
- [x] Correlate native log uploads with step/session IDs and existing loop telemetry.
- [x] Add tests for parser correctness, offset resume behavior, and duplicate suppression on relaunch.
- [ ] Add load/volume tests to ensure native trace uploads do not starve higher-priority telemetry.
Verification Strategy (Planned)
- Deterministic fixture-driven tests for event payload generation.
- Fault-injection tests for network loss, auth expiry, and 4xx/5xx ingest responses.
- Replay test that proves a single timeline can answer:
- what devices reported
- what loop executed/skipped/applied/blocked
- what alerts were shown/acknowledged/cleared
- what user actions occurred on critical controls
Open Decisions Requiring Team Input
- Final retention policy for event classes and timeline history depth.
- Priority/severity matrix for notification fan-out.
- Dashboard audiences and role-scoped query views for first release.
- Exact cloud persistence/query stack sequence in BionicScout (minimal now vs. full architecture rollout).
- Remote override policy governance:
- who can issue/approve override
- max allowed override duration
- whether debug requires additional approval in production
- Whether to keep long-term native capture on the file-tail path or invest in bridge-callback streaming in phase 2.
Related Documents
- /Users/jcostik/BionicLoop/Docs/Planning/ExecutionPlan.md (Workstream J)
- /Users/jcostik/BionicLoop/Docs/Planning/DevChangePlan.md (implementation touchpoints)
- /Users/jcostik/BionicScout/Docs/Planning/CloudBackendRequirements.md
- /Users/jcostik/BionicScout/Docs/Planning/CloudBackendExecutionStatus-2026-02-19-TelemetryIngest.md