# Full Regression Test Plan

> **Status:** Planning. No tests implemented yet — see [§4 Phased implementation](#4-phased-implementation).
>
> **Authoritative source:** issues #114–#120 own the per-phase contracts. This document has been reconciled against them as of 2026-05-22 (#114 deliverable 9); where future review iterations on those issues add detail, the issue text wins.
>
> **Companion docs:** [plan-smoke-tests.md](plan-smoke-tests.md) (existing PR-gated smoke suite), [specs.md](specs.md) (ACs the regression suite covers), [TESTING_SETUP.md](../TESTING_SETUP.md) (GMD / Robolectric infrastructure).

## 1. Goals & principles

- **Catch real-device bugs before release** — SAF, camera, FileProvider, billing, lifecycle, perf. Things JVM/Robolectric structurally can't catch.
- **Functional regression, not duplication** — don't re-test what `*Test.kt` / `*IT.kt` / `*ScreenshotTest.kt` already cover deterministically on JVM. The regression suite owns *behavior that depends on the real OS*.
- **Independent tests, deterministic setup** — every scenario resets DataStore + vault state (extend existing `resetYorvanaState` / `installFreshVault`). No accumulated state across tests.
- **Annotation-tiered, single source tree** — keep tests in `app/src/androidTest`, add custom marker annotations so PR-gated vs nightly is an AndroidJUnitRunner `annotation` filter, not a separate codebase.
- **Flake budget: zero on PR-gated tier, ≤1% on nightly** — quarantine, don't ignore. Flaky tests get a `@Quarantined` marker annotation; PR-gated and core regression tasks pass `notAnnotation=com.yorvana.testsupport.tiers.Quarantined` so these tests are excluded from gating, while the nightly `regressionFullCheck` task does *not* set `notAnnotation` so they still execute and report.

## 2. Technical approach

Stay on the existing stack — `AndroidJUnit4` + `createAndroidComposeRule()` + GMD Pixel 2 API 33 — and add the missing pieces:

| Concern | Tool | Where it lives |
|---|---|---|
| Real SAF picker, real file picker, real camera | **UiAutomator** (`UiDevice`) | `androidTest/.../regression/system/` |
| Process death | `ActivityScenario.recreate()` + `Instrumentation.callActivityOn{SaveInstanceState,Pause}` + kill via `am` | `regression/lifecycle/` |
| Configuration changes | `composeTestRule.activityRule.scenario.recreate()` + `Configuration` overrides | `regression/lifecycle/` |
| Google Play Billing sandbox | Test APK built from a non-debug build type signed with the same upload key as the Play internal track (the app module currently defaults instrumentation to `testBuildType = "debug"` per `app/build.gradle.kts:68`, which produces a debug-signed APK Play won't recognise as the internal-track artifact) + license tester accounts on a **Play-enabled device** (Play Store system image — `google_apis_playstore`, not the AOSP-only GMD pool; see [§5 Billing sandbox device path](#5-billing-sandbox-device-path)); gated by the **multi-precondition `Assume.assumeTrue` skip guard** from #118 (`BILLING_SANDBOX_ACCOUNT`, signing material, selected device path, signed-in Play account, and a `BillingClient.queryProductDetailsAsync(premium_lifetime)` non-empty result) so any missing prerequisite skips cleanly with a specific assumption message — *not* a narrow `BILLING_SANDBOX_ACCOUNT`-only gate | `regression/billing/` |
| Performance NFRs | **Macrobenchmark** (`androidx.benchmark`) in a separate module, running on its own dedicated GMD entry registered inside `:macrobenchmark` (a Pixel 2 API 33 entry matching the app's GMD, generating a task like `pixel2api33BenchmarkAndroidTest`). Not the app's smoke `pixel2api33` device, and not `connected*AndroidTest` (which would target ad-hoc connected devices, which CI doesn't have). Numbers from a swiftshader/angle emulator are advisory regression signal, not authoritative perf measurements. | new `:macrobenchmark` module |
| Fixture data | Bundle only a **minimal** asset set (one small PDF in Phase 1; JPG follows in Phase 2 if needed) in `app/src/androidTest/assets/`. Per #114 deliverable 6, large datasets (20 vehicles, 200 records, attachment-heavy vaults) are **generated in-process** by Phase 2/4/6 helpers rather than checked in as snapshots, so the repo doesn't carry binary asset bloat. | bundled in `app/src/androidTest/assets/` and copied to a known device path in `@Before` (matches smoke plan's recommendation; self-contained on GMD, no `adb` round-trip) |

### Test tiers (custom marker annotations + AndroidJUnitRunner filters)

The project is on JUnit 4 with `AndroidJUnitRunner` (see `app/build.gradle.kts`), which does **not** support a `tag` filter — it filters via `annotation` / `notAnnotation` / `package` / `class` arguments. Define a small set of runtime-retained marker annotations in `app/src/androidTest/.../testsupport/tiers/` (`Smoke`, `Regression`, `RegressionFull`, `BillingSandbox`, `Quarantined`) and apply them at the test-method or class level. All five tier markers carry `@java.lang.annotation.Inherited` so AndroidJUnitRunner's `Class.getAnnotation()` filter and the Phase 1 annotation-misuse scanner walk the same scope at superclass-class level. `@Inherited` only affects class-level annotation queries — method-level annotations are never inherited in Java/Kotlin regardless of the meta-annotation, so the scanner does not invent inheritance the runtime filter cannot honor.
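The class-level versus method-level distinction above can be checked with plain JVM reflection. The following is a minimal sketch, not the project's actual marker definitions: `BaseVaultTest`, `ChildVaultTest`, and the two helper functions are hypothetical names introduced for illustration, and only two of the five markers are declared.

```kotlin
import java.lang.annotation.Inherited

// Hypothetical stand-ins for two tier markers; the real ones are planned
// for com.yorvana.testsupport.tiers and may carry extra arguments later.
@Inherited
@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.CLASS, AnnotationTarget.FUNCTION)
annotation class Regression

@Inherited
@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.CLASS, AnnotationTarget.FUNCTION)
annotation class Smoke

// Illustrative test classes (assumed names, not from the codebase).
@Regression
open class BaseVaultTest {
    @Smoke
    open fun opensVault() {}
}

class ChildVaultTest : BaseVaultTest() {
    // The override does not carry @Smoke: method annotations are not inherited.
    override fun opensVault() {}
}

// Class-level query: honors @Inherited and walks the superclass chain.
fun <A : Annotation> classHasTier(cls: Class<*>, tier: Class<A>): Boolean =
    cls.getAnnotation(tier) != null

// Method-level query: only sees annotations declared on the resolved method.
fun <A : Annotation> methodHasTier(cls: Class<*>, method: String, tier: Class<A>): Boolean =
    cls.getMethod(method).getAnnotation(tier) != null

fun main() {
    println(classHasTier(ChildVaultTest::class.java, Regression::class.java))
    println(methodHasTier(ChildVaultTest::class.java, "opensVault", Smoke::class.java))
    println(methodHasTier(BaseVaultTest::class.java, "opensVault", Smoke::class.java))
}
```

Under these assumptions the subclass reports the class-level `Regression` marker, while its overriding method does not report `Smoke`. A scanner that resolves method, then class, then superclass class would therefore match what a class-level runtime query sees.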
| Tier | Marker | Runs | Budget |
|---|---|---|---|
| Smoke | `@Smoke` *(existing scenarios re-annotated)* | PR (label-gated, unchanged) | <2 min |
| Regression — core | `@Regression` | PR (label-gated, new) + nightly | ~15 min |
| Regression — full | `@RegressionFull` | Nightly only | ~45–60 min |
| Regression — sandbox | `@BillingSandbox` | Nightly with creds | adds ~10 min |
| Macrobenchmark | separate module | Nightly | ~15 min |

New Gradle tasks (marker classes live in `com.yorvana.testsupport.tiers.*`; AndroidJUnitRunner's `annotation` / `notAnnotation` arguments need the fully-qualified class name, so the `-P...` examples below are spelled in full for copy-paste fidelity):

- **`regressionCheck`** — PR/core tier. Single `pixel2api33DebugAndroidTest` invocation with `-Pandroid.testInstrumentationRunnerArguments.annotation=com.yorvana.testsupport.tiers.Regression -Pandroid.testInstrumentationRunnerArguments.notAnnotation=com.yorvana.testsupport.tiers.Quarantined`. Excludes `RegressionFull` and `BillingSandbox` (neither carries `@Regression`).
- **`regressionFullCheck`** — Nightly. Aggregator task that depends on the sibling tasks below, plus eventually the macrobenchmark module. It does *not* `dependsOn(regressionCheck)`, because that task's `notAnnotation=...Quarantined` filter would otherwise hide quarantined core tests from nightly too (contradicting §1's quarantine policy). **Phase 1 (#114) wires only the first two siblings**. Phase 5 (#118) chose the external-device billing path, so the sandbox tier runs in `billing-sandbox.yml` rather than under `regressionFullCheck`; the `:macrobenchmark` `dependsOn` is deferred to **Phase 6 (#119)**.
  - Mechanism: because `-Pandroid.testInstrumentationRunnerArguments.*` is build-wide (set on the root `Project` before AGP reads it) and the same GMD task can only execute once per Gradle invocation, the three filter passes cannot be expressed by re-running one task with different `-P` values inside a single build. Instead each pass is its own task. Two viable shapes; Phase 1 will pick one:
    - **(a) `GradleBuild` child invocations** — register three `GradleBuild` tasks under `regressionFullCheck`, each setting `startParameter.projectProperties` for its own runner-argument set and `tasks = ["pixel2api33DebugAndroidTest"]` (or the Play device task for sandbox). Each child invocation gets a fresh Gradle build, so `-P` values do not collide.
    - **(b) Variant- or device-scoped tasks** — declare extra GMD device entries (or a custom test-only variant) whose `instrumentationRunnerArguments` are set in `build.gradle.kts` rather than at the CLI, so each generated `*DebugAndroidTest` task carries its own baked-in filter. `regressionFullCheck` then plain-`dependsOn`s those tasks.
  - Logical tiers wired in (under whichever shape is chosen):
    - `regressionFullCoreCheck` — `pixel2api33DebugAndroidTest` with `annotation=com.yorvana.testsupport.tiers.Regression` and **no** `notAnnotation`, so flaky `@Regression @Quarantined` tests still execute and report.
    - `regressionFullExtendedCheck` — `pixel2api33DebugAndroidTest` with `annotation=com.yorvana.testsupport.tiers.RegressionFull`.
    - `regressionFullSandboxCheck` — runs `@BillingSandbox` against the Play-enabled device path from §5, using a test APK built from a non-debug build type signed with the same upload key as the Play internal track (e.g. a `staging` `testBuildType` distinct from `debug`/`release` and outside the screenshot-build branch in `app/build.gradle.kts:43`). The concrete task is therefore something like `pixel2api33PlayStagingAndroidTest`, *not* `pixel2api33PlayDebugAndroidTest` — a debug-signed APK is the wrong artifact for real Play Billing. Annotation filter: `annotation=com.yorvana.testsupport.tiers.BillingSandbox`; tier selection is in-build only — runtime skip is owned by the **multi-precondition `Assume.assumeTrue` skip guard** from #118 (`BILLING_SANDBOX_ACCOUNT`, signing material, selected device path, signed-in Play account, and a non-empty `BillingClient.queryProductDetailsAsync(premium_lifetime)` result), *not* a narrow `BILLING_SANDBOX_ACCOUNT`-only gate. If §5 picks the external-workflow option instead, this aggregator entry is omitted (the sandbox tier runs in `billing-sandbox.yml`, not under `regressionFullCheck`).
  - None of the three set `notAnnotation`. The `:macrobenchmark` module's GMD-generated benchmark task (e.g. `:macrobenchmark:pixel2api33BenchmarkAndroidTest`, from the dedicated GMD entry registered in `:macrobenchmark` per Phase 6 and the §2 table) is wired as an additional `dependsOn` to keep "core + full + sandbox + benchmark" in one entry point. It is deliberately *not* `:macrobenchmark:connectedBenchmarkAndroidTest` — `connected*AndroidTest` targets ad-hoc currently-connected devices that CI doesn't have, so the nightly aggregate could otherwise pass without ever running the benchmark tier. (Separate tiers rather than a comma list because AndroidJUnitRunner's `annotation` argument takes a single class, and the sandbox tier needs a different device target anyway.)
- **Smoke** — the existing PR-gated path. Today the `pixel2api33` device's `instrumentationRunnerArguments` in `app/build.gradle.kts` only sets `notPackage=com.yorvana.screenshots`, and `.github/workflows/smoke.yml` runs plain `./gradlew pixel2api33Check`. Once regression tests start landing in the same `app/src/androidTest` tree, that filter would sweep them into smoke and blow the <2 min budget. **Phase 1 (#114) isolates smoke via a positive filter** — `annotation=com.yorvana.testsupport.tiers.Smoke` is added to the smoke device's `instrumentationRunnerArguments` alongside the existing `notPackage=com.yorvana.screenshots`, and `@Smoke` annotation of every existing smoke test is a hard Phase 1 prerequisite. The negative-filter alternative (`notAnnotation=Regression,...`) was rejected because any future unannotated `androidTest` class would default into smoke and silently blow the budget. The change lives in `app/build.gradle.kts` (smoke device's runner args) so `smoke.yml`'s `./gradlew pixel2api33Check` invocation stays unchanged. The runner-args injection must be **scoped by exact device/task-name match** (not `contains("pixel2api33")`) so a future `pixel2api33PlayStagingDebugAndroidTest` introduced in Phase 5 does **not** inherit the smoke `annotation=Smoke` filter.

## 3. Test plan — coverage map

Mapped to [specs.md](specs.md) ACs. The principle: every AC has either a JVM test (existing) **or** a regression scenario, with regression earning its place by needing real OS behavior.

### 3.1 Vehicle CRUD (FR-V1..V4)

- **R-V01** — Add vehicle with all optional fields filled, kill process, relaunch → vehicle present with all fields preserved on disk (vault JSON readable).
- **R-V02** — Duplicate VIN guard: add vehicle with VIN, try to add another with same VIN → blocked with error.
- **R-V03** — Field length / range validation negative paths: nickname >50 chars, year=1899, year=2101 → save blocked, error visible.
- **R-V04** — Edit vehicle, rotate device mid-edit → form state preserved (rememberSaveable).
- **R-V05** — Delete vehicle with 50+ records + 20 attachments → all files removed from the **file-backed vault tree** (assertions read the `installFreshVault` root directly, not `DocumentFile`). Real SAF tree-URI coverage of this scenario is deferred to Phase 3 (#116).
- **R-V06** — External vault edit: pre-seed the **file-backed vault tree** with an extra vehicle file directly (no `DocumentFile`), relaunch app → vehicle appears in list (catches cache-staleness bugs). Real SAF tree-URI coverage is deferred to Phase 3 (#116).
- **R-V07** — VIN copy-to-clipboard: trigger the copy action on a vehicle with a VIN, read back via `ClipboardManager.primaryClip` and assert the text matches (real OS clipboard, not a mocked manager).

### 3.2 Records (FR-R1..R4)

- **R-R01** — Add record with all fields, kill app mid-form → on relaunch, no partial record (transactional save).
- **R-R02** — Odometer/cost boundary validation (0, max, max+1, decimals).
- **R-R03** — 200-record list **correctness only**: scroll reaches first and last records, item order is stable, no crash or state loss. Frame-timing / jank measurement belongs to **M-03** (Phase 6 #119) — no `FrameTimingMetric` / Macrobenchmark code lands in Phase 2.
- **R-R04** — Delete record → confirm attachment files are gone from disk.

### 3.3 Attachments (FR-A1, FR-A2) — *the big real-OS bucket*

The app uses `ActivityResultContracts.TakePicture()` with a `FileProvider` output URI and **does not declare or request the `CAMERA` runtime permission**. Camera scenarios here therefore do not assert a runtime camera permission dialog and do not run `pm revoke android.permission.CAMERA`; the capture failure path is driven by cancelled/failed capture, not permission denial. External-viewer assertions follow a "resolver-presence gate": assert `ACTION_VIEW` launches; only assert external viewer UI when `packageManager.resolveActivity(...) != null` on the GMD image.
- **R-A01** — Real SAF file picker via UiAutomator: pick a PDF (bundled in `androidTest/assets` and copied into `MediaStore` Downloads by the fixture `TestRule` under a unique display name per scenario, cleaned up in `@After`), save record, kill app, relaunch → attachment persists, URI resolves, `ACTION_VIEW` fires; assert external UI only if a handler resolves.
- **R-A02** — Real camera capture via UiAutomator using the real `TakePicture()` + `FileProvider` flow, photo persists as JPG in vault. No `CAMERA` permission dialog assertion.
- **R-A03** — Image viewer gestures (pinch zoom, double-tap zoom toggle, pan with bounds, rotation-state contract). **Owned by Phase 4 (#117)** under a UI-gestures sub-section — gestures don't depend on real OS surfaces, and Phase 4 ships the production `IMAGE_VIEWER_SURFACE` / `IMAGE_VIEWER_IMAGE` test tags and the user-observable assertion stance (no reads of private viewer scale/offset state).
- **R-A04** — Non-image attachment → system viewer launches. Same resolver-presence gate as R-A01.
- **R-A05** — Remove attachment with confirmation, save → file is deleted from disk (via `DocumentFile`).
- **R-A06** — Vault folder change mid-session, attachments still resolvable. Scoped to attachment-URI resolution; the move-vs-start-fresh contract is R-S02's responsibility — both scenarios ship.
- **R-A07** — **Cancelled / failed capture recovery** (replaces the previous camera-permission-denial framing, which was invalid against the app's no-`CAMERA`-permission contract): trigger the capture flow, cancel the system camera (or simulate `RESULT_CANCELED`) → app surfaces a clear non-blocking error state, no partial JPG left in the vault, retry succeeds.

### 3.4 Vault / SAF (FR-D1)

- **R-D01** — Fresh setup: drive real SAF folder picker via UiAutomator, pick a folder, complete setup.
- **R-D02** — Pre-existing vault: point at a folder with sample data → vehicles + records load.
- **R-D03** — Vault permission revoked externally (clear persistable URI perms) → app shows recoverable error state, not crash.
- **R-D04** — Malformed `vehicle.json` in vault → app skips that vehicle and logs, doesn't crash the list.
- **R-D05** — Migration: pre-seed an old-format vault, verify `MigrationHelper` upgrades it. **Caveat:** `MigrationHelper` today only passes v1 through and logs unknown versions, so R-D05 only has meaningful content once a concrete legacy schema fixture (`regression-fixtures/legacy-vault-v0/`) and an "expected upgraded shape" reference are landed. Per #116, Phase 3 either ships those fixtures or removes R-D05 from that phase and reopens when a real legacy schema is in flight.

### 3.5 Categories (FR-C1, FR-C2)

- **R-C01** — Default categories present on fresh install.
- **R-C02** — Add custom category → appears in selector across vehicles → persists across app kill.
- **R-C03** — Delete custom category in use by records → confirm prompt mentions affected record count.

### 3.6 Settings (FR-S1..S3)

- **R-S01** — Toggle odometer unit → existing records re-render with new unit (already in smoke partially; add the "with records present" path).
- **R-S02** — Change vault folder via Settings against the **real SAF folder picker**. Three asserted paths per ADR-005 (full contract owned by Phase 3 #116):
  1. **Move path** — pre-seed source vault (1 vehicle + 1 record + 1 attachment), pick new tree URI, choose **Move** → progress indicator shown, byte-identical copy verified at destination, source tree deleted only after verification, active vault pointer flips, reopen boots into new tree.
  2. **Start-fresh path** — same starting state, choose **Start fresh** → active vault pointer flips, source tree left untouched byte-for-byte, new tree starts empty, reopen boots into empty new tree.
  3. **Abort/failure path** — induce a write failure **after copy has begun** (e.g. ENOSPC via `sm set-virtual-disk`) → abort, source tree intact byte-for-byte, active vault pointer unchanged, UI surfaces a clear error. Pre-conditions that block the move from starting (destination-not-empty) are out of scope — already covered by `VaultStorageImplIT.moveVaultTo_should_fail_when_destination_is_not_empty`.
- **R-S03** — Crash reporting opt-in persists across kill + restart (extension of existing `s21`).
- **R-S04** — Custom currency persistence: set a custom (non-list) currency code via `CustomCurrencyDialog`, kill app, relaunch → code is rendered on existing record cost lines and used as the default for new records (exercises DataStore round-trip for a user-supplied string).

### 3.7 Paywall (FR-P1..P6) — sandbox-dependent

- **R-P01** — Free tier, attempt 2nd vehicle → upgrade dialog. **Already covered** by `SmokeTest.upgradeDialogOnFreeTier` ("S18", `app/src/androidTest/java/com/yorvana/SmokeTest.kt:249`). No new regression test in any phase; if a kill+restore variant is wanted, it ships as a follow-up.
- **R-P02** — Read-only mode on free tier across screens. **Already covered** by `SmokeTest.readOnlyBannerAcrossScreens` ("S19", `SmokeTest.kt:271`). No new regression test in any phase.
- **R-P03** — *Sandbox*: complete real test purchase via Play Billing. Full ADR-012 contract asserted by Phase 5 (#118): paywall sheet → Play Billing dialog → license-tester purchase → `isPremium=true` in `AppPreferences`, sheet dismisses, **app auto-navigates to Add Vehicle** when the purchase was launched from the at-cap FAB (pending-add-navigation flag survives the billing round-trip). Asserting only "dialog dismisses" is insufficient. `premium_lifetime` is non-consumable, so Phase 5 must pick an entitlement-repeatability strategy (disposable tester accounts / manual reset / split contract) before this test ships nightly.
- **R-P04** — *Sandbox*: restore purchases on a fresh install with the same Google account. "Fresh install" defined explicitly (default: `pm clear com.yorvana`); asserts `false → restore → true → Settings re-renders` transition. Idempotent against the owned product — the nightly-repeatable sandbox check.
- **R-P05** — **Offline cached premium** (not sandbox-dependent — owned by Phase 2 #115 as `@Regression`). Invariant at the offline assertion block: `DebugBillingOverrideMode == NONE` **and** `billingDerivedPremium == null`, so `isPremium` resolves only through `cachedPremium` per `billingPremium ?: cachedPremium ?: false` (`BillingManagerImpl.kt:94` / `:102`). Two seeding paths in `@Before`:
  1. **Preferred** — seed the cache via the existing production API `AppPreferencesStore.setIsPremiumCached(true)` (`AppPreferences.kt:27` / `AppPreferencesImpl.kt:70`); no test-only `AppPreferences` surface introduced and `billingDerivedPremium` is never touched.
  2. **Alternative** — `DebugBillingActions.simulatePurchaseSuccess()`, but then the test **must** kill-and-relaunch (or otherwise recreate `BillingManagerImpl`) before going offline so `billingDerivedPremium` resets to `null`. Same-process offline assertion after `simulatePurchaseSuccess()` without a recreate is explicitly forbidden — it would pass via `billingPremium`, not `cachedPremium`.

  Offline mechanism is **device-level** (`svc wifi disable` + `svc data disable`, or airplane mode if reliable on the GMD image) — no app-level fake connectivity. `@After` restores connectivity unconditionally. Skip with `Assume.assumeTrue` if the GMD image disallows the chosen control.

### 3.8 Lifecycle & robustness (no specific AC, but bug-rich)

- **R-L01** — Rotate device on every screen (parametrized). **Persistent state contract** (asserted): any field in `rememberSaveable` or a ViewModel `StateFlow` survives rotation — form drafts (vehicle edit, record edit) included. **Transient state contract** (out of scope unless lifted to saveable in the same PR): plain `remember` state (open menus, snackbars, transient dialogs).
  Per-screen test KDoc documents which class each surface falls under.
- **R-L02** — Process death on every screen. Two explicit helpers (no implicit mixing):
  - `ProcessDeath.recreate(activityRule)` → `activityRule.scenario.recreate()` (state-preservation use, restores from `onSaveInstanceState` bundle only).
  - `ProcessDeath.kill()` → `UiDevice.pressHome()` → `am kill com.yorvana`, relaunch via launcher intent. **Do not use `force-stop`** — different launcher/task semantics produce a cold start that masks navigation-stack restore bugs.

  Default contract for this phase: **route restoration + persisted data only**. Unsaved form drafts are covered by R-V01 / R-R01, not here.
- **R-L03** — Low-memory simulation (`am send-trim-memory`) → no crash.
- **R-L04** — Back-press exhaustively from deep screens.
- **R-L05** — **Launcher re-entry** while on a deep screen (renamed from "deep-link / intent re-entry"). The app does not declare any app-deep-link intent filter today, so this scenario asserts task-affinity / `singleTask` behavior: launching from the launcher with an existing task on a deep screen returns to that screen. Real deep-link routing is out of scope and requires manifest + `MainActivity` work in a separate issue.
- **R-L06** — Disk full on save. Assertion split by vault backend because the two paths give different guarantees:
  - **File-backed (`VaultNode.Local`)** — production prerequisite: same-directory `vehicle.json.tmp` + `fsync` + `java.nio.file.Files.move(..., ATOMIC_MOVE, REPLACE_EXISTING)` (minSdk 26). Any `File.renameTo()` fallback must check the boolean return and treat `false` as a failed save. Test asserts the existing JSON is **byte-identical** to its pre-write contents after an induced ENOSPC.
  - **SAF (`VaultNode.Remote`)** — production prerequisite: documented staged-write protocol (write `.tmp` via SAF, verify length + checksum, promote via delete-existing + rename, orphan-`.tmp` recovery on next open). `DocumentFile.renameTo()` is **not** a guaranteed atomic primitive across SAF providers, so the test asserts a **narrower** contract: clear error, active vault pointer unchanged, no orphan `.tmp` past the next open, retry-after-free-space succeeds. The byte-identical-pre-write assertion does **not** apply to the SAF path.

  If either production change is descoped, the corresponding backend narrows to "clear error, no crash, recoverable retry" in this PR's test KDoc (no silent assertion drop).
- **R-L07** — Concurrent external edit. Scoped to **backgrounded edit + resume**: background via `UiDevice.pressHome()`, mutate `vehicle.json` (or `records/{uuid}.json`) externally via `DocumentFile`, resume → list/detail UI reflects the new value without crash. Requires a production-side refresh-on-resume hook (e.g. `onStart` repository invalidate); if neither that nor pull-to-refresh exists, Phase 4 either lands the hook or narrows R-L07 to cold-start. Foreground-while-running mutation (`FileObserver` against a SAF tree) is out of scope this phase.

### 3.9 NFRs (NFR-1, NFR-6) — Macrobenchmark module

- **M-01** — Cold start (`StartupTimingMetric`): assert P50 ≤ 2000ms.
- **M-02** — Vehicle list scroll FPS with 20 vehicles (`FrameTimingMetric`).
- **M-03** — Record list scroll FPS with 200 records.
- **M-04** — Navigation transition duration garage→detail→record. Either `TraceSectionMetric` against production `androidx.tracing` sections (`nav.garage_to_detail`, `nav.detail_to_record`) whose **span covers the destination composition + transition animation frames, not just the synchronous nav event** (wrapping only the click handler / back-stack mutation would measure command latency, not perceived transition), **or** a `FrameTimingMetric` over a scripted nav journey if a reliable transition-work span can't be established. Phase 6 (#119) picks one and documents the choice in the module README.

## 4. Phased implementation

### Phase 1 — Infrastructure (1–2 days) — issue **#114**

- Add custom marker annotations (`@Smoke`, `@Regression`, `@RegressionFull`, `@BillingSandbox`, `@Quarantined`) under `testsupport/tiers/`, **all five carrying `@java.lang.annotation.Inherited`** so the runtime filter and the scanner walk superclass-class annotations the same way.
- Wire `regressionCheck` and `regressionFullCheck` Gradle tasks per §2 using a **single chosen mechanism** (either `GradleBuild` child invocations with `startParameter.projectProperties`, or variant/device-scoped tasks with baked-in `instrumentationRunnerArguments`) — same shape for both, documented in the PR description. `regressionFullCheck` in Phase 1 wires **two siblings** (`regressionFullCoreCheck`, `regressionFullExtendedCheck`); `regressionFullSandboxCheck` is deferred to Phase 5 and the `:macrobenchmark` dependency to Phase 6.
- **Smoke isolation via positive filter**: update the `pixel2api33` device's `instrumentationRunnerArguments` to add `annotation=com.yorvana.testsupport.tiers.Smoke` alongside `notPackage=com.yorvana.screenshots`; annotate every existing smoke test with `@Smoke` (hard prerequisite). The injection must be scoped by exact device/task-name match so a future `pixel2api33PlayStagingDebugAndroidTest` from Phase 5 does not inherit the smoke filter. `.github/workflows/smoke.yml` keeps invoking `./gradlew pixel2api33Check` unchanged.
- Create only the `testsupport/tiers/` package under `androidTest/` for Phase 1 — feature subpackages (`vehicles/`, `records/`, …) are created lazily by the phase adding the first scenario in each area.
- Add UiAutomator helpers in `testsupport/`: `SystemPicker.pickFile()`, `SystemPicker.pickFolder()`, `Camera.captureAndAccept()`, `Permissions.acceptIfShown()` (stubs acceptable, API surface settled).
- Bundle a **minimal** fixture set (one small PDF in Phase 1; JPG follows in Phase 2 if needed) in `app/src/androidTest/assets/` and the `@Before` / `TestRule` helper that copies it to the device-side location (Downloads via `MediaStore` for picker-visible files, app cache for app-internal reads). No large-vault snapshots — later phases generate large datasets in-process.
- Add the **annotation-misuse guardrail** (source-level scanner under `src/test` OR Gradle bytecode task scanning `compileDebugAndroidTestKotlin` output via ASM). Lookup model: method-level → class-level → superclass class-level (the inheritance step is valid only because of `@Inherited`). Quarantine-pairing rule, forward-compatibility with annotation arguments (`@Quarantined(since,issue,reason)`, `@ScenarioId`), parameterized-test handling, and four negative-test fixtures (zero tiers, two tiers, orphan quarantine, inherited-tier sanity that pins scanner / runtime-filter agreement).
- Add the nightly workflow pair: `regression.yml` (`workflow_call` + `workflow_dispatch` + `schedule`, runs `./gradlew regressionCheck` on label/manual/reusable paths and `regressionFullCheck` on the cron path, mirrors `smoke.yml`'s KVM setup, forces `swiftshader_indirect`, uploads managed-device reports) and `regression-label.yml` (`pull_request: [labeled]`, gated on `regression` label).

### Phase 2 — Vehicle + Record + Categories + Settings regression (3–4 days) — issue **#115**

- Write R-V01..07, R-R01..04, R-C01..03, R-S01 / R-S03 / R-S04, plus the new `@Regression` scenario **R-P05** (offline cached premium; not sandbox-dependent — see §3.7). R-S02 is **deferred to Phase 3** because it needs the real SAF picker.
- R-P01 and R-P02 are **not** re-implemented — they are explicitly attributed to existing smoke tests `SmokeTest.upgradeDialogOnFreeTier` and `SmokeTest.readOnlyBannerAcrossScreens`.
- Shared helpers in `testsupport/`: `killAndRelaunch()`, `rotateMidEdit(composeRule)`, repository-layer seeding helpers (no UI-driven 50-record fixtures), `Network.setOffline()` / `Network.setOnline()` (device-level `svc wifi disable` / `svc data disable` for R-P05; `@After`-restore unconditional; `Assume.assumeTrue` skip if the GMD image disallows the control).
- These have the best bug-catch-per-engineering-hour ratio and surface a lot of edge cases (validation, boundaries, vault round-trip) that aren't currently tested on real OS.

### Phase 3 — Real SAF + file/camera + R-S02 (3–5 days) — issue **#116**

- Build the UiAutomator + real-SAF helpers; write R-D01..05, R-S02, R-A01/02/04/05/06/07. R-A03 is **routed to Phase 4** (UI gestures don't need real OS surfaces).
- New `testsupport/` primitives: `RealSafVault.create()` / `useExisting(treeUri)` (real tree URI with persistable permission, seeded via `DocumentFile`), `Camera.captureAndAccept()` (no permission dialog — app does not declare `CAMERA`), `ExternalViewer.assertLaunchOrSkipUi(intent)` (resolver-presence gate for R-A01 / R-A04). `installFreshVault()` stays file-backed and is not used for this phase's real-SAF scenarios.
- Fixture acceptance: tiny PDF asset (<20 KB, valid `%PDF-1.4`, documented SHA-256), `MediaStore` Downloads insertion under unique display names per scenario with `@After` cleanup, R-D05 legacy fixture (or R-D05 dropped from this phase).
- `@Quarantined` policy defined here (inherited by later phases): flake ≥1× in 20 nightly runs; core/PR tasks exclude via `notAnnotation`; nightly includes; on quarantine open a tracking issue with 14-day review; **30 consecutive green nightly runs** required to de-quarantine (Phase 7 #120 is the authority — if it revises the threshold, update here in the same PR).
- Expect flakes: pin the DocumentsUI package, build a retry helper, and capture screenshots on failure (extending the GMD `additional-test-output` already wired up).
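The quarantine thresholds above (at least one flake in 20 nightly runs to quarantine, 30 consecutive green runs to de-quarantine) can be expressed as two small predicates over a test's nightly history. This is a sketch under stated assumptions: the `Outcome` type, the constant names, and both functions are hypothetical, the history is assumed to be ordered oldest-first, and any `FAIL` inside the window is treated as a flake, which a real implementation would need to distinguish from a genuine regression.

```kotlin
// Hypothetical model of one nightly result for a single test.
enum class Outcome { PASS, FAIL }

// Thresholds taken from the policy text; Phase 7 (#120) may revise them.
const val QUARANTINE_WINDOW = 20
const val DEQUARANTINE_STREAK = 30

// True when the most recent window contains at least one failure.
// Assumption: history is ordered oldest-first, and every FAIL counts as a flake.
fun shouldQuarantine(history: List<Outcome>): Boolean =
    history.takeLast(QUARANTINE_WINDOW).any { it == Outcome.FAIL }

// True when the trailing run of consecutive passes reaches the required streak.
fun canDequarantine(history: List<Outcome>): Boolean =
    history.takeLastWhile { it == Outcome.PASS }.size >= DEQUARANTINE_STREAK

fun main() {
    val oneFlake = List(19) { Outcome.PASS } + Outcome.FAIL
    println(shouldQuarantine(oneFlake))

    val recovering = listOf(Outcome.FAIL) + List(29) { Outcome.PASS }
    println(canDequarantine(recovering))
    println(canDequarantine(recovering + Outcome.PASS))
}
```

Keeping the two predicates separate mirrors the asymmetry in the policy: entering quarantine looks at a 20-run window, while leaving it requires an unbroken 30-run streak.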
### Phase 4 — Lifecycle + process death + R-A03 (2 days) — issue **#117**

- R-L01..07 and **R-A03** (image viewer gestures, routed here because gestures don't depend on real OS surfaces).
- Parametrize R-L01 / R-L02 across the screen list; long parametrized runs carry `@RegressionFull`, one canonical rotation + one canonical kill stay in `@Regression`.
- Helpers under `testsupport/lifecycle/`: `ProcessDeath.recreate` (state-preserve) and `ProcessDeath.kill` (hard kill via `am kill`, **never** `force-stop`). Gesture helpers under `testsupport/gestures/`: `Gestures.pinch / .doubleTap / .swipe` — distinct from Phase 3's `testsupport/system/`.
- R-L06 production prerequisites are **per-backend** (see §3.8): file-backed `Files.move(..., ATOMIC_MOVE, REPLACE_EXISTING)` with checked `renameTo` fallback (any `false` return treated as failed save), SAF documented staged-write protocol with orphan-`.tmp` recovery. Either backend can narrow to "clear error, no crash, recoverable retry" if its production change is descoped, with the narrowing documented in the test KDoc.
- R-A03 needs production test tags `TestTags.IMAGE_VIEWER_SURFACE` and `TestTags.IMAGE_VIEWER_IMAGE`, locating-only (do **not** expose scale/offset/matrix state via semantics). Assertions are user-observable (rendered bounds via `boundsInRoot`), never reads of private viewer state. Rotation-state contract documented in the test KDoc (preserve or reset — either acceptable).

### Phase 5 — Billing sandbox (2–3 days, separable) — issue **#118**, see [§5](#5-billing-sandbox-device-path) for the device path

- Pick the device path: **Option 2 selected** — external `billing-sandbox.yml` workflow running against a real Play-enabled device on a self-hosted runner, with the sandbox tier outside `regressionFullCheck`.
- Pick an **entitlement repeatability strategy** for the non-consumable `premium_lifetime` product: (a) disposable tester-account pool, (b) manual refund/revoke + ad-hoc invocation, or (c) split contract (R-P03 once-per-release manual; R-P04 nightly). Documented in PR description and test KDoc — without this R-P03 silently degrades to a restore on the second run.
- **Multi-precondition skip guard** as the first step of every scenario: `Assume.assumeTrue` checks `BILLING_SANDBOX_ACCOUNT`, signing material, selected device path, signed-in Play account, and a `BillingClient.queryProductDetailsAsync(premium_lifetime)` returning a non-empty list. Any failure skips with a specific assumption message. Replaces the previous narrow `BILLING_SANDBOX_ACCOUNT`-only guard.
- Publish a Play Console / CI runbook (e.g. `docs/billing-sandbox.md`) covering package registration, license-tester setup, internal-track signed APK at a `versionCode` ≥ the test artifact, CI secrets checklist, and manual refund/revoke steps. Reviewers reject the PR if missing.
- R-P03 / R-P04 ship per §3.7's full contracts (ADR-012 auto-navigate, explicit fresh-install definition for restore).

### Phase 6 — Macrobenchmark module (2 days) — issue **#119**

- New `:macrobenchmark` module (`com.android.test` + `androidx.benchmark:benchmark-macro-junit4`), M-01..04. Baseline thresholds calibrated over 5 runs; alert on >20% regression.
- Register a dedicated GMD entry inside `:macrobenchmark` so AGP generates `:macrobenchmark:pixel2api33BenchmarkAndroidTest`; wire that exact task into `regressionFullCheck`'s `dependsOn` **as a sibling outside any Phase 1 `GradleBuild` child invocation** so app-module instrumentation args don't leak into the benchmark run. Do **not** depend on `:macrobenchmark:connectedBenchmarkAndroidTest`.
- **Prerequisites that must land before benchmarks are meaningful:**
  - **A. Production trace sections for M-04.** Add `androidx.tracing` spans (`nav.garage_to_detail`, `nav.detail_to_record`) covering destination composition + transition animation frames (not the synchronous nav event), OR switch M-04 to `FrameTimingMetric` over a scripted journey. `TraceSectionMetric` against a missing or event-only span is treated as the same failure.
  - **B. Stable test tags.** `TestTags.VEHICLE_LIST`, `TestTags.vehicleItem(id)`, `TestTags.RECORD_LIST` wired in production composables, plus `Modifier.semantics { testTagsAsResourceId = true }` applied to the Compose hierarchy hosting them so `By.res(...)` in `:macrobenchmark` resolves.
  - **C. Cross-module fixture ownership.** Either (a) extract seeding into a shared `:testfixtures` module, (b) expose a benchmark-variant-only test hook (`ContentProvider` / `ActivityAction`) **not gated on `BuildConfig.DEBUG`** (the benchmark variant is non-debuggable; gate on the `benchmark` build type instead), or (c) intentional duplication with sync requirement.
  - **D. Benchmark build variant.** `targetProjectPath = ":app"`, dedicated `benchmark` build type (`debuggable = false`, `signingConfig = signingConfigs.debug`, `matchingFallbacks = listOf("release")`, `` on `:app`'s manifest); `androidx.benchmark.suppressErrors` only for known-acceptable conditions (e.g. `EMULATOR`); macrobenchmark JSON output uploaded as a CI artifact on every run.
- **Threshold policy** documented in the module README and enforced: **hard-fail** on relative regression > 20% against a committed 5-run baseline; absolute NFR targets (e.g. M-01 P50 ≤ 2000ms) are **advisory only** (warn + artifact, never red-X CI). Swiftshader absolute numbers are not authoritative; authoritative perf still requires a physical-device run outside this aggregate.
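
The threshold policy above separates a hard-failing relative check from an advisory absolute check. A minimal sketch of that decision follows, assuming the committed 5-run baseline is reduced to its median (the plan says "5-run baseline" but does not name the aggregate). `PerfVerdict`, `median`, and `evaluate` are hypothetical names, not part of `androidx.benchmark`.

```kotlin
/** Result of comparing one benchmark metric against its committed baseline. */
sealed interface PerfVerdict {
    object Pass : PerfVerdict
    data class AdvisoryWarning(val message: String) : PerfVerdict
    data class HardFail(val message: String) : PerfVerdict
}

/** Median of a non-empty list of samples. */
fun median(values: List<Double>): Double {
    require(values.isNotEmpty()) { "median of an empty list is undefined" }
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

/**
 * Hard-fails only on relative regression above 20% against the baseline;
 * an exceeded absolute NFR target yields a warning, never a failure.
 * Assumption: the baseline aggregate is the median of the 5 committed runs.
 */
fun evaluate(
    baselineRunsMs: List<Double>,
    currentMs: Double,
    advisoryTargetMs: Double? = null,
    maxRelativeRegression: Double = 0.20,
): PerfVerdict {
    require(baselineRunsMs.size == 5) { "baseline must contain exactly 5 runs" }
    val baseline = median(baselineRunsMs)
    require(baseline > 0.0) { "baseline must be positive" }
    val relative = (currentMs - baseline) / baseline
    return when {
        relative > maxRelativeRegression ->
            PerfVerdict.HardFail("regressed ${(relative * 100).toInt()}% vs baseline $baseline ms")
        advisoryTargetMs != null && currentMs > advisoryTargetMs ->
            PerfVerdict.AdvisoryWarning("$currentMs ms exceeds advisory target $advisoryTargetMs ms")
        else -> PerfVerdict.Pass
    }
}
```

With a baseline median of 1000 ms, a 1250 ms run is a `HardFail`, while a run that is within 20% of its own baseline but above an absolute target such as 2000 ms only produces an `AdvisoryWarning`.
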
### Phase 7 — Hardening & nightly bake (ongoing) — issue **#120**

- **Quarantine metadata.** Promote `@Quarantined` to carry `since: String` (ISO-8601), `issue: Int`, `reason: String` — or maintain a sidecar registry (`quarantine-registry.json`) keyed by fully-qualified test name with a JVM-test guard against divergence. Existing usages migrated in the same PR.
- **Orphan-quarantine guard.** Every `@Quarantined` test must also carry one of `@Regression` / `@RegressionFull` / `@BillingSandbox`; CI guard fails the build on orphans.
- **Stable scenario IDs.** Adopt spec IDs (`R-A01`, `M-03`, …) as the dashboard's stable key via `@ScenarioId("R-A01")` on the test method or a naming convention (`fun R_A01_…()`). Parameterized variants suffix and roll up.
- **Flake-rate semantics** documented explicitly (passed-on-retry counts as flake; `Assume.assumeTrue` skips excluded from denominator; `@Ignore` excluded entirely; infrastructure failures recorded separately; benchmark failures categorised separately; cancellations excluded). Applied in the aggregator math.
- **Durable history.** Pick one of (a) committed dashboard file (`docs/regression-dashboard.md` + `docs/regression-history.json` via auto-PR / orphan branch), (b) retained JSON artifacts + GitHub API, or (c) external store. Default recommendation: (a) for the published dashboard, (b) as the raw input the aggregator reads.
- **Workflow permissions** declared explicitly per chosen source — `contents: write` + `pull-requests: write` on the dashboard-writing job, `actions: read` for prior-run lookup, `issues: write` for flake-comment posting; the main test-running job in `regression.yml` stays `contents: read`.
- **Promotion playbook** (`docs/regression-promotion-playbook.md`): promote on 100 consecutive nightly passes on the stable scenario ID at runtime < 30s with no infra noise; demote on any nightly failure not reproduced as a real regression within 24h or runtime drift > 50%; de-quarantine threshold is owned by **Phase 7 #120** (30 consecutive green nightly runs); Phase 3 #116's quarantine policy tracks that value, and if changed it changes in #120 first.
- **Tier audit pass.** Either a real promotion/demotion is executed, **or** the PR explicitly documents "no eligible candidate yet" with the dashboard linked as proof — the acceptance is *not* "at least one test promoted/demoted" (no artificial churn).

## 5. Billing sandbox device path

The existing GMD pool in `app/build.gradle.kts` is AOSP-only (`systemImageSource = "aosp"` on every entry). Real Google Play Billing flows require a device with Google Play Services and the Play Store app present, which AOSP images do not ship. The plan therefore needs an explicit, separate device path for `@BillingSandbox` tests — *not* shoehorned into the AOSP `pixel2api33` device.

Two acceptable options:

1. **Play-enabled GMD entry with a signed non-debug test build type.** Add a new managed device, e.g. `pixel2api33Play`, with `systemImageSource = "google_apis_playstore"` — this is AGP's documented source for a Play Store image, which Billing requires. (`"google"` / `"google_apis"` ship Google APIs but no Play Store, so they cannot drive real Billing purchase flows.) Add a new build type alongside the existing `debug`/`release` — e.g. `staging` — signed with the same upload key as the Play internal track, and arrange instrumentation to use it for this device only. The current build sets a global `testBuildType = testVariant` in `app/build.gradle.kts:68` (with `testVariant` resolved at line 43 to `debug` outside the screenshot-build branch), so this needs either a per-device override or a top-level switch when the sandbox device is selected; either way, the resulting task name carries the build type (e.g. `pixel2api33PlayStagingAndroidTest`), and `regressionFullSandboxCheck` targets that task — *not* `pixel2api33PlayDebugAndroidTest`, because a debug-signed APK is not what Play has on the internal track and may fail real Billing purchase/restore. Caveats: Play Store images on GMD are interactive (require Google account sign-in on first boot) and historically harder to drive headlessly; this needs a one-time human setup of the license-tester account on the snapshot, plus the signing key has to be available to the runner that builds the staging APK.
2. **Manual / external device path (recommended when key management on the GMD runner is impractical).** Keep real billing verification outside GMD entirely — run `@BillingSandbox` tests on a real Play-enabled device or a Firebase Test Lab "physical" device, triggered by a separate workflow (`billing-sandbox.yml`) that nightly-builds an internal-track signed APK from a dedicated staging/release build type (so Play recognises the artifact) and runs the sandbox tier against a Play-enabled target. `regressionFullCheck` in this option does *not* dispatch the sandbox invocation locally and instead reports "sandbox runs in the dedicated billing workflow." This option avoids putting the upload key on the GMD runner and skips the headless Play Store sign-in problem from Option 1.

Phase 5 must pick one of these before any `@BillingSandbox` tests are written.
Local developer runs skip the tier via the multi-precondition `Assume.assumeTrue` guard described in §4 Phase 5 — any of `BILLING_SANDBOX_ACCOUNT`, signing material, device-path availability, signed-in Play account, or `queryProductDetailsAsync(premium_lifetime)` missing causes a clean skip with a specific assumption message, regardless of the option chosen above.
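
The "specific assumption message" behaviour of the multi-precondition guard can be sketched in plain Kotlin, independent of the Android and Billing APIs. Everything named here (`Precondition`, `firstUnmet`, `billingPreconditions`, and the boolean inputs) is hypothetical; in a real test the inputs would come from the environment, the device, and a `queryProductDetailsAsync` result, and the returned message would be handed to JUnit 4's `Assume.assumeTrue(String, boolean)`.

```kotlin
/** One sandbox prerequisite: a readable name plus a lazily evaluated check. */
data class Precondition(val name: String, val isMet: () -> Boolean)

/** Returns a skip message naming the first unmet precondition, or null if all are met. */
fun firstUnmet(preconditions: List<Precondition>): String? =
    preconditions.firstOrNull { !it.isMet() }
        ?.let { "Billing sandbox skipped: ${it.name} is missing" }

/**
 * Builds the five checks listed in the plan. All parameters are stand-ins
 * (assumptions) for values a real test would read from CI and the device.
 */
fun billingPreconditions(
    env: Map<String, String>,
    signingMaterialAvailable: Boolean,
    devicePathSelected: Boolean,
    playAccountSignedIn: Boolean,
    premiumLifetimeDetailsCount: Int,
): List<Precondition> = listOf(
    Precondition("BILLING_SANDBOX_ACCOUNT") { !env["BILLING_SANDBOX_ACCOUNT"].isNullOrBlank() },
    Precondition("signing material") { signingMaterialAvailable },
    Precondition("selected device path") { devicePathSelected },
    Precondition("signed-in Play account") { playAccountSignedIn },
    Precondition("premium_lifetime product details") { premiumLifetimeDetailsCount > 0 },
)
```

Evaluating the checks in order and stopping at the first unmet one keeps the skip reason unambiguous, so a local run without `BILLING_SANDBOX_ACCOUNT` reports that variable rather than a generic billing failure.
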