AI SRE: Design Partner One-Pager

What we're building

An AI Incident Assistant that reduces mean time to resolution (MTTR) for incidents through:

Alert ingestion: Receive alerts from PagerDuty, Opsgenie, or webhooks (e.g. Prometheus Alertmanager) and normalize them to one incident model.
Log summarization: Fetch relevant logs (e.g. from Grafana Loki) and produce a short incident summary.
Slack assistant: In Slack, ask "Summarize this incident" or "What's the likely cause?"; optional approval step for safe actions.
Limited safe actions: Restart a pod or scale a deployment, with guardrails and dry-run.
Config fix suggestions: Suggest resource limit or replica changes and open a PR (no auto-merge).
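
To make "normalize to one incident model" concrete, here is a minimal sketch of how two alert sources might be mapped onto a shared record. It is an illustration under stated assumptions, not the product's actual schema: the `Incident` fields, the severity mapping, and the function names are hypothetical, and the payload shapes are simplified versions of the Prometheus Alertmanager webhook and PagerDuty V3 webhook formats that should be checked against the vendor documentation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Incident:
    """Hypothetical unified incident record (fields are assumptions)."""
    source: str
    external_id: str
    title: str
    severity: str  # "critical" or "warning" in this sketch
    status: str    # "open" or "resolved"


def from_alertmanager(payload: dict[str, Any]) -> list[Incident]:
    """Map a simplified Alertmanager webhook body to incidents (one per alert)."""
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append(Incident(
            source="alertmanager",
            external_id=alert.get("fingerprint", ""),
            # Prefer the human-written summary, fall back to the alert name.
            title=annotations.get("summary") or labels.get("alertname", "unknown alert"),
            severity=labels.get("severity", "warning"),
            status="resolved" if alert.get("status") == "resolved" else "open",
        ))
    return incidents


def from_pagerduty(payload: dict[str, Any]) -> list[Incident]:
    """Map a simplified PagerDuty webhook event to a single incident."""
    event = payload.get("event", {})
    data = event.get("data", {})
    # Assumed mapping: high urgency is treated as critical, anything else as warning.
    severity = "critical" if data.get("urgency") == "high" else "warning"
    return [Incident(
        source="pagerduty",
        external_id=data.get("id", ""),
        title=data.get("title", "unknown incident"),
        severity=severity,
        status="resolved" if event.get("event_type") == "incident.resolved" else "open",
    )]


# Registry of per-source normalizers; a new source would add one entry here.
NORMALIZERS = {
    "alertmanager": from_alertmanager,
    "pagerduty": from_pagerduty,
}


def normalize(source: str, payload: dict[str, Any]) -> list[Incident]:
    """Dispatch a raw payload to the normalizer registered for its source."""
    try:
        normalizer = NORMALIZERS[source]
    except KeyError:
        raise ValueError(f"unsupported alert source: {source}") from None
    return normalizer(payload)
```

With this shape, downstream steps such as log summarization and the Slack assistant would only need to understand `Incident`, regardless of which alert source fired.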

What we ask of design partners

Staging or low-risk cluster: A Kubernetes cluster we can run the agent against (alert forwarding only is fine at first).
Weekly 30-min sync: Feedback on summaries, suggested actions, and UX.
Permission to quote results: e.g. "X% MTTR reduction" or "Time to first response under 2 minutes" (anonymized if you prefer).

What you need

Kubernetes (we support one cluster to start).
One alert source: PagerDuty, Opsgenie, or webhook (e.g. Prometheus Alertmanager).
Slack workspace for the bot.
Optional: Grafana Loki (or one log provider) for log summarization.

[Your contact / Calendly]