AI SRE — Design Partner One-Pager
What we're building
An AI Incident Assistant that reduces mean time to resolution (MTTR) for incidents through:
- Alert ingestion — Receive alerts from PagerDuty, webhooks (e.g. Prometheus), Opsgenie; normalize to one incident model.
- Log summarization — Fetch relevant logs (e.g. Loki) and produce a short incident summary.
- Slack assistant — In Slack: "Summarize this incident", "What's the likely cause?"; optional approval for safe actions.
- Limited safe actions — Restart pod, scale deployment (with guardrails and dry-run).
- Config fix suggestions — Suggest resource limits or replica changes; open a PR (no auto-merge).
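To make the "guardrails and dry-run" idea for safe actions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the product's implementation: the names `ActionRequest`, `check_guardrails`, and `plan`, the allowed namespace, and the replica cap are all hypothetical, and the code never contacts a cluster.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical policy values; a real pilot would agree these with the partner.
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment"}
ALLOWED_NAMESPACES = {"staging"}
MAX_REPLICAS = 10


@dataclass(frozen=True)
class ActionRequest:
    action: str
    namespace: str
    target: str
    replicas: Optional[int] = None
    approved: bool = False  # e.g. set once a human approves in Slack
    dry_run: bool = True    # default: describe the action, do not perform it


def check_guardrails(req: ActionRequest) -> List[str]:
    """Return guardrail violations; an empty list means the request may proceed."""
    problems = []
    if req.action not in ALLOWED_ACTIONS:
        problems.append(f"action '{req.action}' is not in the safe-action set")
    if req.namespace not in ALLOWED_NAMESPACES:
        problems.append(f"namespace '{req.namespace}' is not allowed")
    if req.action == "scale_deployment":
        if req.replicas is None or not 1 <= req.replicas <= MAX_REPLICAS:
            problems.append(f"replicas must be between 1 and {MAX_REPLICAS}")
    if not req.dry_run and not req.approved:
        problems.append("live execution requires approval")
    return problems


def plan(req: ActionRequest) -> str:
    """Describe what would happen; this sketch never touches a cluster."""
    problems = check_guardrails(req)
    if problems:
        return "REJECTED: " + "; ".join(problems)
    mode = "DRY-RUN" if req.dry_run else "EXECUTE"
    detail = f" replicas={req.replicas}" if req.action == "scale_deployment" else ""
    return f"{mode}: {req.action} {req.namespace}/{req.target}{detail}"


print(plan(ActionRequest("restart_pod", "staging", "api-0")))
# prints: DRY-RUN: restart_pod staging/api-0
```

Defaulting `dry_run` to `True` and requiring `approved` for live runs means a forgotten flag fails safe: the assistant describes the action instead of performing it.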
What we need from you (4–8 weeks)
- Staging or low-risk cluster — A Kubernetes cluster we can run the agent against (or, to start, alert forwarding only).
- Weekly 30-min sync — Feedback on summaries, suggested actions, and UX.
- Permission to quote results — e.g. "X% MTTR reduction" or "Time to first response under 2 minutes" (anonymized if you prefer).
What you get
- Product at cost or free during the pilot.
- Early input on features and safe-action set.
- Evidence from your own environment of whether the agent reduces MTTR, before we scale.
Tech requirements
- Kubernetes (we support one cluster to start).
- One alert source: PagerDuty, Opsgenie, or webhook (e.g. Prometheus Alertmanager).
- Slack workspace for the bot.
- Optional: Grafana Loki (or one log provider) for log summarization.
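For a sense of what connecting an alert source involves, the sketch below maps a Prometheus Alertmanager webhook payload onto a single incident model. The `Incident` class and `from_alertmanager` function are hypothetical names used for illustration only. The payload fields read here (`alerts`, `labels`, `annotations`, `status`, `startsAt`, `fingerprint`) follow Alertmanager's webhook format; the `severity` label and `summary` annotation are common conventions rather than guaranteed fields, so the code falls back to defaults. PagerDuty and Opsgenie would each need their own adapter producing the same model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Incident:
    """Hypothetical normalized incident model shared by all alert sources."""
    source: str
    external_id: str
    title: str
    severity: str
    status: str
    started_at: str
    labels: Dict[str, str] = field(default_factory=dict)


def from_alertmanager(payload: Dict[str, Any]) -> List[Incident]:
    """Convert one Alertmanager webhook payload into a list of Incidents."""
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append(
            Incident(
                source="alertmanager",
                external_id=alert.get("fingerprint", ""),
                # Prefer the human-written summary; fall back to the alert name.
                title=annotations.get("summary") or labels.get("alertname", "unknown alert"),
                # "severity" is a convention, so default when it is absent.
                severity=labels.get("severity", "unknown"),
                status=alert.get("status", "firing"),
                started_at=alert.get("startsAt", ""),
                labels=dict(labels),
            )
        )
    return incidents


# Sample payload trimmed to the fields this sketch reads.
sample = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighErrorRate", "severity": "critical"},
            "annotations": {"summary": "5xx rate above 5% on checkout"},
            "startsAt": "2024-01-01T12:00:00Z",
            "fingerprint": "abc123",
        }
    ],
}
for incident in from_alertmanager(sample):
    print(incident.title, incident.severity)
```

Keeping one adapter function per source means the Slack assistant and log summarizer only ever see `Incident` objects, whichever alert source a partner picks.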
Contact
[Your contact / Calendly]