AI SRE — Design Partner One-Pager

What we're building

An AI Incident Assistant that reduces mean time to resolution (MTTR) for incidents through:

  • Alert ingestion — Receive alerts from PagerDuty, webhooks (e.g. Prometheus), Opsgenie; normalize to one incident model.
  • Log summarization — Fetch relevant logs (e.g. Loki) and produce a short incident summary.
  • Slack assistant — In Slack: "Summarize this incident", "What's the likely cause?"; optional approval for safe actions.
  • Limited safe actions — Restart pod, scale deployment (with guardrails and dry-run).
  • Config fix suggestions — Suggest resource limits or replica changes; open a PR (no auto-merge).
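
As an illustration of what "normalize to one incident model" could look like, here is a minimal Python sketch. The `Incident` class, its field names, and the `from_alertmanager` helper are hypothetical and not the product's actual schema. The sketch relies only on the Prometheus Alertmanager webhook shape (a top-level "alerts" list whose items carry "labels" and "annotations"), and it assumes the "severity" and "service" labels are set by your alert rules, which is a common convention but not guaranteed.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Incident:
    """Hypothetical unified incident model; field names are illustrative."""
    source: str
    title: str
    severity: str
    service: str


def from_alertmanager(payload: dict[str, Any]) -> list[Incident]:
    """Map an Alertmanager webhook payload to a list of Incident objects.

    Missing labels or annotations fall back to "unknown" rather than
    raising, so one malformed alert does not drop the whole batch.
    """
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append(
            Incident(
                source="alertmanager",
                # Prefer the human-written summary, else the alert name.
                title=annotations.get("summary", labels.get("alertname", "unknown")),
                severity=labels.get("severity", "unknown"),
                service=labels.get("service", "unknown"),
            )
        )
    return incidents
```

A PagerDuty or Opsgenie adapter would follow the same pattern: one small function per source that returns the same `Incident` type, so summarization and the Slack assistant only ever see one shape.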

What we need from you (4–8 weeks)

  • Staging or low-risk cluster — Kubernetes cluster we can run the agent against (or alert-forwarding only at first).
  • Weekly 30-min sync — Feedback on summaries, suggested actions, and UX.
  • Permission to quote results — e.g. "X% MTTR reduction" or "Time to first response under 2 minutes" (anonymized if you prefer).
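
For clarity on how a quoted figure such as "X% MTTR reduction" could be derived, here is a minimal Python sketch. The function names are hypothetical, and it assumes each incident is recorded as an (opened, resolved) timestamp pair; the actual measurement method would be agreed during the pilot.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolution over (opened, resolved) timestamp pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


def percent_reduction(baseline: timedelta, pilot: timedelta) -> float:
    """Percentage drop in MTTR from baseline to pilot (positive means faster)."""
    if baseline <= timedelta(0):
        raise ValueError("baseline MTTR must be positive")
    return 100.0 * (baseline - pilot) / baseline
```

With made-up numbers for illustration only: a baseline MTTR of 50 minutes and a pilot MTTR of 35 minutes would give a reduction of about 30%.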

What you get

  • Product at cost or free during the pilot.
  • Early input on features and safe-action set.
  • Evidence from your own incidents of how much the agent reduces MTTR, before we scale.

Tech requirements

  • Kubernetes (we support one cluster to start).
  • One alert source: PagerDuty, Opsgenie, or webhook (e.g. Prometheus Alertmanager).
  • Slack workspace for the bot.
  • Optional: Grafana Loki (or one log provider) for log summarization.

Contact

[Your contact / Calendly]