Slack outage post mortem

11/24/2023

An incident is any unplanned event that causes a service disruption. Identifying and resolving incidents helps us improve: we’ll make Sourcegraph a higher quality product, we’ll improve the processes that lead to or around the incident, and we’ll reduce friction around identifying incidents in the future.

Examples of incidents:

- The service is down or a critical feature is broken.
- There is a security issue with Sourcegraph (if so, please also follow our security disclosure process).
- A Sourcegraph team member feels like an incident might be present, but isn’t certain or isn’t able to confirm on their own.

The following situations warrant attention from marketing:

- The service is down for more than 5 minutes, a critical feature is down for more than 5 minutes, or we’re aware of a service degradation issue that more than 5 users have reported. If you’re unsure whether the incident’s impact qualifies, ask in Slack for advice.
- We have an issue (per our standard SLA definition) that impacts all or many self-hosted instances, managed instances, or Cloud/SaaS users.
- We need to do critical proactive 1-to-many communication to all self-hosted customers (for example, making them aware of something they need to do in a certain upgrade, like the prep needed before upgrading to 3.31). Over time, as we do more of this, we will likely create a separate process for it.

Additionally, on big announcement days (funding, product launch, campaign launch, etc.), all incidents warrant more immediate attention from marketing so we can hold off on planned activities or be prepared to respond to issues. On these days, the person in marketing leading the announcement is responsible for looping in #customer-support and engineering/product ahead of time to ensure they are aware of planned activities. The person leading the announcement will work with #customer-support on the ad-hoc plan for incidents (which may involve on-call rotation).

All incidents are announced in the #announce-incidents channel automatically through incident.io, and past incidents are available in the incident.io dashboard and our past incident postmortems.

Incidents can be identified by anyone (e.g. customers, Sourcegraph teammates) via incident.io. Even incidents that might turn out to be false positives should be reported, to ensure that they are responded to and investigated with the same rigor as any incident, and that any lessons can be learnt.

The first Sourcegraph teammate (regardless of their role) who becomes aware of an incident, or suspects there may be an incident, is responsible for taking the following actions:

- If the incident was reported by someone outside of Sourcegraph, acknowledge that the incident is being handled.
- Start a new incident with incident.io. Set the description and severity from the modal in Slack. This will create a new chatroom in Slack where all other communication should occur.
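
The impact thresholds mentioned above (more than 5 minutes of downtime, or a degradation reported by more than 5 users) can be written down as a small check. This is only an illustrative sketch: the `Impact` class, its field names, and the `impact_qualifies` function are hypothetical and not part of any Sourcegraph or incident.io tooling. Only the two threshold values come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else here is a hypothetical sketch.
DOWNTIME_THRESHOLD_MINUTES = 5   # "down for more than 5 minutes"
USER_REPORT_THRESHOLD = 5        # "more than 5 users have reported"


@dataclass
class Impact:
    """Hypothetical summary of what is known about an incident's impact."""
    site_down_minutes: float = 0.0
    critical_feature_down_minutes: float = 0.0
    degradation_reports: int = 0


def impact_qualifies(impact: Impact) -> bool:
    """Return True if any one of the three impact thresholds is exceeded."""
    return (
        impact.site_down_minutes > DOWNTIME_THRESHOLD_MINUTES
        or impact.critical_feature_down_minutes > DOWNTIME_THRESHOLD_MINUTES
        or impact.degradation_reports > USER_REPORT_THRESHOLD
    )


if __name__ == "__main__":
    print(impact_qualifies(Impact(site_down_minutes=7)))
    print(impact_qualifies(Impact(degradation_reports=5)))
```

The comparisons are strict, so exactly 5 minutes or exactly 5 reports does not qualify on its own. In borderline cases the text's advice still applies: ask in Slack.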
