In July, we experienced four incidents that resulted in degraded performance across GitHub services.
July 5 16:31 UTC (lasting 97 minutes)
On July 5, between 16:31 and 18:08 UTC, the Webhooks service experienced degraded performance, resulting in delayed webhook deliveries with an average delay of 24 minutes and a maximum of 71 minutes. The issue was triggered by a configuration change that removed authentication from Webhooks’ background job requests, causing those requests to be rejected. Since Webhooks relies on this job infrastructure, external webhook delivery failed. Delivery resumed once the configuration was restored.
Following the initial fix, a secondary issue from 18:21 to 21:14 UTC caused further delays in GitHub Actions runs on pull requests: failing health probes in the background job processing service created a crash loop in the background job API layer, reducing its capacity. The reduced capacity added an average delay of 45 seconds, and a maximum of 1 minute 54 seconds, to job delivery. This was resolved with a service deployment.
To improve incident detection, we have updated our dashboards, improved our health checks, and introduced new alerts for similar issues. We are also focused on minimizing the impact of such incidents in the future through better workload isolation.
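The report does not describe how the failing probes were configured, so the following is only a rough illustration of the underlying idea. This Go sketch (the endpoint paths and the checkJobBackend helper are hypothetical) separates a liveness probe, which reflects only whether the process is running, from a readiness probe, which reflects dependency health, so that a failing dependency takes an instance out of rotation rather than driving it into a restart loop:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkJobBackend stands in for a real dependency check, such as a short
// authenticated request to the background job queue. Hypothetical helper.
func checkJobBackend(timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get("http://jobs.internal/healthz") // assumed internal endpoint
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("job backend returned %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Liveness: succeeds as long as the process itself is running. An
	// orchestrator that restarts on liveness failures will not crash-loop
	// when a dependency is down, because dependency state is not checked here.
	http.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: reports whether this instance can usefully serve traffic.
	// A failure here removes the instance from load balancing instead of
	// restarting it.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if err := checkJobBackend(500 * time.Millisecond); err != nil {
			http.Error(w, "job backend unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```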
July 13 00:01 UTC (lasting 19 hours and 26 minutes)
On July 13, between 00:01 and 19:27 UTC, the GitHub Copilot service was degraded. During this period, the error rate for Copilot code completions reached 1.16%, and the error rate for GitHub Copilot Chat peaked at 63%. We rerouted Copilot Chat traffic between 01:00 and 02:00 UTC, reducing Copilot Chat error rates to below 6%; the Copilot code completions error rate generally stayed below 1%. Customers may have experienced delays, errors, or timeouts for Copilot completions and Copilot Chat during this period. GitHub code scanning autofix dropped suggested fixes between 00:01 and 12:38 UTC, and delayed, but eventually completed, suggested fixes between 12:38 and 21:38 UTC.
We determined that the issue originated from a resource cleanup job executed by a partner service on July 13, which mistakenly targeted a resource group containing essential resources, leading to their removal. The job was stopped in time to preserve some resources, allowing GitHub to mitigate the impact while resources were being restored.
We are collaborating with partner services to implement safeguards against similar incidents, and we are enhancing our traffic rerouting processes so we can mitigate faster in the future.
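The post does not say what form these safeguards will take. One common pattern, sketched below in Go with entirely hypothetical names, is to require cleanup jobs to consult an explicit protection marker (in practice often a management lock or a do-not-delete tag rather than a hard-coded list) before deleting any resource group:

```go
package main

import (
	"errors"
	"fmt"
)

// protectedGroups marks resource groups that a cleanup job must never touch.
// The group names here are purely illustrative.
var protectedGroups = map[string]bool{
	"copilot-prod-core": true, // hypothetical name for an essential group
}

var errProtected = errors.New("resource group is protected from cleanup")

// cleanupResourceGroup deletes a resource group only if it is not protected.
func cleanupResourceGroup(name string, deleteFn func(string) error) error {
	if protectedGroups[name] {
		return fmt.Errorf("refusing to delete %q: %w", name, errProtected)
	}
	return deleteFn(name)
}

func main() {
	deleteFn := func(name string) error {
		fmt.Println("deleting", name) // stand-in for the real provider API call
		return nil
	}
	// A stale test group is removed; the essential group is refused.
	fmt.Println(cleanupResourceGroup("copilot-loadtest-2024-07", deleteFn))
	fmt.Println(cleanupResourceGroup("copilot-prod-core", deleteFn))
}
```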
July 16 00:53 UTC (lasting 149 minutes)
On July 16, between 00:30 UTC and 03:07 UTC, Copilot Chat was degraded and rejected all requests. The error rate was close to 100% during this time period and customers would have received errors when attempting to use Copilot Chat.
The incident was triggered during routine maintenance by a service provider: GitHub services were disconnected and then overwhelmed the dependent service with reconnection attempts.
To mitigate this kind of issue in the future, we are improving our reconnection and circuit-breaking logic for dependent services so that we can recover from events like this seamlessly, without overwhelming the other service.
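The post does not describe the new reconnection logic, so the sketch below shows only one standard ingredient: capped exponential backoff with jitter on reconnect, which spreads a fleet's reconnection attempts out over time instead of letting them all arrive at once. The dial function, attempt limit, and timing constants are illustrative, not GitHub's actual values:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// reconnect retries dial with capped exponential backoff and full jitter, so
// that many clients reconnecting after provider maintenance do not stampede
// the dependent service at the same instant.
func reconnect(dial func() error, maxAttempts int) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := dial(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration up to the current backoff.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return errors.New("gave up reconnecting")
}

func main() {
	attempts := 0
	dial := func() error {
		attempts++
		if attempts < 4 { // pretend the dependency recovers on the 4th try
			return errors.New("connection refused")
		}
		return nil
	}
	fmt.Println(reconnect(dial, 10), "after", attempts, "attempts")
}
```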
July 18 22:47 UTC (lasting 231 minutes)
Beginning on July 18, 2024 at 22:38 UTC, network issues within an upstream provider led to degraded experiences across Actions, Copilot, and GitHub Pages services. During this time, up to 50% of Actions workflow jobs were stuck in the queuing state, including Pages deployments. Users also could not enable Actions or register self-hosted runners. This was caused by an unreachable backend resource in the Central US region. This resource is configured for geo-replication, but the replication configuration prevented resiliency when one region was unavailable. Updating the replication configuration mitigated the impact by allowing successful requests while one region was unavailable. By July 19 00:12 UTC, users saw some improvement in Actions jobs and full recovery of Pages. Standard hosted runners and self-hosted Actions workflows were healthy by 02:10 UTC, and large-hosted runners fully recovered at 02:38 UTC.
Copilot requests were also impacted, with up to 2% of Copilot Chat requests and 0.5% of Copilot code completions requests resulting in errors. Copilot Chat requests were routed to other regions after 20 minutes, while Copilot code completions requests took 45 minutes to reroute.
To mitigate these issues moving forward, we are enhancing our replication and failover workflows to better handle such situations and reduce the time needed to recover, minimizing the impact on customers.
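The exact replication and failover changes are not described in the report. As a general illustration of request-level regional failover, the Go sketch below tries each replica of a geo-replicated backend in preference order, so a single unreachable region does not fail the request; the endpoint URLs and request path are placeholders, not real GitHub endpoints:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// regionEndpoints lists replicas of a geo-replicated backend in preference
// order. These URLs are placeholders.
var regionEndpoints = []string{
	"https://backend.centralus.example.internal",
	"https://backend.eastus.example.internal",
}

// fetchWithFailover tries each regional replica in turn and returns the first
// successful response, falling back to the next region on errors or 5xx.
func fetchWithFailover(ctx context.Context, path string) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for _, base := range regionEndpoints {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, base+path, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("%s returned %d", base, resp.StatusCode)
		}
		lastErr = err // remember the failure and fall through to the next region
	}
	return nil, fmt.Errorf("all regions failed: %w", lastErr)
}

func main() {
	resp, err := fetchWithFailover(context.Background(), "/runners/queue")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("served from a healthy region:", resp.Status)
}
```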
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.