In July, we experienced four incidents that resulted in degraded performance across GitHub services.
July 5 16:31 UTC (lasting 97 minutes)
On July 5, between 16:31 and 18:08 UTC, the Webhooks service experienced degraded performance, resulting in delayed webhook deliveries with an average delay of 24 minutes and a maximum of 71 minutes. The issue was triggered by a configuration change that removed authentication from Webhooks’ background job requests, causing those requests to be rejected. Since Webhooks relies on this job infrastructure, external webhook delivery failed. Delivery resumed once the configuration was restored.
Following the initial fix, a secondary issue from 18:21 to 21:14 UTC caused further delays in GitHub Actions runs on pull requests: failing health probes in the background job processing service created a crash loop in the background job API layer, reducing its capacity. The reduced capacity added an average delay of 45 seconds, and a maximum of 1 minute 54 seconds, to job delivery. This was resolved with a service deployment.
To improve incident detection, we have updated our dashboards, improved our health checks, and introduced new alerts for similar issues. We are also focused on minimizing the impact of such incidents in the future through better workload isolation.
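The report does not describe how the failing probes were configured, so the following is only a rough illustration of the underlying idea. This Go sketch (the endpoint paths and the checkJobBackend helper are hypothetical) separates a liveness probe, which reflects only whether the process is running, from a readiness probe, which reflects dependency health, so that a failing dependency takes an instance out of rotation rather than driving it into a restart loop:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkJobBackend stands in for a real dependency check, such as a short
// authenticated request to the background job queue. Hypothetical helper.
func checkJobBackend(timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get("http://jobs.internal/healthz") // assumed internal endpoint
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("job backend returned %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Liveness: succeeds as long as the process itself is running. An
	// orchestrator that restarts on liveness failures will not crash-loop
	// when a dependency is down, because dependency state is not checked here.
	http.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: reports whether this instance can usefully serve traffic.
	// A failure here removes the instance from load balancing instead of
	// restarting it.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if err := checkJobBackend(500 * time.Millisecond); err != nil {
			http.Error(w, "job backend unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```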
July 13 00:01 UTC (lasting 19 hours and 26 minutes)
On July 13, between 00:01 and 19:27 UTC, the GitHub Copilot service was degraded. During this period, the error rate for Copilot code completions reached 1.16%, and the error rate for GitHub Copilot Chat peaked at 63%. We rerouted Copilot Chat traffic between 01:00 and 02:00 UTC, reducing Copilot Chat error rates to below 6%; the Copilot code completions error rate generally stayed below 1%. Customers may have experienced delays, errors, or timeouts for Copilot completions and Copilot Chat during this period. GitHub code scanning autofix dropped suggested fixes between 00:01 and 12:38 UTC, and delayed, but eventually completed, suggested fixes between 12:38 and 21:38 UTC.
We determined that the issue originated from a resource cleanup job executed by a partner service on July 13, which mistakenly targeted a resource group containing essential resources, leading to their removal. The job was stopped in time to preserve some resources, allowing GitHub to mitigate the impact while resources were being restored.
We are collaborating with partner services to implement safeguards against similar incidents, and we are enhancing our traffic rerouting processes so we can mitigate faster in the future.
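The post does not say what form these safeguards will take. One common pattern, sketched below in Go with entirely hypothetical names, is to require cleanup jobs to consult an explicit protection marker (in practice often a management lock or a do-not-delete tag rather than a hard-coded list) before deleting any resource group:

```go
package main

import (
	"errors"
	"fmt"
)

// protectedGroups marks resource groups that a cleanup job must never touch.
// The group names here are purely illustrative.
var protectedGroups = map[string]bool{
	"copilot-prod-core": true, // hypothetical name for an essential group
}

var errProtected = errors.New("resource group is protected from cleanup")

// cleanupResourceGroup deletes a resource group only if it is not protected.
func cleanupResourceGroup(name string, deleteFn func(string) error) error {
	if protectedGroups[name] {
		return fmt.Errorf("refusing to delete %q: %w", name, errProtected)
	}
	return deleteFn(name)
}

func main() {
	deleteFn := func(name string) error {
		fmt.Println("deleting", name) // stand-in for the real provider API call
		return nil
	}
	// A stale test group is removed; the essential group is refused.
	fmt.Println(cleanupResourceGroup("copilot-loadtest-2024-07", deleteFn))
	fmt.Println(cleanupResourceGroup("copilot-prod-core", deleteFn))
}
```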
July 16 00:53 UTC (lasting 149 minutes)
On July 16, between 00:30 UTC and 03:07 UTC, Copilot Chat was degraded and rejected all requests. The error rate was close to 100% during this time period and customers would have received errors when attempting to use Copilot Chat.
The incident was triggered during routine maintenance by a service provider: GitHub services were disconnected and then overwhelmed the dependent service with reconnection attempts.
To mitigate this kind of issue in the future, we are improving our reconnection and circuit-breaking logic for dependent services so that we can recover from events like this seamlessly, without overwhelming the other service.
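The post does not describe the new reconnection logic, so the sketch below shows only one standard ingredient: capped exponential backoff with jitter on reconnect, which spreads a fleet's reconnection attempts out over time instead of letting them all arrive at once. The dial function, attempt limit, and timing constants are illustrative, not GitHub's actual values:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// reconnect retries dial with capped exponential backoff and full jitter, so
// that many clients reconnecting after provider maintenance do not stampede
// the dependent service at the same instant.
func reconnect(dial func() error, maxAttempts int) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := dial(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration up to the current backoff.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return errors.New("gave up reconnecting")
}

func main() {
	attempts := 0
	dial := func() error {
		attempts++
		if attempts < 4 { // pretend the dependency recovers on the 4th try
			return errors.New("connection refused")
		}
		return nil
	}
	fmt.Println(reconnect(dial, 10), "after", attempts, "attempts")
}
```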
July 18 22:47 UTC (lasting 231 minutes)
Beginning on July 18, 2024 at 22:38 UTC, network issues within an upstream provider led to degraded experiences across Actions, Copilot, and GitHub Pages services. During this time, up to 50% of Actions workflow jobs were stuck in the queuing state, including Pages deployments. Users also could not enable Actions or register self-hosted runners. This was caused by an unreachable backend resource in the Central US region. This resource is configured for geo-replication, but the replication configuration prevented resiliency when one region was unavailable. Updating the replication configuration mitigated the impact by allowing successful requests while one region was unavailable. By July 19 00:12 UTC, users saw some improvement in Actions jobs and full recovery of Pages. Standard hosted runners and self-hosted Actions workflows were healthy by 02:10 UTC, and large-hosted runners fully recovered at 02:38 UTC.
Copilot requests were also impacted, with up to 2% of Copilot Chat requests and 0.5% of Copilot code completions requests resulting in errors. Copilot Chat requests were routed to other regions after 20 minutes, while Copilot code completions requests took 45 minutes to reroute.
To mitigate these issues moving forward, we are enhancing our replication and failover workflows to better handle such situations and reduce the time needed to recover, minimizing the impact on customers.
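The exact replication and failover changes are not described in the report. As a general illustration of request-level regional failover, the Go sketch below tries each replica of a geo-replicated backend in preference order, so a single unreachable region does not fail the request; the endpoint URLs and request path are placeholders, not real GitHub endpoints:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// regionEndpoints lists replicas of a geo-replicated backend in preference
// order. These URLs are placeholders.
var regionEndpoints = []string{
	"https://backend.centralus.example.internal",
	"https://backend.eastus.example.internal",
}

// fetchWithFailover tries each regional replica in turn and returns the first
// successful response, falling back to the next region on errors or 5xx.
func fetchWithFailover(ctx context.Context, path string) (*http.Response, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for _, base := range regionEndpoints {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, base+path, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("%s returned %d", base, resp.StatusCode)
		}
		lastErr = err // remember the failure and fall through to the next region
	}
	return nil, fmt.Errorf("all regions failed: %w", lastErr)
}

func main() {
	resp, err := fetchWithFailover(context.Background(), "/runners/queue")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("served from a healthy region:", resp.Status)
}
```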
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.