Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      From Data To Decisions: UX Strategies For Real-Time Dashboards

      September 13, 2025

      Honeycomb launches AI observability suite for developers

      September 13, 2025

      Low-Code vs No-Code Platforms for Node.js: What CTOs Must Know Before Investing

      September 12, 2025

      ServiceNow unveils Zurich AI platform

      September 12, 2025

      Building personal apps with open source and AI

      September 12, 2025

      What Can We Actually Do With corner-shape?

      September 12, 2025

      Craft, Clarity, and Care: The Story and Work of Mengchu Yao

      September 12, 2025

      Distribution Release: Q4OS 6.1

      September 12, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Optimizely Mission Control – Part III

      September 14, 2025
      Recent

      Optimizely Mission Control – Part III

      September 14, 2025

      Learning from PHP Log to File Example

      September 13, 2025

      Online EMI Calculator using PHP – Calculate Loan EMI, Interest, and Amortization Schedule

      September 13, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      sudo vs sudo-rs: What You Need to Know About the Rust Takeover of Classic Sudo Command

      September 14, 2025
      Recent

      sudo vs sudo-rs: What You Need to Know About the Rust Takeover of Classic Sudo Command

      September 14, 2025

      Dmitry — The Deep Magic

      September 13, 2025

      Right way to record and share our Terminal sessions

      September 13, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»News & Updates»How GitHub engineers tackle platform problems

    How GitHub engineers tackle platform problems

    June 10, 2025

    In my spare time I enjoy building Gundam models, which are model kits to build iconic mechas from the Gundam universe. You might be wondering what this has to do with software engineering. Product engineers can be seen as the engineers who take these kits and build the Gundam itself. They are able to utilize all pieces and build a working product that is fun to collect or even play with!

    Platform engineers, on the other hand, supply the tools needed to build these kits (like clippers and files) and maybe even build a cool display so everyone can see the final product. They ensure that whoever is constructing it has all the necessary tools, even if they don’t physically build the Gundam themselves.

    A photograph of several Gundam models on a shelf.

    About a year ago, my team at GitHub moved to the infrastructure organization, inheriting new roles and Areas of Responsibility (AoRs). Previously, the team had tackled external customer problems, such as building the new deployment views across environments. This involved interacting with users who depend on GitHub to address challenges within their respective industries. Our new customers as a platform engineering team are internal, which makes our responsibilities different from the product-focused engineering work we were doing before.

    Going back to my Gundam example, rather than constructing kits, we’re now responsible for building the components of the kits. Adapting to this change meant I had to rethink my approach to code testing and problem solving.

    Whether you’re working on product engineering or on the platform side, here are a few best practices to tackle platform problems.

    Understanding your domain

    One of the most critical steps before tackling problems is understanding the domain. A “domain” is the business and technical subject area in which a team and platform organization operate. This requires gaining an understanding of technical terms and how these systems interact to provide fast and reliable solutions. Here’s how to get up to speed: 

    • Talk to your neighbors: Arrange a handover meeting with a team that has more knowledge and experience with the subject matter. This meeting provides an opportunity to ask questions about terminology and gain a deeper understanding of the problems the team will be addressing. 
    • Investigate old issues: If there is a backlog of issues that are either stale or still persistent, they may give you a better understanding of the system’s current limitations and potential areas for improvement.
    • Read the docs: Documentation is a goldmine of knowledge that can help you understand how the system works. 

    Bridging concepts to platform-specific skills

    While the preceding advice offers general guidance applicable to both product and platform teams, platform teams — serving as the foundational layer — necessitate a more in-depth understanding.

    • Networks: Understanding network fundamentals is crucial for all engineers, even those not directly involved in network operations. This includes concepts like TCP, UDP, and L4 load balancing, as well as debugging tools such as dig. A solid grasp of these areas is essential to comprehend how network traffic impacts your platform.
    • Operating systems and hardware: Selecting appropriate virtual machines (VMs) or physical hardware is vital for both scalability and cost management. Making well-informed choices for particular applications requires a strong grasp of both. This is closely linked to choosing the right operating system for your machines, which is important to avoid systems with vulnerabilities or those nearing end of life.
    • Infrastructure as Code (IaC): Automation tools like Terraform, Ansible, and Consul are becoming increasingly essential. Proficiency in these tools is becoming a necessity as they significantly decrease human error during infrastructure provisioning and modifications. 
    • Distributed systems: Dealing with platform issues, particularly in distributed systems, necessitates a deep understanding that failures are inevitable. Consequently, employing proactive solutions like failover and recovery mechanisms is crucial for preserving system reliability and preventing adverse user experiences. The optimal approach for this depends entirely on the specific problem and the desired system behavior.

    Knowledge sharing

    By sharing lessons and ideas, engineers can introduce new perspectives that lead to breakthroughs and innovations. Taking the time to understand why a project or solution did or didn’t work and sharing those findings provides new perspectives that we can use going forward.

    Here are three reasons why knowledge sharing is so important: 

    • Teamwork makes the dream work: Collaboration often results in quicker problem resolution and fosters new solution innovation, as engineers have the opportunity to learn from each other and expand upon existing ideas.
    • Prevent lost knowledge: If we don’t share our lessons learned, we prevent the information from being disseminated across the team or organization. This becomes a problem if an engineer leaves the company or is simply unavailable.
    • Improve our customer success: As engineers, our solutions should effectively serve our customers. By sharing our knowledge and lessons learned, we can help the team build reliable, scalable, and secure platforms, which will enable us to create better products that meet customer needs and expectations!

    But big differences start to appear between product engineering and infrastructure engineering when it comes to the impact radius and the testing process.

    Impact radius

    With platforms being the fundamental building blocks of a system, any change (small or large) can affect a wide range of products. Our team is responsible for DNS, a foundational service that impacts numerous products. Even a minor alteration to this service can have extensive repercussions, potentially disrupting access to content across our site and affecting products ranging from GitHub Pages to GitHub Copilot. 

    • Understand the radius: Or understand the downstream dependencies. Direct communication with teams that depend on our service provides valuable insights into how proposed changes may affect other services.
    • Postmortems: By looking at past incidents related to our platform and asking “What is the impact of this incident?”, we can form more context around what change or failure was introduced, how our platform played a role in it, and how it was fixed.
    • Monitoring and telemetry: Condense important monitoring and logging into a small and quickly digestible medium to give you the general health of the system. This could be a Single Availability Metric (SAM), for example. The ability to quickly glance at a single dashboard allows engineers to rapidly pinpoint the source of an issue and streamlines the debugging and incident mitigation process, as compared to searching through and interpreting detailed monitors or log messages.

    Testing changes

    Testing changes in a distributed environment can be challenging, especially for services like DNS. A crucial step in solving this issue is utilizing a test site as a “real” machine where you can implement and assess all your changes. 

    • Infrastructure as Code (IaC): When using tools like Terraform or Ansible, it’s crucial to test fundamental operations like provisioning and deprovisioning machines. There are circumstances where a machine will need to be re-provisioned. In these cases, we want to ensure the machine is not accidentally deleted and that we retain the ability to create a new one if needed.
    • End-to-End (E2E): Begin directing some network traffic to these servers. Then the team can observe host behavior by directly interacting with it, or we can evaluate functionality by diverting a small portion of traffic.
    • Self-healing: We want to test the platform’s ability to recover from unexpected loads and identify bottlenecks before they impact our users. Early identification of bottlenecks or bugs is crucial for maintaining the health of our platform.

    Ideally changes will be implemented on a host-by-host basis once testing is complete. This approach allows for individual machine rollback and prevents changes from being applied to unaffected hosts.

    What to remember

    Platform engineering can be difficult. The systems GitHub operates with are complex and there are a lot of services and moving parts. However, there’s nothing like seeing everything come together. All the hard work our engineering teams do behind the scenes really pays off when the platform is running smoothly and teams are able to ship faster and more reliably — which allows GitHub to be the home to all developers.

    Want to dive deeper? Check out our infrastructure related blog posts.

    The post How GitHub engineers tackle platform problems appeared first on The GitHub Blog.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleUbuntu 24.10 Support Ends July 10th – Upgrade Soon
    Next Article CVE-2025-4678 – Pandora ITSM OS Command Injection

    Related Posts

    News & Updates

    Building personal apps with open source and AI

    September 12, 2025
    News & Updates

    What Can We Actually Do With corner-shape?

    September 12, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    Apple unveils Liquid Glass at WWDC, but all I see is a sorry imitation of Windows Vista’s Aero Glass

    News & Updates

    Microsoft Gaming CEO Phil Spencer is trying hard to avoid leaking all his Xbox plans

    News & Updates

    CVE-2024-47081 – Requests .netrc Credential Leakage Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    AI-Driven Antitrust and Competition Law: Algorithmic Collusion, Self-Learning Pricing Tools, and Legal Challenges in the US and EU

    Machine Learning

    Highlights

    News & Updates

    Xbox and Obsidian just shared the Grounded 2 Early Access roadmap — here’s when the Game Pass sequel is getting new content and language updates

    July 29, 2025

    Following the launch of Grounded 2 on Xbox and PC, developer Obsidian has shared the…

    I’m a tech expert, and these Fourth of July phone deals are worth upgrading to

    June 26, 2025

    Firefox 141: l’AI è un problema, i consumi di CPU e batteria aumentano

    August 13, 2025

    Commodore OS is a fan-made Commodore inspired Linux distribution

    April 13, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.