Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      How To Prevent WordPress SQL Injection Attacks

      June 13, 2025

      This week in AI dev tools: Apple’s Foundations Model framework, Mistral’s first reasoning model, and more (June 13, 2025)

      June 13, 2025

      Open Talent platforms emerging to match skilled workers to needs, study finds

      June 13, 2025

      Java never goes out of style: Celebrating 30 years of the language

      June 12, 2025

      OneDrive for Mac will soon give you more flexible storage options

      June 13, 2025

      From The Editor’s Desk — new Windows Central community features, we’d like to hear from you!

      June 13, 2025

      New code strings attached to Xbox Game Pass suggests a price increase may be imminent

      June 13, 2025

      This could be the versatile laptop accessory I’ve been waiting for — Here’s why it stands out from other portable monitors

      June 13, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Worker Threads in Node.js: A Complete Guide for Multithreading in JavaScript

      June 13, 2025
      Recent

      Worker Threads in Node.js: A Complete Guide for Multithreading in JavaScript

      June 13, 2025

      Everybody’s gone lintin’

      June 13, 2025

      QAQ-QQ-AI-QUEST

      June 13, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      OneDrive for Mac will soon give you more flexible storage options

      June 13, 2025
      Recent

      OneDrive for Mac will soon give you more flexible storage options

      June 13, 2025

      From The Editor’s Desk — new Windows Central community features, we’d like to hear from you!

      June 13, 2025

      New code strings attached to Xbox Game Pass suggests a price increase may be imminent

      June 13, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»News & Updates»How GitHub engineers tackle platform problems

    How GitHub engineers tackle platform problems

    June 10, 2025

    In my spare time I enjoy building Gundam models, which are model kits to build iconic mechas from the Gundam universe. You might be wondering what this has to do with software engineering. Product engineers can be seen as the engineers who take these kits and build the Gundam itself. They are able to utilize all pieces and build a working product that is fun to collect or even play with!

    Platform engineers, on the other hand, supply the tools needed to build these kits (like clippers and files) and maybe even build a cool display so everyone can see the final product. They ensure that whoever is constructing it has all the necessary tools, even if they don’t physically build the Gundam themselves.

    A photograph of several Gundam models on a shelf.

    About a year ago, my team at GitHub moved to the infrastructure organization, inheriting new roles and Areas of Responsibility (AoRs). Previously, the team had tackled external customer problems, such as building the new deployment views across environments. This involved interacting with users who depend on GitHub to address challenges within their respective industries. Our new customers as a platform engineering team are internal, which makes our responsibilities different from the product-focused engineering work we were doing before.

    Going back to my Gundam example, rather than constructing kits, we’re now responsible for building the components of the kits. Adapting to this change meant I had to rethink my approach to code testing and problem solving.

    Whether you’re working on product engineering or on the platform side, here are a few best practices to tackle platform problems.

    Understanding your domain

    One of the most critical steps before tackling problems is understanding the domain. A “domain” is the business and technical subject area in which a team and platform organization operate. This requires gaining an understanding of technical terms and how these systems interact to provide fast and reliable solutions. Here’s how to get up to speed: 

    • Talk to your neighbors: Arrange a handover meeting with a team that has more knowledge and experience with the subject matter. This meeting provides an opportunity to ask questions about terminology and gain a deeper understanding of the problems the team will be addressing. 
    • Investigate old issues: If there is a backlog of issues that are either stale or still persistent, they may give you a better understanding of the system’s current limitations and potential areas for improvement.
    • Read the docs: Documentation is a goldmine of knowledge that can help you understand how the system works. 

    Bridging concepts to platform-specific skills

    While the preceding advice offers general guidance applicable to both product and platform teams, platform teams — serving as the foundational layer — necessitate a more in-depth understanding.

    • Networks: Understanding network fundamentals is crucial for all engineers, even those not directly involved in network operations. This includes concepts like TCP, UDP, and L4 load balancing, as well as debugging tools such as dig. A solid grasp of these areas is essential to comprehend how network traffic impacts your platform.
    • Operating systems and hardware: Selecting appropriate virtual machines (VMs) or physical hardware is vital for both scalability and cost management. Making well-informed choices for particular applications requires a strong grasp of both. This is closely linked to choosing the right operating system for your machines, which is important to avoid systems with vulnerabilities or those nearing end of life.
    • Infrastructure as Code (IaC): Automation tools like Terraform, Ansible, and Consul are becoming increasingly essential. Proficiency in these tools is becoming a necessity as they significantly decrease human error during infrastructure provisioning and modifications. 
    • Distributed systems: Dealing with platform issues, particularly in distributed systems, necessitates a deep understanding that failures are inevitable. Consequently, employing proactive solutions like failover and recovery mechanisms is crucial for preserving system reliability and preventing adverse user experiences. The optimal approach for this depends entirely on the specific problem and the desired system behavior.

    Knowledge sharing

    By sharing lessons and ideas, engineers can introduce new perspectives that lead to breakthroughs and innovations. Taking the time to understand why a project or solution did or didn’t work and sharing those findings provides new perspectives that we can use going forward.

    Here are three reasons why knowledge sharing is so important: 

    • Teamwork makes the dream work: Collaboration often results in quicker problem resolution and fosters new solution innovation, as engineers have the opportunity to learn from each other and expand upon existing ideas.
    • Prevent lost knowledge: If we don’t share our lessons learned, we prevent the information from being disseminated across the team or organization. This becomes a problem if an engineer leaves the company or is simply unavailable.
    • Improve our customer success: As engineers, our solutions should effectively serve our customers. By sharing our knowledge and lessons learned, we can help the team build reliable, scalable, and secure platforms, which will enable us to create better products that meet customer needs and expectations!

    But big differences start to appear between product engineering and infrastructure engineering when it comes to the impact radius and the testing process.

    Impact radius

    With platforms being the fundamental building blocks of a system, any change (small or large) can affect a wide range of products. Our team is responsible for DNS, a foundational service that impacts numerous products. Even a minor alteration to this service can have extensive repercussions, potentially disrupting access to content across our site and affecting products ranging from GitHub Pages to GitHub Copilot. 

    • Understand the radius: Or understand the downstream dependencies. Direct communication with teams that depend on our service provides valuable insights into how proposed changes may affect other services.
    • Postmortems: By looking at past incidents related to our platform and asking “What is the impact of this incident?”, we can form more context around what change or failure was introduced, how our platform played a role in it, and how it was fixed.
    • Monitoring and telemetry: Condense important monitoring and logging into a small and quickly digestible medium to give you the general health of the system. This could be a Single Availability Metric (SAM), for example. The ability to quickly glance at a single dashboard allows engineers to rapidly pinpoint the source of an issue and streamlines the debugging and incident mitigation process, as compared to searching through and interpreting detailed monitors or log messages.

    Testing changes

    Testing changes in a distributed environment can be challenging, especially for services like DNS. A crucial step in solving this issue is utilizing a test site as a “real” machine where you can implement and assess all your changes. 

    • Infrastructure as Code (IaC): When using tools like Terraform or Ansible, it’s crucial to test fundamental operations like provisioning and deprovisioning machines. There are circumstances where a machine will need to be re-provisioned. In these cases, we want to ensure the machine is not accidentally deleted and that we retain the ability to create a new one if needed.
    • End-to-End (E2E): Begin directing some network traffic to these servers. Then the team can observe host behavior by directly interacting with it, or we can evaluate functionality by diverting a small portion of traffic.
    • Self-healing: We want to test the platform’s ability to recover from unexpected loads and identify bottlenecks before they impact our users. Early identification of bottlenecks or bugs is crucial for maintaining the health of our platform.

    Ideally changes will be implemented on a host-by-host basis once testing is complete. This approach allows for individual machine rollback and prevents changes from being applied to unaffected hosts.

    What to remember

    Platform engineering can be difficult. The systems GitHub operates with are complex and there are a lot of services and moving parts. However, there’s nothing like seeing everything come together. All the hard work our engineering teams do behind the scenes really pays off when the platform is running smoothly and teams are able to ship faster and more reliably — which allows GitHub to be the home to all developers.

    Want to dive deeper? Check out our infrastructure related blog posts.

    The post How GitHub engineers tackle platform problems appeared first on The GitHub Blog.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleUbuntu 24.10 Support Ends July 10th – Upgrade Soon
    Next Article CVE-2025-4678 – Pandora ITSM OS Command Injection

    Related Posts

    News & Updates

    OneDrive for Mac will soon give you more flexible storage options

    June 13, 2025
    News & Updates

    From The Editor’s Desk — new Windows Central community features, we’d like to hear from you!

    June 13, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    CVE-2025-48054 – Radashi Prototype Pollution Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    CVE-2025-5411 – Mist Community Edition Cross-Site Scripting Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    How to Create Accessible and User-Friendly Forms in React

    Development

    Microsoft will gradually retire SharePoint Alerts over the next two years

    Operating Systems

    Highlights

    The Mainframe Muggle Chronicles – Part 2: A Heretic Among Zealots

    May 22, 2025

    Note from the Editor: A Heretic Among Zealots At Planet Mainframe, we appreciate a fresh…

    Google DeepMind at ICLR 2024

    May 29, 2025

    Ghibli AI Image Generator

    April 16, 2025

    No Power Outage, Just a Data One: Nova Scotia Hit by Ransomware Surge

    May 26, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.