
    Amazon Develops an AI Architecture that Cuts Inference Time 30% by Activating Only Relevant Neurons

    July 29, 2025

    Amazon researchers developed a new AI architecture that cuts inference time by 30% by selecting only task-relevant neurons, similar to how the brain uses specialized regions for specific tasks. This approach addresses one of the biggest challenges facing large AI models: the computational expense and latency of activating every neuron for every request, regardless of its relevance.

    The traditional deployment of large language models (LLMs) and other foundation models has relied on activating the full network for every input. While this guarantees versatility, it results in significant inefficiency: much of the network’s activity is superfluous for any given prompt. Inspired by the efficiency of the human brain, which flexibly recruits only the circuits it needs for a given cognitive task, Amazon’s architecture activates only the neurons most relevant to the current input context.

    Dynamic, Context-Aware Pruning

    At the heart of this innovation is dynamic, context-aware pruning. Rather than trimming the model statically during training and locking in those changes, Amazon’s solution prunes the network “on the fly,” during inference itself. This lets the model remain large and versatile, yet efficient and fast for any specific task.

    • Before processing an input, the model evaluates which neurons or modules will be most useful, based on signals such as the type of task (e.g., legal writing, translation, or coding assistance), language, and other context features.
    • It leverages a gate predictor, a lightweight neural component trained to generate a “mask” that determines which neurons are switched on for that particular sequence (a minimal sketch follows this list).
    • The gating decisions are binary, so neurons are either fully active or completely skipped, ensuring real compute savings.
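
    To make this concrete, here is a minimal, hypothetical sketch of such a gate predictor in PyTorch. The module names, dimensions, and the simple sign-thresholding below are illustrative assumptions, not details taken from Amazon’s paper:

```python
import torch
import torch.nn as nn

class GatePredictor(nn.Module):
    """Hypothetical lightweight gate predictor (illustrative only).

    Maps a pooled context vector (e.g., task, language, and input features)
    to a binary on/off mask over N prunable modules.
    """

    def __init__(self, context_dim: int, num_modules: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_modules),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        logits = self.net(context)        # (batch, num_modules)
        # Hard 0/1 decisions at inference: each module is either run or skipped.
        return (logits > 0).float()

# Computed once per sequence, before the main model starts processing.
predictor = GatePredictor(context_dim=256, num_modules=24)
mask = predictor(torch.randn(1, 256))     # e.g., tensor([[1., 0., 1., ...]])
```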

    How the System Works

    The architecture introduces a context-aware gating mechanism. This mechanism analyzes input features (and, for speech models, auxiliary information such as language and task tokens) to decide which modules—such as self-attention blocks, feed-forward networks, or specialized convolutions—are essential for the current step. For example, in a speech recognition task, it may activate local context modules for detailed sound analysis while skipping unnecessary components that are only beneficial for other tasks.

    This pruning strategy is structured and modular: instead of removing individual weights (which can lead to hardware inefficiency), it skips entire modules or layers. This preserves the model’s structural integrity and ensures compatibility with GPUs and modern hardware accelerators.
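
    A rough sketch of what module-level skipping could look like inside a Transformer-style layer is shown below; the layer structure and the boolean gating interface are assumptions for illustration, not the paper’s actual code. When a block’s gate is off, its entire computation is bypassed and the residual path carries the activations through unchanged:

```python
import torch
import torch.nn as nn

class GatedTransformerLayer(nn.Module):
    """Illustrative layer with structured gating: whole sub-modules are
    skipped, not individual weights, so the remaining kernels stay dense
    and hardware-friendly. (Structure and naming are assumptions.)"""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, attn_on: bool, ffn_on: bool) -> torch.Tensor:
        if attn_on:                        # run or skip the whole attention block
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)
        if ffn_on:                         # run or skip the whole feed-forward block
            x = self.norm2(x + self.ffn(x))
        return x                           # skipped blocks cost nothing at inference
```

    Because each gate removes a whole block of dense matrix multiplications, the savings translate directly into fewer FLOPs and lower latency on real hardware, unlike unstructured weight sparsity.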

    The gate predictor is trained with a sparsity loss to achieve a target sparsity: the proportion of modules skipped. Training uses techniques such as the Gumbel-Softmax estimator, which keeps the gating behavior differentiable during optimization while still yielding crisp, binary module selection at inference.
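
    The snippet below sketches how that training signal could be wired up using PyTorch’s built-in Gumbel-Softmax; the specific loss formulation (a squared penalty toward a target skip rate) is a generic choice for illustration and not necessarily the one used by the researchers:

```python
import torch
import torch.nn.functional as F

def sample_gates(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable 0/1 gates via straight-through Gumbel-Softmax.

    `logits` has shape (batch, num_modules, 2): per-module scores for
    [skip, keep]. With hard=True the forward pass is one-hot (crisp
    decisions), while gradients flow through the soft relaxation.
    """
    samples = F.gumbel_softmax(logits, tau=tau, hard=True)   # (B, M, 2)
    return samples[..., 1]                                    # 1.0 where "keep" was sampled

def sparsity_loss(gates: torch.Tensor, target_sparsity: float = 0.5) -> torch.Tensor:
    """Penalize deviation of the fraction of skipped modules from the target."""
    actual_sparsity = 1.0 - gates.mean()
    return (actual_sparsity - target_sparsity) ** 2

# In a real loop this term would be added to the usual task loss.
logits = torch.randn(4, 24, 2, requires_grad=True)
gates = sample_gates(logits)
loss = sparsity_loss(gates)
loss.backward()
```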

    Demonstrated Results: Speed Without Sacrificing Quality

    Experiments show that dynamically skipping irrelevant modules can:

    • Reduce inference time by up to 34% for multilingual speech-to-text / automatic speech recognition (ASR) tasks: where typical baseline models took 9.28 s, pruned models ran in as little as 5.22 s, depending on the task and the desired sparsity level.
    • Decrease FLOPs (floating-point operations) by over 60% at high sparsity levels, greatly lowering cloud and hardware costs.
    • Maintain output quality: Pruning the decoder in particular preserves BLEU scores (for translation tasks) and Word Error Rate (WER) for ASR up to moderate sparsity, meaning users see no drop in model performance until very aggressive pruning is applied.
    • Provide interpretability: Analyzing pruned module patterns reveals which parts of the model are essential for each context—local context modules dominate in ASR, while feed-forward networks are prioritized for speech translation.

    Task and Language Adaptation

    A core insight is that optimal pruning strategies—meaning which modules to retain or skip—can change dramatically depending on the task and language. For instance:

    • In ASR, the importance of local context modules (cgMLP) is paramount, while the decoder can be sparsified heavily with little accuracy loss.
    • For speech translation (ST), both the encoder and the decoder require more balanced attention, as the decoder’s feed-forward layers are essential.
    • In multilingual or multitask scenarios, module selection adapts but shows consistent patterns within each task type, highlighting the learned specialization within the architecture (see the sketch below).
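
    Assuming the per-sequence gate masks can be logged at inference time (an assumption made here for illustration), surfacing these patterns can be as simple as averaging how often each module is kept per task:

```python
import torch

def module_keep_rates(masks_by_task: dict[str, list[torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Average per-module keep rate for each task.

    masks_by_task maps a task name (e.g., "asr", "st") to a list of binary
    gate masks of shape (num_modules,) collected during inference.
    """
    return {task: torch.stack(masks).float().mean(dim=0)
            for task, masks in masks_by_task.items()}

# Toy example with 4 hypothetical modules: the first two dominate for ASR,
# the last two for speech translation.
masks = {
    "asr": [torch.tensor([1, 1, 0, 1]), torch.tensor([1, 1, 0, 0])],
    "st":  [torch.tensor([0, 1, 1, 1]), torch.tensor([1, 0, 1, 1])],
}
print(module_keep_rates(masks))
# {'asr': tensor([1.0, 1.0, 0.0, 0.5]), 'st': tensor([0.5, 0.5, 1.0, 1.0])}
```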

    Broader Implications

    This dynamic, modular pruning opens the door to:

    • More energy-efficient, scalable AI—especially vital as LLMs and multimodal models continue to grow.
    • AI models that can personalize their compute pathways—not only by task but potentially by user profile, region, or device.
    • Transferability to other domains, such as natural language processing and computer vision, wherever foundation models are used.

    By selectively activating only task-relevant modules in real time, inspired by biological neural efficiency, Amazon’s architecture points the way toward AI that is both powerful and practical for global, real-world use.


    Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.

    The post Amazon Develops an AI Architecture that Cuts Inference Time 30% by Activating Only Relevant Neurons appeared first on MarkTechPost.

