
    MoMA: An Open-Vocabulary and Training Free Personalized Image Model that Boasts Flexible Zero-Shot Capabilities

    April 12, 2024

Modern image-generation tools have come a long way thanks to large-scale text-to-image diffusion models such as GLIDE, DALL-E 2, Imagen, Stable Diffusion, and eDiff-I, which let users create realistic pictures from a variety of textual cues. Because textual descriptions, while effective, frequently fail to convey detailed visual features, image-conditioned generation works such as Kandinsky and Stable Unclip emerged: they take images as inputs and generate variations that retain the visual components of the reference.

Image personalization, or subject-driven generation, is the next logical step in this area. Early attempts used learnable text tokens to represent target concepts, effectively converting input photos to text. Despite their accuracy, however, the substantial resources needed for instance-specific tuning and model storage severely restrict the practicality of these approaches. To overcome these constraints, tuning-free methods have become more popular; yet while effective at modifying textures, they frequently produce detail defects and require further tuning to achieve ideal results on target objects.

A recent study by ByteDance and Rutgers University presents MoMA, an open-vocabulary, tuning-free model for rapid image personalization with text-to-image diffusion models. It overcomes these issues by faithfully following textual prompts while preserving object identity with high detail fidelity.

    This approach consists of three parts:

First, the researchers use a generative multimodal decoder to extract the reference image’s features, then modify them according to the target prompt to obtain a contextualized image feature.

Meanwhile, they use the original UNet’s self-attention layers to extract the object image feature, after replacing the background of the original image with white and keeping only the object’s pixels.
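The background-whitening step above is straightforward to sketch. A minimal NumPy version, assuming the object mask is a boolean array of the same height and width as the image (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def white_background(image, mask):
    """Keep only the object's pixels; paint everything else white.

    image: (H, W, 3) uint8 array, mask: (H, W) bool array (True = object).
    """
    out = np.full_like(image, 255)  # all-white canvas
    out[mask] = image[mask]         # copy object pixels over the canvas
    return out

# Toy example: a 2x2 image whose top-left pixel is the "object".
img = np.array([[[10, 20, 30], [40, 50, 60]],
                [[70, 80, 90], [100, 110, 120]]], dtype=np.uint8)
obj = np.array([[True, False], [False, False]])
res = white_background(img, obj)
```

The masked image, not the raw one, is what gets passed through the UNet’s self-attention layers, so only object pixels contribute to the extracted feature.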

Lastly, they use the UNet diffusion model, with object cross-attention layers trained specifically for this purpose, to combine the object feature and the contextualized image feature and generate new images.
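The injection mechanism in the last step is cross-attention: the UNet’s hidden states act as queries while the extracted image features supply keys and values. A minimal single-head sketch in NumPy, with illustrative shapes and weight matrices (the paper’s actual layers are trained modules inside the diffusion UNet):

```python
import numpy as np

def cross_attention(hidden, img_feats, Wq, Wk, Wv):
    """Single-head cross-attention: UNet hidden states (queries) attend
    to image features (keys/values). hidden: (n, d), img_feats: (m, d)."""
    q, k, v = hidden @ Wq, img_feats @ Wk, img_feats @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n, m) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over image tokens
    return weights @ v                             # (n, d) injected feature

rng = np.random.default_rng(0)
d, n, m = 8, 4, 6
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(rng.standard_normal((n, d)),
                      rng.standard_normal((m, d)), Wq, Wk, Wv)
```

Because only these cross-attention layers are trained, the base diffusion model stays frozen, which is what makes the approach tuning-free at inference time.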

The team used the OpenImage-V7 dataset to build a training set of 282K image/caption/image-mask triplets. After generating captions with BLIP-2 OPT-6.7B, they removed samples whose captions mentioned human subjects or contained color, shape, or texture keywords.
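The caption-filtering step can be sketched as a simple keyword screen. The keyword lists below are hypothetical stand-ins, since the paper’s exact filter terms are not given here:

```python
# Hypothetical keyword lists -- illustrative, not the paper's actual terms.
HUMAN = {"man", "woman", "person", "people", "boy", "girl", "child"}
ATTRIBUTE = {"red", "blue", "green", "white", "black",   # color
             "round", "square", "triangular",            # shape
             "striped", "smooth", "furry"}               # texture

def keep_caption(caption):
    """Drop captions mentioning humans or color/shape/texture words."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    return words.isdisjoint(HUMAN | ATTRIBUTE)

captions = ["a dog sitting on grass", "a red car", "a man riding a bike"]
kept = [c for c in captions if keep_caption(c)]
```

Filtering out appearance keywords pushes the model to take those attributes from the reference image rather than the caption, which matches the subject-driven training objective.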

    The experimental results speak volumes about the MoMA model’s superiority. By harnessing the power of Multimodal Large Language Models (MLLMs), the model seamlessly combines the visual characteristics of the target object with text prompts, enabling changes to both the backdrop context and object texture. The suggested self-attention shortcut significantly enhances detail quality while imposing a minimal computational burden. The model’s expanded applicability is a testament to its potential, as it can be directly integrated with community models that have been fine-tuned using the same basic model, opening up new possibilities in the field of image generation and machine learning. 

Check out the Paper and Project. All credit for this research goes to the researchers of this project.

    The post MoMA: An Open-Vocabulary and Training Free Personalized Image Model that Boasts Flexible Zero-Shot Capabilities appeared first on MarkTechPost.
