AI-Powered Insights into Molecular Evolution: From Codon Usage to Gene Expression in Natural Environments

The study of evolution by natural selection at the molecular level has advanced significantly with the advent of genomic technologies. Traditionally, researchers have focused on observable traits like flowering time or growth. However, gene expression provides an intermediate phenotype that connects genomic data to these macroscopic traits, offering a deeper understanding of selection pressures. In a recent study involving Ivyleaf Morning Glory (*Ipomoea hederacea*), researchers utilized RNA sequencing to analyze gene expression under natural field conditions. The challenge of dealing with high-dimensional, small-sample-size data typical of transcriptomics was addressed using machine learning methods. These methods, known for their ability to handle complex, multivariate data, revealed that genes related to photosynthesis, stress response, and light response were crucial in predicting fitness. This demonstrates the potential of ML models to uncover important biological processes and genes under selection in natural environments, overcoming the limitations of traditional statistical approaches.

Additionally, the intricate patterns of codon usage, which vary significantly across and within species, are influenced by evolutionary selection. A study explored whether AI could predict codon sequences from given amino acid sequences in different organisms, including yeast and bacteria. The researchers used advanced AI models, specifically the mBART transformer-based architecture, to capture complex dependencies in codon usage that simple frequency-based methods fail to detect. Their findings indicate that AI can effectively learn and predict these codon patterns, particularly in highly expressed genes and longer proteins. This suggests that codon choice is influenced by evolutionary pressures related to protein expression and folding. This approach improves our understanding of codon bias and its impact on protein synthesis and provides a new tool for optimizing codon usage in biotechnology and synthetic biology applications.

Summary of Methods:

The study used NCBI coding sequences from S. cerevisiae, S. pombe, E. coli, and B. subtilis, divided into training, validation, and test sets. CD-HIT clustered the amino acid sequences, and each cluster was kept within a single set. BLAST identified similar sequences, and proteins were categorized by expression level. Codon prediction models included frequency-based methods and mBART models with varying configurations. The training protocol featured pretraining and fine-tuning with specific hyperparameters. Fixed-size windows were applied during inference, and predictions were averaged across windows. Accuracy and perplexity measured model performance against the true codon sequences.
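To make the frequency-based baseline and the two metrics concrete, here is a minimal Python sketch. It assumes the simplest form of such a baseline (always pick the codon seen most often for each amino acid in the training data) and uses made-up toy data; the function names are hypothetical, and the study's actual baselines may differ.

```python
from collections import Counter, defaultdict
import math


def fit_codon_frequencies(pairs):
    """Estimate P(codon | amino acid) from observed (amino_acid, codon) pairs."""
    counts = defaultdict(Counter)
    for aa, codon in pairs:
        counts[aa][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }


def predict_most_frequent(freqs, protein):
    """Pick the most frequent training codon for each amino acid."""
    return [max(freqs[aa], key=freqs[aa].get) for aa in protein]


def accuracy(predicted, true_codons):
    """Fraction of positions where the predicted codon matches the true one."""
    return sum(p == t for p, t in zip(predicted, true_codons)) / len(true_codons)


def perplexity(freqs, protein, true_codons):
    """exp of the mean negative log-probability assigned to the true codons.

    No smoothing is applied, so a codon never seen in training raises KeyError.
    """
    nll = -sum(math.log(freqs[aa][c]) for aa, c in zip(protein, true_codons))
    return math.exp(nll / len(true_codons))


# Toy training pairs (illustrative values only, not data from the study).
train = (
    [("K", "AAA")] * 3 + [("K", "AAG")]
    + [("E", "GAA")] * 3 + [("E", "GAG")]
)
freqs = fit_codon_frequencies(train)

protein = "KEK"
true_codons = ["AAA", "GAG", "AAA"]
predicted = predict_most_frequent(freqs, protein)

acc = accuracy(predicted, true_codons)
ppl = perplexity(freqs, protein, true_codons)
print(predicted, acc, ppl)
```

A baseline like this ignores sequence context entirely, which is the gap the mBART models are meant to close.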

Training and Evaluation of mBART Models:

mBART models were trained to predict codon sequences from amino acid sequences in two modes: masking and mimicking. In masking mode, the model predicts codons from the amino acid sequence alone; in mimicking mode, it predicts codons based on those of an orthologous protein from a different organism. The mimicking approach rests on the hypothesis that codon choice influences the translation elongation rate, which is critical for co-translational protein folding. Training datasets consisted of S. cerevisiae, S. pombe, E. coli, and B. subtilis proteins, divided into training, validation, and test sets with no amino acid sequence overlap between training and test sets. In evaluation, mBART models generally outperformed frequency-based baselines, especially for proteins with higher expression levels. This suggests that mBART can learn and exploit long-range interactions among codons that frequency-based methods miss.

Accuracy of Masking and Mimicking Predictions:

The mBART models’ masking-mode predictions were more accurate than those of frequency-based methods, demonstrating the ability to capture complex patterns in codon usage. Different window sizes were tested, with the 30-codon window model performing best. Although mimicking-mode predictions were slightly less accurate than masking-mode predictions, they still showed potential, especially in eukaryotic organisms and for highly conserved orthologous segments. The mBART models’ performance did not significantly benefit from sequence similarities between training and test sets, indicating robust learning of codon usage patterns. Additionally, the models’ accuracy varied across proteins with different expression levels and molecular functions, with notable improvements for proteins involved in ribosomal functions, nucleic acid binding, and catalytic activities in S. cerevisiae and E. coli.
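The windowed inference described in the methods summary can be sketched as follows. This is an assumption-laden illustration in Python, not the authors' implementation: the stride, the end-aligned final window, and the `toy_predict_window` stand-in for a trained model are all hypothetical. It only shows how per-position codon distributions from overlapping fixed-size windows could be averaged.

```python
from collections import defaultdict


def windows(length, size, stride):
    """Yield (start, end) spans of a fixed size covering positions 0..length-1.

    Assumes stride <= size so that no position is skipped.
    """
    if length <= size:
        yield 0, length
        return
    start = 0
    while start + size < length:
        yield start, start + size
        start += stride
    yield length - size, length  # final window is aligned to the sequence end


def average_over_windows(protein, size, stride, predict_window):
    """Average the codon distributions of all windows that cover each position."""
    sums = [defaultdict(float) for _ in protein]
    hits = [0] * len(protein)
    for start, end in windows(len(protein), size, stride):
        dists = predict_window(protein[start:end])  # one dict per position
        for offset, dist in enumerate(dists):
            for codon, p in dist.items():
                sums[start + offset][codon] += p
            hits[start + offset] += 1
    return [{c: s / n for c, s in pos.items()} for pos, n in zip(sums, hits)]


def toy_predict_window(segment):
    """Hypothetical stand-in for a model: more confident at the window start."""
    out = []
    for i, _aa in enumerate(segment):
        p = 0.9 if i == 0 else 0.6
        out.append({"AAA": p, "AAG": 1 - p})
    return out


avg = average_over_windows("KKKK", size=3, stride=1, predict_window=toy_predict_window)
for position, dist in enumerate(avg):
    print(position, dist)
```

Positions covered by more than one window receive the mean of the overlapping predictions, which smooths out effects that depend on where a residue falls within a window.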

Methods of the Gene Expression Study:

Tissue was collected from Ipomoea hederacea, an annual vine distributed across the eastern USA. A field experiment involved planting 100 individuals from 56 populations in a glasshouse and transplanting them to a field. Soil samples were analyzed for heavy metals a year later. Leaf tissue was collected after 71 days, and mRNA was extracted and sequenced. Data processing included aligning reads to the Ipomoea nil genome, transforming gene counts, and filtering low-expression genes. Analytical methods involved principal component regression and supervised modeling using neural networks and gradient tree boosting. Important genes were identified, and GO term enrichment analysis was conducted using Blast2Go and goseq.
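The count-processing steps mentioned above (filtering low-expression genes and transforming gene counts) might look like the Python sketch below. The article does not state which transformation or thresholds were used, so the log2 counts-per-million transform, the `min_count` and `min_samples` cutoffs, and the toy counts are all assumptions chosen for illustration.

```python
import math


def filter_low_expression(counts_by_sample, min_count=10, min_samples=2):
    """Keep genes with at least min_count reads in at least min_samples samples."""
    genes = list(next(iter(counts_by_sample.values())))
    keep = [
        g for g in genes
        if sum(counts[g] >= min_count for counts in counts_by_sample.values())
        >= min_samples
    ]
    return {s: {g: counts[g] for g in keep} for s, counts in counts_by_sample.items()}


def log_cpm(counts_by_sample):
    """Convert raw counts to log2(counts-per-million + 1) within each sample."""
    out = {}
    for sample, counts in counts_by_sample.items():
        total = sum(counts.values())
        out[sample] = {g: math.log2(c / total * 1e6 + 1) for g, c in counts.items()}
    return out


# Toy read counts (illustrative values only): three samples, three genes.
raw = {
    "A": {"g1": 500, "g2": 0, "g3": 500},
    "B": {"g1": 300, "g2": 2, "g3": 698},
    "C": {"g1": 200, "g2": 1, "g3": 799},
}

filtered = filter_low_expression(raw)
logged = log_cpm(filtered)
print(sorted(filtered["A"]))
print(logged["A"])
```

Filtering before modeling matters here because transcriptomic data has far more genes than samples, and genes with near-zero counts add noise without adding signal.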

Insights from AI-Driven Codon Prediction and Gene Expression Analysis:

AI models have been applied both to predict codon usage across various organisms (mBART) and to analyze gene expression’s impact on fitness (neural networks and gradient tree boosting). The codon models highlight significant correlations between codon usage and protein expression, evolutionary conservation, and functional attributes: high-expression genes and conserved proteins exhibit more predictable codon patterns. The machine learning approaches, in turn, identify gene expression patterns related to fitness, particularly in genes associated with stress response and reproductive development. Together, these results underscore the utility of AI in decoding complex biological sequences and in advancing our understanding of evolutionary biology and gene regulation.

Sources:

https://www.biorxiv.org/content/10.1101/2024.02.14.580307v1.full.pdf

https://www.biorxiv.org/content/10.1101/2024.02.11.579798v2

The post AI-Powered Insights into Molecular Evolution: From Codon Usage to Gene Expression in Natural Environments appeared first on MarkTechPost.
