Summary
- CircRNA databases have significant disparities, prompting the manual curation of a dataset of back-splicing (BS) and linear splicing (LS) exon pairs through data integration.
- The BS exon pairs dataset was constructed by combining data from the five largest circRNA databases and filtering based on exon boundaries.
- To ensure high confidence in identifying LS exon pairs, available LS data for the human genome were leveraged to create the LS exon pairs dataset.
- A total of 15,700 common transcripts were identified between LS and BS exon pair datasets for comparison.
- Numerical methods were developed to detect reverse complementary matches (RCMs) and incorporated into models to classify BS and LS exon pairs.
Researchers have made significant progress in understanding the biogenesis of circular RNA (circRNA), a type of RNA that forms a closed loop. In a recent study, scientists curated a dataset of exon pairs by integrating data from several circRNA databases. The datasets were constructed to identify the exon pairs involved in back-splicing (BS) and linear splicing (LS).
To create the BS exon pairs dataset, data from large circRNA databases were combined. These databases contained a vast number of circRNAs based on genomic information. The researchers focused on circRNAs involving two or more exons and removed duplicates to refine the dataset. Additionally, they converted circRNA coordinates to ensure consistency and accuracy in their analysis.
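The filtering and de-duplication step described above can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the `CircRecord` fields, the `source` labels, and the choice of (chromosome, start, end, strand) as the duplicate key are all assumptions, and coordinates are assumed to have already been converted to a common reference.

```python
from typing import NamedTuple


class CircRecord(NamedTuple):
    """Hypothetical record layout for one circRNA entry."""
    chrom: str
    start: int
    end: int
    strand: str
    exon_count: int
    source: str  # name of the database the entry came from (illustrative)


def merge_circ_records(records):
    """Keep circRNAs spanning two or more exons and drop duplicates.

    Two entries are treated as duplicates when they share chromosome,
    start, end, and strand (an assumption made for this sketch).
    """
    seen = set()
    merged = []
    for rec in records:
        if rec.exon_count < 2:
            continue  # single-exon circRNAs are outside the dataset
        key = (rec.chrom, rec.start, rec.end, rec.strand)
        if key in seen:
            continue  # already reported by another database
        seen.add(key)
        merged.append(rec)
    return merged
```

The first occurrence of each circRNA is kept, so the order in which databases are concatenated decides which `source` label survives.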
For LS exon pairs, the researchers used a specialized database of splice sites in human tissues. By analyzing splice sites specific to normal tissues, they identified exon pairs likely to be involved in linear splicing.
To compare BS and LS exon pairs, the researchers identified common transcripts and differentiated unique exon pairs exclusive to each splicing type. They also analyzed the distribution of reverse complementary matches (RCMs), which play a role in circRNA formation, within different genomic regions.
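One way to picture the comparison is as a pair of set operations. The sketch below assumes each exon pair is a `(transcript_id, exon_a, exon_b)` tuple; that representation and the function name `compare_exon_pairs` are hypothetical and not taken from the study.

```python
def compare_exon_pairs(bs_pairs, ls_pairs):
    """Split exon pairs on shared transcripts into BS-only and LS-only sets.

    Each pair is a (transcript_id, exon_a, exon_b) tuple (assumed layout).
    """
    bs = set(bs_pairs)
    ls = set(ls_pairs)

    # Transcripts that appear in both datasets.
    common_transcripts = {p[0] for p in bs} & {p[0] for p in ls}

    # Restrict both datasets to those shared transcripts.
    bs_common = {p for p in bs if p[0] in common_transcripts}
    ls_common = {p for p in ls if p[0] in common_transcripts}

    return {
        "common_transcripts": common_transcripts,
        "bs_only": bs_common - ls_common,
        "ls_only": ls_common - bs_common,
    }
```

Restricting the comparison to shared transcripts keeps the two classes comparable, since every remaining BS pair has LS pairs from the same transcript to be contrasted with.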
The study also involved the construction of base models, which are deep learning models designed to classify BS and LS exon pairs based on their sequence characteristics. These models were trained and optimized using a machine learning framework.
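Sequence-based classifiers of this kind generally need the nucleotide sequence turned into numbers first. The summary does not say which encoding the base models use, so the following one-hot encoding with fixed-length padding is only an assumed, commonly used choice; `one_hot` and its `length` parameter are illustrative names.

```python
BASES = "ACGT"


def one_hot(seq, length):
    """Encode a DNA sequence as a length x 4 matrix of 0/1 values.

    The sequence is truncated or zero-padded to `length`. Characters
    other than A, C, G, T (for example N) become all-zero rows.
    """
    seq = seq.upper()[:length]
    rows = [[1 if base == b else 0 for b in BASES] for base in seq]
    # Pad short sequences with all-zero rows so every input has the same shape.
    rows.extend([0, 0, 0, 0] for _ in range(length - len(seq)))
    return rows
```

A fixed output shape matters because most deep learning layers expect inputs of identical dimensions across a batch.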
Furthermore, the researchers developed RCM models to incorporate information about RCM patterns related to circRNA formation. By integrating the base and RCM models, the researchers aimed to improve the accuracy of predicting circRNA biogenesis processes.
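To make the idea of a reverse complementary match concrete, here is a minimal k-mer based sketch. It is not the numerical method developed in the study: the exact-match criterion, the seed length `k`, and the function names are assumptions chosen for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_rcms(upstream, downstream, k=8):
    """Find exact reverse complementary k-mer matches between two sequences.

    Returns (i, j) pairs such that upstream[i:i+k] is the reverse
    complement of downstream[j:j+k].
    """
    upstream = upstream.upper()
    downstream = downstream.upper()

    # Index every k-mer of the downstream sequence by its positions.
    index = {}
    for j in range(len(downstream) - k + 1):
        index.setdefault(downstream[j:j + k], []).append(j)

    # Look up the reverse complement of each upstream k-mer in the index.
    matches = []
    for i in range(len(upstream) - k + 1):
        rc = reverse_complement(upstream[i:i + k])
        for j in index.get(rc, []):
            matches.append((i, j))
    return matches
```

In this picture, `upstream` and `downstream` would be the regions flanking a candidate exon pair, and the number or positions of matches could serve as features for an RCM model. Real RCMs may tolerate mismatches, which this exact-match sketch does not handle.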
Overall, this research provides valuable insights into the mechanisms underlying circRNA formation and highlights the complexity of RNA splicing processes in the human genome. The findings contribute to our understanding of RNA biology and could have implications for future studies on gene regulation and disease mechanisms.