Bioinformatics

Generative AI: Text generation using Long short-term memory (LSTM) model

In the world of deep learning, generating sequence data is a fundamental task. Typically, this involves training a network, often an RNN (Recurrent Neural Network) or a convnet (Convolutional Neural Network), to predict the next token or a sequence of tokens in a given sequence, using the preceding tokens as input. For example, when provided with the input “the cat is on the ma,” the network’s objective is to predict the next character, such as ‘t’.
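The next-character setup described above can be sketched in a few lines of Python. This is only an illustration of how training pairs are usually built from raw text; the function name `make_next_char_pairs` and the `window` parameter are assumptions made for this sketch, not code from the post.

```python
def make_next_char_pairs(text, window):
    """Return (context, next_char) pairs from a sliding window over text."""
    pairs = []
    # Each window of `window` characters is an input; the character
    # immediately after it is the target the network should predict.
    for start in range(len(text) - window):
        context = text[start:start + window]
        target = text[start + window]
        pairs.append((context, target))
    return pairs


# With a 20-character window, the example sentence yields exactly one pair.
pairs = make_next_char_pairs("the cat is on the mat", window=20)
print(pairs)  # [('the cat is on the ma', 't')]
```

In practice each character would then be mapped to an integer index or a one-hot vector before being fed to an LSTM.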

Omics Playground: Derive biological insights from your omics data at your fingertips

Disclaimer: This post is sponsored by the BigOmics platform. I have personally tested the platform. The opinions and views expressed in this post are solely those of the author and do not represent the views of my employer.
A brief description of the platform
What challenges could the platform solve?
The BigOmics platform, Omics Playground, provides a simplified approach for the effective processing of bulk RNA-seq data and proteomics data, resolving many issues experienced by scientists in the field.

How to use a 1D convolutional neural network (conv1d) to predict DNA sequence binding to protein

In the mysterious world of DNA, where the secrets of life are encoded, scientists are harnessing the power of cutting-edge technology to decipher the language of genes. One of the remarkable tools they’re using is the 1D Convolutional Neural Network, or 1D CNN, which might sound like jargon from a sci-fi movie, but it’s actually a game-changer in DNA sequence analysis. Imagine DNA as a long, intricate string of letters, like a never-ending alphabet book.
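To make the "string of letters" picture concrete, here is a minimal pure-Python sketch of what a single 1D convolution filter does to a DNA sequence: the sequence is one-hot encoded and the filter is slid along it. The helper names and the hand-set `TATA` filter are assumptions for illustration only; in a real 1D CNN the filter weights are learned during training, and a library such as Keras or PyTorch would do the convolution.

```python
BASES = "ACGT"


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    rows = []
    for base in seq.upper():
        # Unknown bases such as N become an all-zero row (an assumption of this sketch).
        rows.append([1.0 if base == b else 0.0 for b in BASES])
    return rows


def conv1d_valid(encoded, kernel):
    """Slide one filter along the sequence with stride 1 and no padding."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


# A hand-set filter that responds most strongly to the motif TATA.
motif_filter = one_hot_encode("TATA")
scores = conv1d_valid(one_hot_encode("GGTATACC"), motif_filter)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores)  # [2.0, 0.0, 4.0, 0.0, 2.0]
print(best)    # 2, the position where TATA starts
```

The score at each position is simply the number of bases that match the filter, which is why the peak sits where the motif begins.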

Multi-omics data integration: a case study with transcriptomics and genomics mutation data

Multi-omics data analysis is a cutting-edge approach in biology that involves studying and integrating information from multiple biological “omics” sources. These omics sources include genomics (genes and their variations), transcriptomics (gene expression and RNA data), proteomics (proteins and their interactions), metabolomics (small molecules and metabolites), epigenomics (epigenetic modifications), and more. By analyzing data from various omics levels together, we can gain a more comprehensive and detailed understanding of biological systems and their complexities.

Neighborhood/cellular niches analysis with spatial transcriptome data in Seurat and Bioconductor

Spatial transcriptome cellular niche analysis using 10x Xenium data
Go to https://www.10xgenomics.com/resources/datasets
There is a lung cancer and a breast cancer dataset. Let’s work on the lung cancer one.
https://www.10xgenomics.com/resources/datasets/xenium-human-lung-preview-data-1-standard
37G zipped file!
wget https://s3-us-west-2.amazonaws.com/10x.files/samples/xenium/1.3.0/Xenium_Preview_Human_Lung_Cancer_With_Add_on_2_FFPE/Xenium_Preview_Human_Lung_Cancer_With_Add_on_2_FFPE_outs.zip
unzip Xenium_Preview_Human_Lung_Cancer_With_Add_on_2_FFPE_outs.zip
sudo tar xvzf cell_feature_matrix.tar.gz
cell_feature_matrix/
cell_feature_matrix/barcodes.tsv.gz
cell_feature_matrix/features.tsv.gz
cell_feature_matrix/matrix.mtx.gz
Read in the data with Seurat
We really only care about the cell by gene count matrix which is inside the cell_feature_matrix folder, and the cell location x,y coordinates: cells.
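The excerpt stops before showing how niches are defined, so the following is only a sketch of one common idea, assumed here rather than taken from the post: for every cell, look at its k nearest neighbors in x,y space and count their cell types. It is written in plain Python rather than Seurat, and the cell IDs, coordinates, and cell-type labels are all hypothetical toy values.

```python
import math
from collections import Counter


def neighborhood_composition(cells, k):
    """For each cell, count the cell types of its k nearest neighbors.

    cells: list of (cell_id, x, y, cell_type) tuples.
    Returns {cell_id: Counter of neighbor cell types}.
    """
    result = {}
    for cell_id, x, y, _ in cells:
        distances = []
        for other_id, ox, oy, other_type in cells:
            if other_id == cell_id:
                continue  # a cell is not its own neighbor
            distances.append((math.hypot(x - ox, y - oy), other_id, other_type))
        distances.sort()
        result[cell_id] = Counter(t for _, _, t in distances[:k])
    return result


# Hypothetical toy data: two small groups of cells far apart from each other.
toy_cells = [
    ("c1", 0, 0, "tumor"),
    ("c2", 1, 0, "tumor"),
    ("c3", 0, 1, "T cell"),
    ("c4", 10, 10, "fibroblast"),
    ("c5", 11, 10, "fibroblast"),
    ("c6", 10, 11, "T cell"),
]
composition = neighborhood_composition(toy_cells, k=2)
print(composition["c3"])  # Counter({'tumor': 2})
```

This brute-force version compares every pair of cells, so it only suits toy inputs; a dataset of this size would need a spatial index or a dedicated nearest-neighbor library. The resulting per-cell count vectors are the kind of input that could then be clustered into niches.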

scRNAseq clustering significance test: an unsolvable problem?

Introduction
In scRNA-seq data analysis, one of the most crucial and demanding tasks is determining the optimal resolution and cluster number. Achieving an appropriate balance between over-clustering and under-clustering is often intricate, as it directly impacts the identification of distinct cell populations and biological insights. Clustering algorithms have many parameters to tune, and they can generate more clusters if, e.g., you increase the resolution parameter. However, whether the newly generated clusters are meaningful is an open question.

Navigating Variant Calling for Disease-Causing Mutations: The state-of-the-art process

Disclaimer: This post is sponsored by the Watershed Omics Bench platform. I have personally tested the platform. The opinions and views expressed in this post are solely those of the author and do not represent the views of my employer.
Variant calling is the process of identifying and categorizing genetic variants in sequencing data. It is a critical step in the analysis of whole-genome sequencing (WGS) and whole-exome sequencing (WES) data, as it allows researchers to identify potential disease-causing mutations.

Reuse the single cell data! How to create a Seurat object from GEO datasets

Download the data
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256
cd data/GSE116256
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE116nnn/GSE116256/suppl/GSE116256_RAW.tar
tar xvf GSE116256_RAW.tar
rm GSE116256_RAW.tar
What to do next depends on how the authors uploaded their data. Some authors may just upload the merged count matrix file. This is the easiest situation. In this dataset, each sample has a separate set of matrix (*dem.txt.gz), features and barcodes. In total, there are 43 samples. Each sample also has an associated metadata file (*anno.txt.gz). You can inspect the files on the command line:
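Because every sample comes with its own matrix and metadata file, a useful first step is to pair them up and check that nothing is missing. The Python sketch below does that from a list of file names. The two suffixes come from the text above, but the `GSM<number>_` prefix handling and the example file names are assumptions made for illustration; check them against the real file listing before relying on this.

```python
import re
from collections import defaultdict

# Suffixes named in the post; the labels on the right are this sketch's own.
SUFFIXES = {".dem.txt.gz": "counts", ".anno.txt.gz": "metadata"}


def pair_sample_files(filenames):
    """Group count-matrix and metadata files by sample name."""
    samples = defaultdict(dict)
    for name in filenames:
        for suffix, kind in SUFFIXES.items():
            if name.endswith(suffix):
                stem = name[: -len(suffix)]
                # Assumption: files may carry a GEO accession prefix such as GSM123_.
                sample = re.sub(r"^GSM\d+_", "", stem)
                samples[sample][kind] = name
    return dict(samples)


# Hypothetical file names, not the real contents of GSE116256_RAW.tar.
example_files = [
    "GSM0000001_sampleA.dem.txt.gz",
    "GSM0000002_sampleA.anno.txt.gz",
    "GSM0000003_sampleB.dem.txt.gz",
]
paired = pair_sample_files(example_files)
incomplete = [s for s, files in paired.items() if set(files) != {"counts", "metadata"}]
print(paired["sampleA"])
print(incomplete)  # ['sampleB'] because its metadata file is missing
```

On the real download, the same function could be called with `os.listdir("data/GSE116256")`; if the post's count of 43 samples holds, `paired` should have 43 keys and `incomplete` should be empty.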

10 single-cell data benchmarking papers

I tweeted it at https://twitter.com/tangming2005/status/1679120948140572672
I got asked to put all my posts in a central place and I think it is a good idea. And here it is!
Benchmarking integration of single-cell differential expression
Benchmarking atlas-level data integration in single-cell genomics
A review of computational strategies for denoising and imputation of single-cell transcriptomic data
Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution

How to add boxplots or density plots side-by-side a scatterplot: a single cell case study

Introduce ggside using single cell data
The ggside R package provides a new way to visualize data by combining the flexibility of ggplot2 with the power of side-by-side plots. We will use a single cell dataset to demonstrate its usage. ggside allows users to create side-by-side plots of multiple variables, such as gene expression, cell type, and experimental conditions. This can be helpful for identifying patterns and trends in scRNA-seq data that would be difficult to see in individual plots.