Annotating KEGG compounds to pathway

To annotate a list of KEGG compounds with the KEGG pathways they are involved in, I used the R package KEGGREST from Bioconductor. library(KEGGREST) So, having a list of KEGG compounds saved in a character vector such as kegg_compounds, we use the function keggGet in batches of at most 10 compounds to annotate them. The following (rudimentary) code queries the database in batches of ten compounds, filling a list (pathways) in which it creates one entry per pathway and updates the field compounds with the compounds from kegg_compounds that belong to each pathway.
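The batching idea can be sketched as follows. This is a minimal sketch, not the post's original code: the function name annotate_pathways and its fetch argument are hypothetical, and it assumes that each entry returned by keggGet carries an ENTRY field with the compound ID and a PATHWAY field that is a character vector named by pathway ID. The fetch argument exists only so the logic can be tried without querying KEGG.

```r
# Hypothetical helper: group KEGG compounds by the pathways they belong to.
# 'fetch' defaults to KEGGREST::keggGet; it is a parameter (an assumption of
# this sketch) so that a stand-in function can be used for offline testing.
annotate_pathways <- function(kegg_compounds,
                              fetch = KEGGREST::keggGet,
                              batch_size = 10) {
  pathways <- list()
  # Split the compound IDs into batches of at most 'batch_size' elements
  batches <- split(kegg_compounds,
                   ceiling(seq_along(kegg_compounds) / batch_size))
  for (batch in batches) {
    for (entry in fetch(batch)) {
      compound <- unname(entry$ENTRY[1])
      # Assumed layout: names(PATHWAY) are pathway IDs, values are pathway names
      for (id in names(entry$PATHWAY)) {
        if (is.null(pathways[[id]])) {
          pathways[[id]] <- list(name = unname(entry$PATHWAY[[id]]),
                                 compounds = character(0))
        }
        pathways[[id]]$compounds <- c(pathways[[id]]$compounds, compound)
      }
    }
  }
  pathways
}
```

With KEGGREST installed, calling annotate_pathways(kegg_compounds) would query the database; compounds without a PATHWAY field simply contribute nothing to the list.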

minfi betas and residuals from methylation models

In the HELIX project we decided to use residuals instead of M values for the methylation analyses. So, how do we get the residuals of a basic linear model?
Libraries and Data
First of all, we load the libraries we will use in this test:
library( limma ) # We use lmFit to fit the linear model
library( minfi ) # Methylation data is saved as a GenomicRatioSet
library( SmartSVA ) # We want to compute the SVA to correct methylation data
library( isva ) # Same purpose as SmartSVA
library( Biobase ) # We will save the residuals in an ExpressionSet
Once the libraries are loaded we proceed to obtain the methylation data:
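As a minimal sketch of the residuals step, the block below uses simulated M values in place of real methylation data (an assumption, as are the covariates age and sex and all object names). It fits one linear model per probe with lmFit and then subtracts the fitted values through the residuals method that limma provides for its fit objects.

```r
library(limma)

set.seed(42)
n_probes  <- 20
n_samples <- 12

# Simulated M values (probes in rows, samples in columns); purely illustrative
M <- matrix(rnorm(n_probes * n_samples), nrow = n_probes,
            dimnames = list(paste0("cg", seq_len(n_probes)),
                            paste0("s", seq_len(n_samples))))

# Hypothetical covariates to adjust for
pheno <- data.frame(age = runif(n_samples, 5, 10),
                    sex = factor(rep(c("F", "M"), length.out = n_samples)),
                    row.names = colnames(M))

design <- model.matrix(~ age + sex, data = pheno)

# Fit one linear model per probe, then take observed minus fitted values
fit <- lmFit(M, design)
res <- residuals(fit, M)
```

The matrix res has the same dimensions as M, so it could be stored in an ExpressionSet in the same way as the original values.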

Exploring public NHANES data using Rcupcake

The Rcupcake package contains functions to query different databases through the BD2K RESTful API. The BD2K RESTful API is an interface that provides access to different data sources, which makes data access, reproducible analysis and scaling easier. The package is installed via devtools using its GitHub URL (hms-dbmi/Rcupcake). library( Rcupcake ) The Rcupcake package follows a four-step process to retrieve data from a database: (1) start a session, (2) select the variables of interest, (3) build the JSON query, and (4) run the query to obtain the data. The start.

Getting docker virtual environment IP in Windows 10

Docker is a full development platform for creating containerized apps. It is available for Windows, GNU/Linux and macOS. Unfortunately, for Windows users, the Docker version you can get depends on the Windows edition you are running.
Windows        Docker          Access
Windows Home   Docker Toolbox  link
Windows Pro    Docker          link
This is because Windows Home systems ship without Hyper-V.

Comparing 'user' Internet connection from some Catalan research centers

Using the same technique seen in the old post “Comparing ping time between connections”, I asked some colleagues to run the following command in their research centers. ping -c 200 > ping_google.txt So, I load the multiple ping files to create a data.frame with the icmp_seq number, the time spent per ping and the institution where the ping was run. ping <- lapply( files, function( file ) { dta <- read.
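A minimal sketch of how one ping file could be turned into such a data.frame, written independently of the code above. The helper name parse_ping is hypothetical, and it assumes the usual Linux ping output in which each reply line contains icmp_seq=N and time=X ms; the sample lines are invented for illustration.

```r
# Hypothetical helper: extract icmp_seq and round-trip time from ping output.
parse_ping <- function(lines, institution) {
  pattern <- "icmp_seq=([0-9]+).*time=([0-9.]+) ms"
  hits <- regmatches(lines, regexec(pattern, lines))
  # Keep only reply lines (full match plus the two captured groups)
  hits <- hits[lengths(hits) == 3]
  data.frame(
    icmp_seq    = as.integer(vapply(hits, `[`, character(1), 2)),
    time        = as.numeric(vapply(hits, `[`, character(1), 3)),
    institution = rep(institution, length(hits)),
    stringsAsFactors = FALSE
  )
}

# Invented sample output, including header and summary lines that are skipped
example_lines <- c(
  "PING example.org (192.0.2.1) 56(84) bytes of data.",
  "64 bytes from 192.0.2.1: icmp_seq=1 ttl=117 time=12.3 ms",
  "64 bytes from 192.0.2.1: icmp_seq=2 ttl=117 time=11.8 ms",
  "--- example.org ping statistics ---"
)
ping_example <- parse_ping(example_lines, institution = "center_A")
```

For real files, the lines would come from readLines on each file, and the per-file data.frames could be combined with do.call(rbind, ...).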

Extract paired-end reads from (NCBI) SRA files

SRA stores the sequencing data from GEO experiments in files in .sra format. These files are managed using the SRA Toolkit. I recently downloaded some .sra files from this GEO entry corresponding to paired-end sequencing data. To my surprise, when I ran the fastq-dump utility (from the SRA Toolkit) I got only one file rather than two. From the documentation of the tool, it seems that the option --split-files should be enough, but it is not.

Convert all .bam of a folder to .sam

Gene-Enrichment in PsyGeNET's Main-Psychiatric-Disorders

PsyGeNET is a database that integrates information on psychiatric disorders and their genes (check its About page for more information). The current version of the database focuses on three main psychiatric disorders: Alcoholism, Depression and Cocaine-Related-Disorders. Currently the author of PsyGeNET, Alba Gutiérrez, and I are developing an R package (PsyGeNET2R) to query the information stored in the database and to perform some analyses using this information. We thought it would be a good idea to perform an enrichment analysis on the three main psychiatric disorders given a list of genes of interest.

Understanding hypergeometric tests

Hypergeometric tests are useful for enrichment analysis. As far as I can see, the most common enrichment analysis is the one where people want to obtain a list of enriched GO terms given a list of genes. The hypergeometric test is equivalent to the one-tailed Fisher’s exact test, and expresses statistical confidence as a p-value. For example, given a shuffled poker deck with no jokers, we want to see whether a draw of five random cards is diamond-enriched:
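A minimal sketch of this example in base R, assuming a hypothetical hand with three diamonds out of five cards (the observed count is invented for illustration). It computes the probability of drawing at least that many diamonds with phyper and checks it against the one-tailed Fisher's exact test.

```r
deck_size <- 52   # cards in the deck, no jokers
diamonds  <- 13   # diamonds in the deck
hand_size <- 5    # cards drawn
observed  <- 3    # diamonds in the hand (hypothetical observation)

# P(X >= observed): phyper gives P(X > q) with lower.tail = FALSE,
# so we pass observed - 1
p_hyper <- phyper(observed - 1, diamonds, deck_size - diamonds, hand_size,
                  lower.tail = FALSE)

# The same question as a 2x2 table for the one-tailed Fisher's exact test
tab <- matrix(c(observed,            hand_size - observed,
                diamonds - observed, (deck_size - diamonds) - (hand_size - observed)),
              nrow = 2,
              dimnames = list(c("diamond", "other"), c("drawn", "not_drawn")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Both p_hyper and p_fisher hold the same value, which is the sense in which the two tests are equivalent.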

Playing with TCGA .CEL files and TCGA Barcodes

Today I want a file relating the names of the .CEL files from TCGA, the barcodes of these samples and the definition of the sample type in its three available forms (numeric, short and description). For example:
filename                  barcode                       sampletype_numeric  sampletype_short  sampletype_desc
TCGA_666_A01_0070X01.CEL  TCGA-ZZ-A6AW-01A-01A-X00D-AB  01                  TP                Primary solid Tumor
TCGA_666_A02_0070X01.CEL  TCGA-ZZ-A6AW-10A-01A-X00G-AB  10                  NB                Blood Derived Normal
TCGA_666_A03_0070X01.CEL  TCGA-ZZ-A6AW-01A-01A-X00D-AB  01                  TP                Primary solid Tumor
In order to provide a reusable way to create this file I wrote the following function in R:
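A minimal sketch of what such a function could look like, not the original one. It assumes the filename-to-barcode mapping is already known, reads the sample type from the first two characters of the fourth barcode field, and uses a deliberately partial lookup table; the names sample_types and annotate_sample_type are hypothetical.

```r
# Partial lookup table of TCGA sample type codes (only three codes included)
sample_types <- data.frame(
  sampletype_numeric = c("01", "10", "11"),
  sampletype_short   = c("TP", "NB", "NT"),
  sampletype_desc    = c("Primary solid Tumor", "Blood Derived Normal",
                         "Solid Tissue Normal"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the table from filenames and their barcodes
annotate_sample_type <- function(filename, barcode) {
  fields <- strsplit(barcode, "-", fixed = TRUE)
  # Fourth field is sample type plus vial, e.g. "01A"; keep the two digits
  code <- substr(vapply(fields, `[`, character(1), 4), 1, 2)
  idx  <- match(code, sample_types$sampletype_numeric)
  data.frame(
    filename           = filename,
    barcode            = barcode,
    sampletype_numeric = code,
    sampletype_short   = sample_types$sampletype_short[idx],
    sampletype_desc    = sample_types$sampletype_desc[idx],
    stringsAsFactors = FALSE
  )
}

cel_table <- annotate_sample_type(
  filename = c("TCGA_666_A01_0070X01.CEL", "TCGA_666_A02_0070X01.CEL"),
  barcode  = c("TCGA-ZZ-A6AW-01A-01A-X00D-AB", "TCGA-ZZ-A6AW-10A-01A-X00G-AB")
)
```

Codes missing from the lookup table yield NA in the short and description columns, and the result can be written out with write.table.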