Bioinformatics

Getting docker virtual environment IP in Windows 10

Docker is a full development platform for creating containerized apps, available for Windows, GNU/Linux and macOS. Unfortunately, for Windows users, the Docker version you can get depends on the edition of Windows you are running: Windows Home users get Docker Toolbox, while Windows Pro users get Docker. This is because Windows Home systems ship without Hyper-V.

Comparing 'user' Internet connection from some Catalan research centers

Using the same technique seen in the old post “Comparing ping time between connections”, I asked some colleagues to run the following command in their research centers:
    ping www.google.com -c 200 > ping_google.txt
I then loaded the multiple ping files to create a data.frame with the icmp_seq number, the time spent per ping and the institution where the ping was run.
    ping <- lapply( files, function( file ) { dta <- read.
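As a rough illustration of the loading step described above, here is a minimal Python sketch (not the R code of the original post). It assumes the reply lines follow the usual Linux ping format ("icmp_seq=... time=... ms"); the function name parse_ping, the sample output and the institution label "center_A" are all made up for the example.

```python
import re

# Matches reply lines such as:
# "64 bytes from 142.250.184.4: icmp_seq=3 ttl=117 time=12.4 ms"
# (assumed Linux-style ping output; other systems format lines differently)
PING_LINE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+)\s*ms")


def parse_ping(text, institution):
    """Return one record per successful ping reply found in `text`."""
    records = []
    for line in text.splitlines():
        match = PING_LINE.search(line)
        if match:
            records.append({
                "icmp_seq": int(match.group(1)),
                "time_ms": float(match.group(2)),
                "institution": institution,
            })
    return records


# Illustrative (invented) content of one ping_google.txt file
sample = """PING www.google.com (142.250.184.4) 56(84) bytes of data.
64 bytes from 142.250.184.4: icmp_seq=1 ttl=117 time=11.9 ms
64 bytes from 142.250.184.4: icmp_seq=2 ttl=117 time=12.4 ms
"""
rows = parse_ping(sample, "center_A")
```

Calling parse_ping once per file and concatenating the resulting lists gives a single long table, which is the shape the comparison between institutions needs.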

Extract paired-end reads from (NCBI) SRA files

SRA stores the sequencing data from GEO experiments as files in .sra format. These files are managed using the SRA Toolkit. I recently downloaded some .sra files from GEO corresponding to paired-end sequencing data. To my surprise, when I ran the fastq-dump utility (from the SRA Toolkit) I got only one file rather than two. From the documentation of the tool, it seems that the option --split-files should be enough, but it is not.

Convert all .bam of a folder to .sam
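One possible way to do this, sketched in Python under the assumption that samtools is installed and on the PATH: build one "samtools view -h" command per .bam file in the folder. The helper name bam_to_sam_commands is hypothetical, and this is not necessarily the approach of the original post.

```python
from pathlib import Path


def bam_to_sam_commands(folder):
    """Build one `samtools view` command per .bam file in `folder`."""
    commands = []
    for bam in sorted(Path(folder).glob("*.bam")):
        sam = bam.with_suffix(".sam")
        # -h keeps the header in the output; -o names the output file
        commands.append(["samtools", "view", "-h", "-o", str(sam), str(bam)])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) to perform the conversion; building the commands first makes it easy to inspect them before anything is executed.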

Gene-Enrichment in PsyGeNET's Main-Psychiatric-Disorders

PsyGeNET is a database that integrates information on psychiatric disorders and their genes (check its About page for more information). The current version of the database centers on three main psychiatric disorders: Alcoholism, Depression and Cocaine-Related-Disorders. Currently the author of PsyGeNET, Alba Gutiérrez, and I are developing an R package (PsyGeNET2R) to query the information stored in the database and to perform some analyses using this information. We thought it would be a good idea to perform an enrichment analysis on the three main psychiatric disorders given a list of genes of interest.

Understanding hypergeometric tests

Hypergeometric tests are useful for performing enrichment analysis. As far as I can see, the most common enrichment analysis is the one where people want to obtain a list of enriched GO terms given a list of genes. The hypergeometric test is equivalent to the one-tailed Fisher’s exact test and expresses statistical significance as a p-value. For example, given a shuffled poker deck with no jokers, we want to see whether a hand of five random cards is diamond-enriched:
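The card example can be worked through with a small Python sketch. It assumes a standard 52-card deck with 13 diamonds and asks how likely it is to draw at least k diamonds in a five-card hand; the function names and the choice of k = 4 are illustrative, not taken from the original post.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k successes in n draws without replacement
    from a population of N items that contains K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_pvalue(k, N, K, n):
    """One-tailed p-value P(X >= k), i.e. the over-representation test."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))


# 52 cards in the deck, 13 of them diamonds, hand of 5 cards:
# probability of holding 4 or more diamonds by chance alone
p_four_or_more = hypergeom_pvalue(4, N=52, K=13, n=5)
```

Here p_four_or_more comes out at roughly 0.011, so a hand with four or more diamonds would be called diamond-enriched at the usual 0.05 threshold. In a gene-enrichment setting the deck becomes the gene universe, the diamonds the genes annotated to a term, and the hand the list of genes of interest.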

Playing with TCGA .CEL files and TCGA Barcodes

Today I want a file relating the names of the .CEL files from TCGA, the barcodes for these samples and the definition of the sample type in the three available forms (numeric, short and description). For example:
    filename                  barcode                       sampletype_numeric  sampletype_short  sampletype_desc
    TCGA_666_A01_0070X01.CEL  TCGA-ZZ-A6AW-01A-01A-X00D-AB  01                  TP                Primary solid Tumor
    TCGA_666_A02_0070X01.CEL  TCGA-ZZ-A6AW-10A-01A-X00G-AB  10                  NB                Blood Derived Normal
    TCGA_666_A03_0070X01.CEL  TCGA-ZZ-A6AW-01A-01A-X00D-AB  01                  TP                Primary solid Tumor
In order to provide a reusable way to create this file I wrote the following function in R:
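As a rough idea of what such a function does, here is a minimal Python sketch (an illustration, not the R function from the original post). The lookup table is deliberately partial: it contains only the two sample types that appear in the example rows, and the names SAMPLE_TYPES and annotate_cel are hypothetical.

```python
# Partial lookup: only the two sample types shown in the example table.
SAMPLE_TYPES = {
    "01": ("TP", "Primary solid Tumor"),
    "10": ("NB", "Blood Derived Normal"),
}


def annotate_cel(filename, barcode):
    """Return one table row for a .CEL file and its TCGA barcode."""
    # The fourth dash-separated field holds the two-digit sample type
    # code followed by the vial letter, e.g. "01A".
    numeric = barcode.split("-")[3][:2]
    # Codes missing from the partial lookup are reported as "NA".
    short, desc = SAMPLE_TYPES.get(numeric, ("NA", "NA"))
    return {
        "filename": filename,
        "barcode": barcode,
        "sampletype_numeric": numeric,
        "sampletype_short": short,
        "sampletype_desc": desc,
    }


row = annotate_cel("TCGA_666_A02_0070X01.CEL", "TCGA-ZZ-A6AW-10A-01A-X00G-AB")
```

Applying annotate_cel to every filename and barcode pair and writing the rows out as tab-separated text yields a file with the five columns listed above.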

TCGA Barcodes Summary

All TCGA barcodes are created by the Biospecimen Core Resource (BCR). This post aims to be a simple summary of the meaning of each part of a TCGA barcode, to serve as an easy-to-consult reference page.
Project: Project name (e.g. TCGA )
TSS: Tissue source site (e.g. 02 )
Participant: Study participant (e.g. 0001 )
Sample: Sample type (e.g. 01 )
Vial: Order of sample in a sequence of samples (e.g.
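The fields listed above can be pulled out of a barcode with a short Python sketch. It covers only the five parts named in this summary and assumes they are dash-separated, with the vial letter attached to the two-digit sample code. The barcode "TCGA-02-0001-01C" is assembled from the example values above, with the vial letter "C" being an assumed, illustrative value; the names Barcode and parse_barcode are hypothetical.

```python
from typing import NamedTuple


class Barcode(NamedTuple):
    project: str      # project name
    tss: str          # tissue source site
    participant: str  # study participant
    sample: str       # two-digit sample type code
    vial: str         # order of the sample in a sequence of samples


def parse_barcode(barcode):
    """Split the first four dash-separated fields of a TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) < 4 or len(parts[3]) < 3:
        raise ValueError(f"barcode too short to hold a sample and vial: {barcode!r}")
    project, tss, participant, sample_vial = parts[:4]
    # The fourth field combines the sample type code and the vial letter
    return Barcode(project, tss, participant, sample_vial[:2], sample_vial[2:])


fields = parse_barcode("TCGA-02-0001-01C")  # illustrative barcode
```

Any further dash-separated fields in a longer barcode are simply ignored by this sketch.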