quanTIseq documentation

Last updated: February 25th, 2019

Introduction

quanTIseq is a computational pipeline for the quantification of the Tumor Immune contexture from human RNA-seq data [1].

quanTIseq takes as input FASTQ files of RNA-seq reads from tumor samples or other cell mixtures and quantifies via deconvolution the proportions of ten different immune cell types and the fraction of other uncharacterized cells present in the heterogeneous sample (Table 1).

quanTIseq analysis consists of three steps (Figure 1):

  1. Pre-processing of raw RNA-seq reads (single- or paired-ends) with Trimmomatic [2] to remove Illumina adapter sequences, trim low-quality read ends, crop long-reads to a maximum length, and discard short reads.
  2. Quantification of gene expression with Kallisto [3] as transcripts per millions (TPM) [4] and raw counts.
  3. Expression normalization, gene re-annotation, deconvolution of cell fractions based on constrained least squares regression, and computation of cell densities (if total cells per mm2 are available from images of tumor tissue-slides).

The analysis can be run on single or multiple samples and can be initiated at any step. For instance, pre-computed expression matrices can be analyzed directly with the deconvolution module (step 3 in Figure 1).

quanTIseq can be run on Mac OS X (supported by Docker) or Linux systems (supported by Singularity) by simply calling the "quanTIseq_pipeline.sh" script (see Usage). The pipeline code and third-party software are embedded in a Docker image that simplifies quanTIseq installation and usage (see Installation).


Table 1: The cell types quantified by quanTIseq
Cell type Cell ID in the results
B cells B.cells
Classically activated macrophages (M1) Macrophages.M1
Alternatively activated macrophages (M2) Macrophages.M2
Monocytes Monocytes
Neutrophils Neutrophils
Natural killer (NK) cells NK.cells
Non-regulatory CD4+ T cells * T.cells.CD4
CD8+ T cells T.cells.CD8
Regulatory CD4+ T (Treg) cells Tregs
Dendritic cells Dendritic.cells
Other uncharacterized cells Other

* Note: T.cells.CD4 corresponds to non-regulatory CD4+ T cell fractions. Total CD4+ T cell fractions can be obtained by summing up T.cells.CD4 and Tregs.

Figure 1: quanTIseq pipeline

Installation

Mac OS X (based on Docker)

  1. Install Docker (instructions here).
  2. Download the "quanTIseq_pipeline.sh" script from here.

Linux(based on Singularity)

  1. Install Singularity (instructions here).
  2. Download the "quanTIseq_pipeline.sh" script from here.

Usage

To run quanTIseq, execute the following command:

bash quanTIseq_pipeline.sh --inputfile=path/to/input_file.txt --outputdir=path/to/outputdir [options]

--inputfile: path to the input file (mandatory). The input file is a tab-delimited text file, with no header, containing the information about the RNA-seq data to be analyzed, one sample per row. For each sample, it must contain three columns specifying: the ID of the sample, the path to the first FASTQ file, and the path to the second FASTQ file for paired-end data. For single-end data, the third column must report the string "None". FASTQ files can be gzipped (".gz" extension). To run quanTIseq with the "--pipelinestart=decon" option, the input file must be a tab-delimited text file with the gene TPM (or microarray expression values) for all samples to be deconvoluted (gene symbols on the first column and sample IDs on the first row). Expression data must be on non-log scale.


--outputdir: path to the output directory (mandatory). If it does not exist, quanTIseq will create it.


Notes

  • quanTIseq automatically selects Docker or Singularity depending on your computer OS.
  • On Mac OS X, you need to have Docker running to execute quanTIseq.
  • On Mac OS X, you might need to increase CPUs, Memory and Swap in Docker settings (Settings > Preferences > Advanced).
  • On Linux, quanTIseq stores the Singularity image "quantiseq2.img" in the current directory .

Options

--help: prints these instructions.

--pipelinestart: step from which the pipeline should be started as illustrated in Figure 1: (1) "preproc", (2) "expr", or (3) "decon". Note: if "decon", the input file must be a tab-delimited text file with the gene TPM (or microarray expression values) for all samples to be deconvoluted (gene symbols on the first column and sample IDs on the first row). Expression data must be on non-log scale.
Default: "preproc".

--tumor: specifies whether expression data are from tumor samples. If TRUE, signature genes with high expression in tumor samples are removed (see [1]). We highly recommend setting "--tumor=TRUE" when analyzing tumor data.
Default: FALSE.

--arrays: specifies whether expression data are from microarrays (instead of RNA-seq). If TRUE, the "--rmgenes" parameter is set to "none".
Default: FALSE.

--method: deconvolution method to be used: "hampel", "huber", or "bisquare" for robust regression with Huber, Hampel, or Tukey bisquare estimators, respectively, or "lsei" for constrained least squares regression. The fraction of uncharacterized cells ("other") is computed only by the "lsei" method, which estimates cell fractions referred to the total cells in the sample under investigation. For all the other methods, the cell fractions are referred to immune cells considered in the signature matrix. We recommend using the "lsei" method.
Default: "lsei".

--mRNAscale: specifies whether cell fractions must be scaled to account for cell-type-specific mRNA content. We highly recommend using the default setting: "--mRNAscale=TRUE".
Default: TRUE.

--totalcells: path to a tab-separated text file containing the total cell densities estimated from images of tumor tissue-slides. The file must have no header and contain, on the first column, the sample IDs, and on the second column, the number of total cells per mm2 estimated from tumor images (e.g. from images of haematoxylin and eosin (H&E)-stained tissue slides). quanTIseq computes cell densities for all the cell types of the signature matrix and all the samples present in both the "inputfile" and "totalcell" files. This parameter is optional and, if not set, only the deconvoluted cell fractions are returned.

--rmgenes: specifies which genes must be removed from the signature matrix before running deconvolution. This parameter can be equal to:

  1. "none": no genes are removed;
  2. "default": the list of genes reported in this file are removed;
  3. A path to a text file containing a list of genes to be removed can be specified. The text file must report one gene symbol per line, as in this example. Can be used to specify noisy genes that are known/thought to bias deconvolution results (e.g. immune genes with very high expression in the sample(s) of interest).
We advise using "default" and "none" for the deconvolution of RNA-seq and microarray data, respectively.
Default: "default".

--rawcounts: specifies whether a file of gene raw counts should be generated in addition to TPM.
Default: FALSE.

--prefix: prefix of the output files.
Default: "quanTIseq".

--threads: number of threads to be used. Note: kallisto results (gene counts and TPM) can differ slightly depending on the number of threads used.
Default: 1.

--phred: encoding of the RNA-seq quality scores for read preprocessing with Trimmomatic: "33" for Phred-33 or "64" for Phred-64.
Default: 33.

--adapterSeed: maximum number of seed mismatches for the identification of adapter sequences by Trimmomatic.
Default: 2.

--palindromeClip: threshold for palindrome clipping of adapter sequences by Trimmomatic.
Default: 30.

--simpleClip: threshold for simple clipping of adapter sequences by Trimmomatic.
Default: 10.

--trimLead: minimum Phred quality required by Trimmomatic to keep a base at the start of a read. Bases with lower quality are trimmed.
Default: 20.

--trimTrail: minimum Phred quality required by Trimmomatic to keep a base at the end of a read. Bases with lower quality are trimmed.
Default: 20.

--minLen: minimum read length required by Trimmomatic to keep a read. Reads shorter than this threshold are discared.
Default: 36.

--crop: maximum read length required by Trimmomatic. Longer reads are trimmed to this maximum length by removing bases at the end of the read.
Default: 10000.

--avgFragLen: estimated average fragment length required by Kallisto for single-end data.
Deafult: 50.

--sdFragLen: estimated standard deviation of fragment length required by Kallisto for single-end data.
Deafult: 20.

Output files

  • prefix_cell_fractions.txt: tab-delimited text file with the cell fractions estimated by quanTIseq (sample IDs on the first column and cell types on the first row). The fraction of uncharacterized cells (Table 1) is reported only when "--method=lsei".

  • prefix_gene_tpm.txt: tab-delimited text file with the gene expression in TPM (gene symbols on the first column and sample IDs on the first row). This file is NOT generated when the "--pipelinestart=decon" option is specified.

  • prefix_gene_count.txt: tab-delimited text file with the gene expression in read counts (gene symbols on the first column and sample IDs on the first row). This file is generated only when the "--rawcounts=TRUE" option is specified.

Test data

To test the quanTIseq pipeline on mock RNA-seq data (i.e. paired-end data from Sample1 and single-end data from Sample2), download in your working directory the "quanTIseq_pipeline.sh" script and the following test files:

The full quanTIseq pipeline can be run with the following command:

bash quanTIseq_pipeline.sh --inputfile=quanTIseqTest_input.txt --outputdir=quanTIseqTest --prefix=quanTIseqTest

Note: to run this test, quanTIseq must have been installed successfully (see Installation)

quanTIseq will generate the following output files:

Simulated data

The simulated data represent RNA-seq expression data from breast tumors with different immune-infiltration scenarios and were generated by mixing RNA-seq reads from purified immune cell types and from a MCF7 breast tumor cell line.

The simulated data sets were generated considering different immune relative cell proportions, tumor purity values (0:10:100%), and sequencing depths (1, 2, 5, 10, 20, 50, and 100 million read pairs). For a detailed description of the simulations, please refer to [1].

The following files are available for download and can be used to benchmark deconvolution algorithms:

Note: True mRNA fractions were simulated with no total-mRNA bias. Thus, these data should be analyzed specifying the "--mRNAscale=FALSE" option. quanTIseq estimates generated in this manner can be directly compared to the true mRNA fractions.

License

quanTIseq project is released under BSD 3-Clause License. The licensing scheme of third-party software is described here.

Citation

If you use quanTIseq in your research, please cite:

Finotello F, Mayer C, Plattner C, Laschober G, Rieder D, Hackl H, Krogsdam A, Loncova Z, Posch W, Wilflingseder D, Sopper S, Ijsselsteijn M, Brouwer TP, Johnson D, Xu Y, Wang Y, Sanders ME, Estrada MV, Ericsson-Gonzalez P, Charoentong P, Balko J, de Miranda NFDCC, Trajanoski Z. Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome Medicine, 2019. 11(1):34.

References

[1] Finotello F, Mayer C, Plattner C, Laschober G, Rieder D, Hackl H, Krogsdam A, Loncova Z, Posch W, Wilflingseder D, Sopper S, Ijsselsteijn M, Brouwer TP, Johnson D, Xu Y, Wang Y, Sanders ME, Estrada MV, Ericsson-Gonzalez P, Charoentong P, Balko J, de Miranda NFDCC, Trajanoski Z. Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data. Genome Medicine, 2019. 11(1):34.

[2] A. M. Bolger, M. Lohse, and B. Usadel, "Trimmomatic: a flexible trimmer for Illumina sequence data", Bioinforma. Oxf. Engl., vol. 30, no. 15, pp. 2114–2120, Aug. 2014.

[3] N. L. Bray, H. Pimentel, P. Melsted, and L. Pachter, "Near-optimal probabilistic RNA-seq quantification", Nat. Biotechnol., vol. 34, no. 5, pp. 525–527, May 2016.

[4] B. Li and C. N. Dewey, "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome", BMC Bioinformatics, vol. 12, p. 323, 2011.