Usage¶
Note
MetaFunc uses Kaiju’s nr_euk as reference database, and databases provided are also built off of the nr_euk database.
Install Dependencies¶
Miniconda is used to create software environments necessary to run the workflow. The workflow is then run using the workflow engine Snakemake. In order to run the workflow with Singularity, a working Singularity 3.5 installation is needed.
Install miniconda¶
To install miniconda on Linux:
$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
To install miniconda on MacOS:
$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh
Install Snakemake¶
Create a conda environment and install Snakemake.
$ conda create -n snakemake snakemake=5.9.1
Note
We highly recommend Snakemake version 5.9.1 as, as of time of writing, higher versions do not work with Singularity arguments.
Databases¶
You may either download databases pre-packaged, or build your own databases.
Download Databases¶
We provide pre-built databases for the workflow downloadable from Zenodo. The data is quite big (~56 GB) so download times can vary.
$ wget https://zenodo.org/record/5602157/files/nrgo_202001_updated.tar.bz2
$ wget https://zenodo.org/record/5602157/files/202001_nrgo_md5sums.txt
$ wget https://zenodo.org/record/5602178/files/kaijudb_nreuk_202001.tar.bz2
$ wget https://zenodo.org/record/5602178/files/202001_kaijudb_md5sums.txt
Note
These databases were compiled using Kaiju and nrgo. Files used to build the databases were obtained around January 27 - 28, 2020.
Check if download was successful:
$ md5sum -c 202001_nrgo_md5sums.txt
nrgo_202001_updated.tar.bz2: OK
$ md5sum -c 202001_kaijudb_md5sums.txt
kaijudb_nreuk_202001.tar.bz2: OK
Uncompress the tarballed, bzip’ed-files:
$ tar -xvjf kaijudb_nreuk_202001.tar.bz2
$ tar -xvjf nrgo_202001_updated.tar.bz2
Take note of the paths, where you placed the uncompressed directories as they will be necessary for Adjust config.yaml.
Build Databases¶
Alternatively, you can create your own databases but the process might take some time to complete. There are two databases needed to run the workflow:
1. Building Kaiju Databases:¶
You need to build a nr_euk Kaiju database.
Create a directory where these database files will be created.
Then run the following commands in that directory:
$ conda create -n kaiju kaiju=1.7.3
$ conda activate kaiju
$ kaiju-makedb -s nr_euk -t 12
# -t refers to the number of cores to be used in creating the database and can be changed depending on your machine's capacity
Do not delete just yet any files created by kaiju-makedb, even if
they are not necessary to run Kaiju. We will use some of those files
to build necessary databases for our purposes. Take note of the
database directory path. This is necessary for section: Adjust config.yaml.
2. Building nrgo Databases:¶
Using some of the files downloaded by Kaiju, we can now run the nrgo workflow to obtain the following files:
groups.pickle
strain2species.pickle
nrgo.sqlite
go-basic.obo
taxon configuration file
Take note of the path to the results directory. This is necessary
for section Adjust config.yaml.
3. Create a mapping reference index of the host (optional)¶
We suggest to first remove reads from the input that belong to the host, however this is optional. The workflow can map all reads to a reference genome index. This index need to be built before running the workflow. For example for the human host:
# download host genome
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh38.primary_assembly.genome.fa.gz
gzip -d GRCh38.primary_assembly.genome.fa.gz
# download corresponding gene annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.primary_assembly.annotation.gtf.gz
gzip -d gencode.v33.primary_assembly.annotation.gtf.gz
# build STAR index
conda create --yes -n star star=2.7.3a
conda activate star
mkdir genome
STAR --runMode genomeGenerate --runThreadN 12 --genomeDir genome --genomeFastaFiles GRCh38.primary_assembly.genome.fa
4. Download .gmt file for host GSEA (optional)¶
Currently, for humans, we recommend downloading the Gene Matrix Transposed (gmt) type of files from the Molecular Signatures Database (MSigDB) v7.0.
Important
Ensure that the gene IDs in your gmt-file are of the same format as your gtf- annotation file. Depending on the used identifers in the reference one needs to convert to the identifier type or use different source for your gmt-file, e.g. ConsensusPathDB.
Install Workflow¶
$ git clone https://gitlab.com/schmeierlab/workflows/metafunc.git
Adjust config.yaml¶
Currently it is required to submit the config parameters via
--configfile config.yaml. Change the options in config.yaml:
resultdir: Point to the directory where results of this workflow will be saved.
samples: Point to the sample file. See section: Sample Sheet Format.
trimming:
mapping:
perform: Set to True if mapping with STAR to a host genome.
index: Path to host genome index generated with STAR
strandedness: Strandedness of RNA-seq data. Can be one of None, yes, and reverse
star_extra: Specify any extra parameters for STAR
gsea_gmt: Path togmt-file for use in GSEA. If empty string “”, GSEA will not be performed.if groups are specified, the following options will be used:
dgea_mincount: Minimum count threshold to retain a gene in DGEA. Default is 1.
cor_pv: Maximum FDR for a gene or species to be used for correlation analysis. Default is 0.05.
cor_top: Use top n DEG/DA based on FDR for correlation analysis (e.g. if set to 100, use top 100 DEG/DA for correlation analysis). Default is 100.
groups:
use: Indicate True if samples belong to groups/conditions that may be compared (e.g. disease vs normal).
contrasts:contrast1-vs-contrast2 (Label referring to group comparisons to be made)
contrast 1 (baseline group in comparisons (e.g. normal))
contrast 2 (effect group in comparisons (e.g. with disease))
DA_mincount: Minimum count threshold to retain a species for DA. Default is 1.
kaijuDB:
kaiju_NRGO: Point to the nrgo results directory. See section: Databases
KaijuRun:
kaijuOptions: Specify additional (non-default) Kaiju parameters as a string.
Filters:
abundance: Percent abundance cutoff for a taxid to be included in analysis (i.e. taxid should be at least equal to cutoff in at least 1 of the samples to be analyzed). Default is 0.001%.10.
TaxChoices: Give a list of all taxon names from kaiju-taxonlistEuk.tsv for which the user would like to obtain Functional/Gene Ontology summaries. Default is Viruses.
Taxon 1
Taxon 2
…
Taxon N
Sample Sheet Format¶
The sample sheet should have the following format:
sample |
file_mate1 |
file_mate2 |
file_singles |
group |
|---|---|---|---|---|
Sample Name |
Paired-end forward read |
Paired-end reverse read |
Single Read (or singleton from trimming) |
Condition to test |
… |
… |
… |
… |
… |
… |
… |
… |
… |
… |
Note
The workflow can consider different combinations of input reads. Entries in columns 2 and 3 MUST be filled together. An entry in column 4 can exist on its own or in combination with entries in columns 2 and 3. Entries in either columns 2 and 3, or column 4 are required (or all 3). If column 5 is not specified, a default value of “NoGroup” is given to all samples.
Execute Workflow¶
The workflow can be executed in different “modes”, depending if certain 3rd party software is installed on the system. In particular Singularity (virtualisation software) makes the workflow more reproducibile but might not be readily installed on all systems. In that case use the “conda-only mode”.
### activate snakemake environment
conda activate snakemake
### Do a dryrun of the workflow, show rules, order, and commands
$ snakemake -np --configfile config.yaml
### Recommended singularity-only mode for reproducibility
# No tool download needed. Singularity container will be pulled and used for the workflow.
# Only works on systems where Singularity has been installed.
$ snakemake -p --use-singularity --cores 12 --resources const=1 --configfile config.yaml > run.log 2>&1
# If necessary bind more folders for singularity outside of your home.
$ snakemake -p --use-singularity --cores 12 --singularity-args "--bind /mnt/disk2" --resources const=1 --configfile config.yaml > run.log 2>&1
### Conda-only run
# Will download tools and store them in environment files in the .snakemake/conda dir
$ snakemake -p --use-conda --cores 12 --resources const=1 --configfile config.yaml > run.log 2>&1
Investigate Results¶
Host¶
Gene ID: Sample Tables
Differential Gene Expression Analysis
Gene Set Enrichment Analysis
Microbiome¶
TaxIDTables
Species Differential Abundance Analysis
Functional
Supplementary Files (in ``source/`` directory)
see section: Result files in directory source for detailed description of files in source directory
Log Files
Some important information may be found in Log files
Host-Microbiome¶
Spearman Correlation Analysis
R Shiny Visualisation¶
MetaFunc creates an interactive application that allows for exploration of results. For instructions on how to launch application, see R Shiny Application.