Usage¶

Note

MetaFunc uses Kaiju’s nr_euk as reference database, and databases provided are also built off of the nr_euk database.

Install Dependencies¶

Miniconda is used to create software environments necessary to run the workflow. The workflow is then run using the workflow engine Snakemake. In order to run the workflow with Singularity, a working Singularity 3.5 installation is needed.

Install miniconda¶

To install miniconda on Linux:

$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

To install miniconda on MacOS:

$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh

Install Snakemake¶

Create a conda environment and install Snakemake.

$ conda create -n snakemake snakemake=5.9.1

Note

We highly recommend Snakemake version 5.9.1 as, as of time of writing, higher versions do not work with Singularity arguments.

Databases¶

You may either download databases pre-packaged, or build your own databases.

Download Databases¶

We provide pre-built databases for the workflow downloadable from Zenodo. The data is quite big (~56 GB) so download times can vary.

$ wget https://zenodo.org/record/5602157/files/nrgo_202001_updated.tar.bz2
$ wget https://zenodo.org/record/5602157/files/202001_nrgo_md5sums.txt
$ wget https://zenodo.org/record/5602178/files/kaijudb_nreuk_202001.tar.bz2
$ wget https://zenodo.org/record/5602178/files/202001_kaijudb_md5sums.txt

Note

These databases were compiled using Kaiju and nrgo. Files used to build the databases were obtained around January 27 - 28, 2020.

Check if download was successful:

$ md5sum -c 202001_nrgo_md5sums.txt
nrgo_202001_updated.tar.bz2: OK

$ md5sum -c 202001_kaijudb_md5sums.txt
kaijudb_nreuk_202001.tar.bz2: OK

Uncompress the tarballed, bzip’ed-files:

$ tar -xvjf kaijudb_nreuk_202001.tar.bz2
$ tar -xvjf nrgo_202001_updated.tar.bz2

Take note of the paths, where you placed the uncompressed directories as they will be necessary for Adjust config.yaml.

Build Databases¶

Alternatively, you can create your own databases but the process might take some time to complete. There are two databases needed to run the workflow:

1. Building Kaiju Databases:¶

You need to build a nr_euk Kaiju database. Create a directory where these database files will be created. Then run the following commands in that directory:

$ conda create -n kaiju kaiju=1.7.3
$ conda activate kaiju
$ kaiju-makedb -s nr_euk -t 12
# -t refers to the number of cores to be used in creating the database and can be changed depending on your machine's capacity

Do not delete just yet any files created by kaiju-makedb, even if they are not necessary to run Kaiju. We will use some of those files to build necessary databases for our purposes. Take note of the database directory path. This is necessary for section: Adjust config.yaml.

2. Building nrgo Databases:¶

Using some of the files downloaded by Kaiju, we can now run the nrgo workflow to obtain the following files:

groups.pickle

strain2species.pickle

nrgo.sqlite

go-basic.obo

taxon configuration file

Take note of the path to the results directory. This is necessary for section Adjust config.yaml.

3. Create a mapping reference index of the host (optional)¶

We suggest to first remove reads from the input that belong to the host, however this is optional. The workflow can map all reads to a reference genome index. This index need to be built before running the workflow. For example for the human host:

# download host genome
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh38.primary_assembly.genome.fa.gz
gzip -d GRCh38.primary_assembly.genome.fa.gz

# download corresponding gene annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.primary_assembly.annotation.gtf.gz
gzip -d gencode.v33.primary_assembly.annotation.gtf.gz

# build STAR index
conda create --yes -n star star=2.7.3a
conda activate star
mkdir genome
STAR --runMode genomeGenerate --runThreadN 12 --genomeDir genome --genomeFastaFiles GRCh38.primary_assembly.genome.fa

4. Download `.gmt` file for host GSEA (optional)¶

Currently, for humans, we recommend downloading the Gene Matrix Transposed (gmt) type of files from the Molecular Signatures Database (MSigDB) v7.0.

Important

Ensure that the gene IDs in your gmt-file are of the same format as your gtf- annotation file. Depending on the used identifers in the reference one needs to convert to the identifier type or use different source for your gmt-file, e.g. ConsensusPathDB.

Install Workflow¶

$ git clone https://gitlab.com/schmeierlab/workflows/metafunc.git

Adjust config.yaml¶

Currently it is required to submit the config parameters via --configfile config.yaml. Change the options in config.yaml:

resultdir: Point to the directory where results of this workflow will be saved.

samples: Point to the sample file. See section: Sample Sheet Format.

trimming:

perform: Set to True to trim reads using fastp.

extra: Specify any extra parameters for fastp

mapping:

perform: Set to True if mapping with STAR to a host genome.

index: Path to host genome index generated with STAR

strandedness: Strandedness of RNA-seq data. Can be one of None, yes, and reverse

star_extra: Specify any extra parameters for STAR

gsea_gmt: Path to gmt-file for use in GSEA. If empty string “”, GSEA will not be performed.

if groups are specified, the following options will be used:

dgea_mincount: Minimum count threshold to retain a gene in DGEA. Default is 1.

cor_pv: Maximum FDR for a gene or species to be used for correlation analysis. Default is 0.05.

cor_top: Use top n DEG/DA based on FDR for correlation analysis (e.g. if set to 100, use top 100 DEG/DA for correlation analysis). Default is 100.

groups:

use: Indicate True if samples belong to groups/conditions that may be compared (e.g. disease vs normal).

contrasts:

contrast1-vs-contrast2 (Label referring to group comparisons to be made)

contrast 1 (baseline group in comparisons (e.g. normal))

contrast 2 (effect group in comparisons (e.g. with disease))

DA_mincount: Minimum count threshold to retain a species for DA. Default is 1.

kaijuDB:

dir: Point to the Kaiju database directory. See section: Databases

fmi: Specify the exact .fmi file in the dir directory.

kaiju_NRGO: Point to the nrgo results directory. See section: Databases

KaijuRun:

kaijuOptions: Specify additional (non-default) Kaiju parameters as a string.

Filters:

abundance: Percent abundance cutoff for a taxid to be included in analysis (i.e. taxid should be at least equal to cutoff in at least 1 of the samples to be analyzed). Default is 0.001%.

10. TaxChoices: Give a list of all taxon names from kaiju-taxonlistEuk.tsv for which the user would like to obtain Functional/Gene Ontology summaries. Default is Viruses.

Taxon 1

Taxon 2

…

Taxon N

Sample Sheet Format¶

The sample sheet should have the following format:

sample	file_mate1	file_mate2	file_singles	group
Sample Name	Paired-end forward read	Paired-end reverse read	Single Read (or singleton from trimming)	Condition to test
…	…	…	…	…
…	…	…	…	…

Note

The workflow can consider different combinations of input reads. Entries in columns 2 and 3 MUST be filled together. An entry in column 4 can exist on its own or in combination with entries in columns 2 and 3. Entries in either columns 2 and 3, or column 4 are required (or all 3). If column 5 is not specified, a default value of “NoGroup” is given to all samples.

Execute Workflow¶

The workflow can be executed in different “modes”, depending if certain 3rd party software is installed on the system. In particular Singularity (virtualisation software) makes the workflow more reproducibile but might not be readily installed on all systems. In that case use the “conda-only mode”.

### activate snakemake environment
conda activate snakemake

### Do a dryrun of the workflow, show rules, order, and commands
$ snakemake -np --configfile config.yaml

### Recommended singularity-only mode for reproducibility
# No tool download needed. Singularity container will be pulled and used for the workflow.
# Only works on systems where Singularity has been installed.
$ snakemake -p --use-singularity --cores 12 --resources const=1 --configfile config.yaml > run.log 2>&1

# If necessary bind more folders for singularity outside of your home.
$ snakemake -p --use-singularity --cores 12 --singularity-args "--bind /mnt/disk2" --resources const=1 --configfile config.yaml > run.log 2>&1

### Conda-only run
# Will download tools and store them in environment files in the .snakemake/conda dir
$ snakemake -p --use-conda --cores 12 --resources const=1  --configfile config.yaml > run.log 2>&1

Investigate Results¶

Host¶

Gene ID: Sample Tables

see Files: all.counts.tsv or all.tpm.tsv
Differential Gene Expression Analysis

see Differential Gene Expression Analysis (DGEA)
Gene Set Enrichment Analysis

see Gene Set Enrichment Analysis (GSEA)

Microbiome¶

TaxIDTables

see Result files in directory taxonomy
Species Differential Abundance Analysis

see Differential Abundance Tables
Functional

see Result files in directory function
Supplementary Files (in ``source/`` directory)

see section: Result files in directory source for detailed description of files in source directory
Log Files

Some important information may be found in Log files

Host-Microbiome¶

Spearman Correlation Analysis

see Result files in host-microbiome/correlation

R Shiny Visualisation¶

MetaFunc creates an interactive application that allows for exploration of results. For instructions on how to launch application, see R Shiny Application.

Usage¶

Install Dependencies¶

Install miniconda¶

Install Snakemake¶

Databases¶

Download Databases¶

Build Databases¶

1. Building Kaiju Databases:¶

2. Building nrgo Databases:¶

3. Create a mapping reference index of the host (optional)¶

4. Download .gmt file for host GSEA (optional)¶

Install Workflow¶

Adjust config.yaml¶

Sample Sheet Format¶

Execute Workflow¶

Investigate Results¶

Host¶

Microbiome¶

Host-Microbiome¶

R Shiny Visualisation¶

4. Download `.gmt` file for host GSEA (optional)¶