Usage

Note

MetaFunc uses Kaiju’s nr_euk as reference database, and databases provided are also built off of the nr_euk database.

Install Dependencies

Miniconda is used to create software environments necessary to run the workflow. The workflow is then run using the workflow engine Snakemake. In order to run the workflow with Singularity, a working Singularity 3.5 installation is needed.

Install miniconda

To install miniconda on Linux:

$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

To install miniconda on MacOS:

$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh

Install Snakemake

Create a conda environment and install Snakemake.

$ conda create -n snakemake snakemake=5.9.1

Note

We highly recommend Snakemake version 5.9.1 as, as of time of writing, higher versions do not work with Singularity arguments.


Databases

You may either download databases pre-packaged, or build your own databases.

Download Databases

We provide pre-built databases for the workflow downloadable from Zenodo. The data is quite big (~56 GB) so download times can vary.

$ wget https://zenodo.org/record/5602157/files/nrgo_202001_updated.tar.bz2
$ wget https://zenodo.org/record/5602157/files/202001_nrgo_md5sums.txt
$ wget https://zenodo.org/record/5602178/files/kaijudb_nreuk_202001.tar.bz2
$ wget https://zenodo.org/record/5602178/files/202001_kaijudb_md5sums.txt

Note

These databases were compiled using Kaiju and nrgo. Files used to build the databases were obtained around January 27 - 28, 2020.

Check if download was successful:

$ md5sum -c 202001_nrgo_md5sums.txt
nrgo_202001_updated.tar.bz2: OK

$ md5sum -c 202001_kaijudb_md5sums.txt
kaijudb_nreuk_202001.tar.bz2: OK

Uncompress the tarballed, bzip’ed-files:

$ tar -xvjf kaijudb_nreuk_202001.tar.bz2
$ tar -xvjf nrgo_202001_updated.tar.bz2

Take note of the paths, where you placed the uncompressed directories as they will be necessary for Adjust config.yaml.

Build Databases

Alternatively, you can create your own databases but the process might take some time to complete. There are two databases needed to run the workflow:

1. Building Kaiju Databases:

You need to build a nr_euk Kaiju database. Create a directory where these database files will be created. Then run the following commands in that directory:

$ conda create -n kaiju kaiju=1.7.3
$ conda activate kaiju
$ kaiju-makedb -s nr_euk -t 12
# -t refers to the number of cores to be used in creating the database and can be changed depending on your machine's capacity

Do not delete just yet any files created by kaiju-makedb, even if they are not necessary to run Kaiju. We will use some of those files to build necessary databases for our purposes. Take note of the database directory path. This is necessary for section: Adjust config.yaml.

2. Building nrgo Databases:

Using some of the files downloaded by Kaiju, we can now run the nrgo workflow to obtain the following files:

  • groups.pickle

  • strain2species.pickle

  • nrgo.sqlite

  • go-basic.obo

  • taxon configuration file

Take note of the path to the results directory. This is necessary for section Adjust config.yaml.

3. Create a mapping reference index of the host (optional)

We suggest to first remove reads from the input that belong to the host, however this is optional. The workflow can map all reads to a reference genome index. This index need to be built before running the workflow. For example for the human host:

# download host genome
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/GRCh38.primary_assembly.genome.fa.gz
gzip -d GRCh38.primary_assembly.genome.fa.gz

# download corresponding gene annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.primary_assembly.annotation.gtf.gz
gzip -d gencode.v33.primary_assembly.annotation.gtf.gz

# build STAR index
conda create --yes -n star star=2.7.3a
conda activate star
mkdir genome
STAR --runMode genomeGenerate --runThreadN 12 --genomeDir genome --genomeFastaFiles GRCh38.primary_assembly.genome.fa

4. Download .gmt file for host GSEA (optional)

Currently, for humans, we recommend downloading the Gene Matrix Transposed (gmt) type of files from the Molecular Signatures Database (MSigDB) v7.0.

Important

Ensure that the gene IDs in your gmt-file are of the same format as your gtf- annotation file. Depending on the used identifers in the reference one needs to convert to the identifier type or use different source for your gmt-file, e.g. ConsensusPathDB.


Install Workflow

$ git clone https://gitlab.com/schmeierlab/workflows/metafunc.git

Adjust config.yaml

Currently it is required to submit the config parameters via --configfile config.yaml. Change the options in config.yaml:

  1. resultdir: Point to the directory where results of this workflow will be saved.

  2. samples: Point to the sample file. See section: Sample Sheet Format.

  3. trimming:

    1. perform: Set to True to trim reads using fastp.

    2. extra: Specify any extra parameters for fastp

  4. mapping:

    1. perform: Set to True if mapping with STAR to a host genome.

    2. index: Path to host genome index generated with STAR

    3. strandedness: Strandedness of RNA-seq data. Can be one of None, yes, and reverse

    4. star_extra: Specify any extra parameters for STAR

    5. gsea_gmt: Path to gmt-file for use in GSEA. If empty string “”, GSEA will not be performed.

    if groups are specified, the following options will be used:

    1. dgea_mincount: Minimum count threshold to retain a gene in DGEA. Default is 1.

    2. cor_pv: Maximum FDR for a gene or species to be used for correlation analysis. Default is 0.05.

    3. cor_top: Use top n DEG/DA based on FDR for correlation analysis (e.g. if set to 100, use top 100 DEG/DA for correlation analysis). Default is 100.

  5. groups:

    1. use: Indicate True if samples belong to groups/conditions that may be compared (e.g. disease vs normal).

    2. contrasts:

      contrast1-vs-contrast2 (Label referring to group comparisons to be made)

      contrast 1 (baseline group in comparisons (e.g. normal))

      contrast 2 (effect group in comparisons (e.g. with disease))

    3. DA_mincount: Minimum count threshold to retain a species for DA. Default is 1.

  6. kaijuDB:

    1. dir: Point to the Kaiju database directory. See section: Databases

    2. fmi: Specify the exact .fmi file in the dir directory.

  7. kaiju_NRGO: Point to the nrgo results directory. See section: Databases

  8. KaijuRun:

    1. kaijuOptions: Specify additional (non-default) Kaiju parameters as a string.

  9. Filters:

    1. abundance: Percent abundance cutoff for a taxid to be included in analysis (i.e. taxid should be at least equal to cutoff in at least 1 of the samples to be analyzed). Default is 0.001%.

10. TaxChoices: Give a list of all taxon names from kaiju-taxonlistEuk.tsv for which the user would like to obtain Functional/Gene Ontology summaries. Default is Viruses.

  • Taxon 1

  • Taxon 2

  • Taxon N

Sample Sheet Format

The sample sheet should have the following format:

sample

file_mate1

file_mate2

file_singles

group

Sample Name

Paired-end forward read

Paired-end reverse read

Single Read (or singleton from trimming)

Condition to test

Note

The workflow can consider different combinations of input reads. Entries in columns 2 and 3 MUST be filled together. An entry in column 4 can exist on its own or in combination with entries in columns 2 and 3. Entries in either columns 2 and 3, or column 4 are required (or all 3). If column 5 is not specified, a default value of “NoGroup” is given to all samples.


Execute Workflow

The workflow can be executed in different “modes”, depending if certain 3rd party software is installed on the system. In particular Singularity (virtualisation software) makes the workflow more reproducibile but might not be readily installed on all systems. In that case use the “conda-only mode”.

### activate snakemake environment
conda activate snakemake

### Do a dryrun of the workflow, show rules, order, and commands
$ snakemake -np --configfile config.yaml

### Recommended singularity-only mode for reproducibility
# No tool download needed. Singularity container will be pulled and used for the workflow.
# Only works on systems where Singularity has been installed.
$ snakemake -p --use-singularity --cores 12 --resources const=1 --configfile config.yaml > run.log 2>&1

# If necessary bind more folders for singularity outside of your home.
$ snakemake -p --use-singularity --cores 12 --singularity-args "--bind /mnt/disk2" --resources const=1 --configfile config.yaml > run.log 2>&1

### Conda-only run
# Will download tools and store them in environment files in the .snakemake/conda dir
$ snakemake -p --use-conda --cores 12 --resources const=1  --configfile config.yaml > run.log 2>&1

Investigate Results

Host

  1. Gene ID: Sample Tables

  2. Differential Gene Expression Analysis

  3. Gene Set Enrichment Analysis

Microbiome

  1. TaxIDTables

  2. Species Differential Abundance Analysis

  3. Functional

  4. Supplementary Files (in ``source/`` directory)

    see section: Result files in directory source for detailed description of files in source directory

  5. Log Files

    Some important information may be found in Log files

Host-Microbiome

  1. Spearman Correlation Analysis

R Shiny Visualisation

MetaFunc creates an interactive application that allows for exploration of results. For instructions on how to launch application, see R Shiny Application.