Details of pipeline results

Microbiome

Results Directory Structure

analyses/results
└── microbiome
        ├── source
        │   ├── classification
        │   ├── protein
        │   ├── taxon
        │   └── go
        │       ├── ALL
        │       │    └── {taxon}
        │       ├── per_sample
        │       │    └── {taxon}_samples
        │       └── per_group
        │            └── {taxon}_grouped
        ├── taxonomy
        │    ├── per_sample
        │    ├── per_group
        │    │    └── DA
        │    │         └── {contrasts}
        │    └── ALL
        │
        └── function
             ├── per_sample
             ├── per_group
             └── ALL
  • ‘{}’ indicates a wildcard (e.g. {contrasts} stands for a contrast such as contrast1-vs-contrast2, as specified in the config.yaml file; see Adjust config.yaml).

Result files in directory taxonomy

These are the main result files, summarized from the species-level classifications in the tables in Directory: source/taxon/. A TaxID is excluded if its abundance is below a specified cutoff (default: 0.001% abundance in a sample) in all samples of a dataset. Note that scaled read counts are still calculated against the total number of reads classified at the species level, including the species that were filtered out.
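The filtering rule above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name `filter_taxids`, the dict-of-lists layout, the example values, and the choice to treat a value equal to the cutoff as passing are all assumptions. Only the rule itself (drop a TaxID when it is below the cutoff in every sample, without rescaling the rest) comes from the text.

```python
def filter_taxids(scaled, cutoff=0.001):
    """Keep a TaxID if it reaches the cutoff (percent) in at least one sample.

    `scaled` maps TaxID -> list of scaled read counts (percent), one per sample.
    Values are returned unchanged: nothing is rescaled after filtering.
    """
    return {
        taxid: values
        for taxid, values in scaled.items()
        if any(v >= cutoff for v in values)
    }


# Hypothetical scaled read counts (percent) for three samples.
scaled = {
    "TaxID 1": [12.5, 0.0004, 3.1],    # above the cutoff in two samples: kept
    "TaxID 2": [0.0002, 0.0009, 0.0],  # below the cutoff everywhere: dropped
    "TaxID 3": [0.0, 0.002, 0.0],      # above the cutoff in one sample: kept
}
kept = filter_taxids(scaled)
print(sorted(kept))
```

A TaxID that is rare in most samples therefore stays in the table as long as a single sample passes the cutoff.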

Result files in this directory have the following format:

TaxID   | Kingdom | Phylum | Class | Order | Family | Genus | Species | RootTaxon | Sample 1 [1] | Sample 2 | ... | Sample N
TaxID 1 |         |        |       |       |        |       |         |           | ### [2]      | ###      | ### | ###
TaxID 2 |         |        |       |       |        |       |         |           | ###          | ###      | ### | ###
TaxID 3 |         |        |       |       |        |       |         |           | ###          | ###      | ### | ###
TaxID N |         |        |       |       |        |       |         |           | ###          | ###      | ### | ###

[1] Sample or Group/ALL: columns refer either to individual samples or to the groups you specify. Another directory designated as ALL contains all samples combined as one group. Grouped outputs are found in per_group/grouped_sptable_pct.tsv or ALL/ALL_spTable_pct.tsv, and individual outputs are found in per_sample/all_sptable.tsv or per_sample/all_sptableScaled.tsv.

[2] Raw or Scaled Read Counts: These are read counts per taxonomy ID. For individual outputs, raw and scaled read counts are given as separate tables. The scaled read count of a taxonomy ID in a sample is its read count divided by the total number of reads mapped at the species level in that sample, multiplied by 100. For grouped outputs, the scaled read counts calculated per sample are averaged across all members of a group. Raw counts are not given per group because we do not consider them informative, especially if group sizes are unequal.
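The scaling and group averaging described in the footnotes can be sketched as below. This is an illustration under assumptions, not the pipeline's implementation: the function names, the dict layout, the example counts, and the treatment of a TaxID that is missing from a sample as 0 are all hypothetical.

```python
def scale_counts(raw, species_total):
    """Scale raw counts to percent of all species-level reads in the sample."""
    return {taxid: 100.0 * n / species_total for taxid, n in raw.items()}


def group_mean(scaled_samples):
    """Average scaled counts per TaxID across the samples of one group."""
    taxids = set().union(*scaled_samples)
    n = len(scaled_samples)
    # Assumption: a TaxID absent from a sample counts as 0 in that sample.
    return {t: sum(s.get(t, 0.0) for s in scaled_samples) / n for t in taxids}


# Hypothetical raw species-level counts for two samples of one group.
sample_a = {"TaxID 1": 50, "TaxID 2": 150}   # 200 species-level reads
sample_b = {"TaxID 1": 300, "TaxID 2": 100}  # 400 species-level reads

scaled_a = scale_counts(sample_a, species_total=200)
scaled_b = scale_counts(sample_b, species_total=400)
grouped = group_mean([scaled_a, scaled_b])
print(grouped["TaxID 1"], grouped["TaxID 2"])  # 50.0 50.0
```

In this toy example, pooling the raw reads of both samples would instead give 350/600 (about 58.3%) for TaxID 1, because the deeper sample dominates. Averaging the per-sample percentages gives every sample equal weight.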

Differential Abundance Tables

Differential abundance result tables can be found in the per_group/DA/{contrasts}/diffab.tsv.gz file, where {contrasts} is the contrast (or contrasts) indicated in the config file.

This file contains the result of the differential abundance analysis of species, carried out using edgeR’s exact test with p-values adjusted using the FDR method.

Row names have the following format: TaxID_TaxonName.
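When post-processing this table, the row names can be split back into their two parts. The helper below is a hypothetical sketch: it assumes the TaxID itself contains no underscore and splits only at the first one, so that taxon names containing underscores stay intact. The example row name is made up for illustration.

```python
def split_row_name(row_name):
    """Split 'TaxID_TaxonName' at the first underscore only."""
    taxid, _, taxon_name = row_name.partition("_")
    return taxid, taxon_name


# Hypothetical row name; real names come from the diffab.tsv.gz file.
print(split_row_name("562_Escherichia coli"))
```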


Result files in directory function

These are the main result files, summarized from the tables in Directory: source/go/.

Result files in this directory have the following format:

GO_ID   | Description | GO_Namespace | Sample 1 [1] | Sample 2 | ... | Sample N
GO_ID 1 |             |              | ### [2]      | ###      | ### | ###
GO_ID 2 |             |              | ###          | ###      | ### | ###
GO_ID N |             |              | ###          | ###      | ### | ###

[1] Sample or Group/ALL: columns refer either to individual samples or to samples joined per group you specify. Another directory designated as ALL contains all samples combined as one group. Grouped outputs are found in per_group/{taxon}_grouped_goPercent.tsv or ALL/{taxon}_ALLasGroup_goPercent.tsv files, and individual outputs are found in per_sample/{taxon}_samples_goPercent.tsv files.

[2] Percent of GO_ID: Only proteins from taxonomies that pass the abundance cutoff are included, and only species-level reads are considered. Note that scaled read counts are still calculated against the total number of reads classified at the species level, including the species that were filtered out. For each GO_ID, the scaled proportional read counts of the proteins annotated with that GO_ID are summed. This sum is divided by the total scaled proportional read counts of all GO_IDs belonging to the same namespace (biological_process, molecular_function, or cellular_component) and multiplied by 100 to get the percentage (see sp_Percent in Directory: source/go/). For grouped samples, the scaled proportional read counts of each protein accession are first averaged among the members of the specified group. These averages are then summed per GO_ID, and the same procedure is used to get the final percentage.
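The percentage calculation described above can be sketched in three steps. This is a minimal illustration under assumptions: the function name `go_percent`, the three input mappings, and the placeholder identifiers (`P1`, `GO:A`, and so on) are hypothetical and do not come from the pipeline.

```python
from collections import defaultdict


def go_percent(protein_reads, protein_go, go_namespace):
    """Percent of each GO_ID within its namespace.

    protein_reads: accession -> scaled proportional read count
    protein_go:    accession -> list of GO_IDs annotating that protein
    go_namespace:  GO_ID -> namespace name
    """
    # Step 1: sum the scaled proportional reads of all proteins per GO_ID.
    go_sum = defaultdict(float)
    for accession, reads in protein_reads.items():
        for go_id in protein_go.get(accession, []):
            go_sum[go_id] += reads

    # Step 2: total per namespace over all of its GO_IDs.
    ns_total = defaultdict(float)
    for go_id, total in go_sum.items():
        ns_total[go_namespace[go_id]] += total

    # Step 3: express each GO_ID as a percent of its namespace total.
    return {g: 100.0 * s / ns_total[go_namespace[g]] for g, s in go_sum.items()}


# Placeholder accessions and GO_IDs, for illustration only.
protein_reads = {"P1": 2.0, "P2": 6.0, "P3": 4.0}
protein_go = {"P1": ["GO:A"], "P2": ["GO:A", "GO:B"], "P3": ["GO:C"]}
go_namespace = {
    "GO:A": "biological_process",
    "GO:B": "biological_process",
    "GO:C": "molecular_function",
}
pct = go_percent(protein_reads, protein_go, go_namespace)
print({g: round(p, 2) for g, p in sorted(pct.items())})
```

For grouped output, `protein_reads` would hold the per-accession averages across the group members instead of the values of a single sample. The rest of the calculation stays the same.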


Result files in directory source

Directory: source/taxon/

These are Taxonomy ID-based tables per sample with the suffix taxid.tsv. These have the following columns:

  1. TaxID: The Taxonomy ID that reads are classified to based on their protein matches.

  2. Taxon: Taxonomy Name of the Taxonomy ID.

  3. Rank: Taxonomic rank of the TaxID/Taxon.

  4. RootTaxon: Root Kingdom/Taxon of the classification. The term Kingdom is not used because microbial eukaryotes have ‘Eukaryota’ as their kingdom, so a separate term is needed to distinguish the root taxon.

  5. Accession Number: Accession numbers of the protein matches of the reads classified under the TaxID.

  6. Number_of_reads: Number of reads that matched proteins under that TaxID (note that Kaiju uses protein matches for taxonomy classification).

  7. Number_of_reads(scaled): Number of reads divided by total classified reads, multiplied by 100 (equivalent to percent %). Classified reads refer to any read that has a Taxonomy ID classification under Kaiju AND belongs to a user-specified taxon list of root taxon/kingdom. Default for this pipeline is kaiju-taxonlistEuk.tsv.

  8. Number_of_reads(sp_scaled): Refers to reads of species level only, divided by total number of species level reads, multiplied by 100.

  9. Number_of_uniquely_matching_reads: Kaiju may match a read to more than one accession number (even if these accession numbers all belong to a single taxon). This column counts the reads that matched only one protein for that TaxID.

  10. Number_of_uniquely_matching_reads(scaled): Species level only. Sum of all uniquely matching reads for the species level, divided by the total number of uniquely matching reads for species level, multiplied by 100.

Note

Taxonomy ID-based tables have the following last rows:

  1. Classified_Total: Sum of all reads that have been classified by Kaiju AND belong to the user-specified taxon list of root taxon/kingdom.

  2. Others: All reads that have a Kaiju classification but do not belong to the user-specified taxon list of root taxon/kingdom. For the default taxon list this could be ‘cellular organisms’ or ‘root’ (among others).

  3. Unclassified: Sum of all reads that Kaiju cannot classify.
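The difference between the scaled and sp_scaled columns can be illustrated with a small sketch. The function name, the row layout, the rank label `"species"`, the use of `None` for non-species rows, and the example counts are assumptions made for illustration. Only the two formulas come from the column descriptions above.

```python
def add_scaled_columns(rows, classified_total):
    """Add scaled and sp_scaled values to per-TaxID rows.

    rows: list of dicts with 'TaxID', 'Rank', and 'Number_of_reads'.
    classified_total: the Classified_Total value for the sample.
    """
    species_total = sum(
        r["Number_of_reads"] for r in rows if r["Rank"] == "species"
    )
    out = []
    for r in rows:
        # scaled: percent of all classified reads.
        scaled = 100.0 * r["Number_of_reads"] / classified_total
        # sp_scaled: percent of species-level reads, species rows only.
        sp_scaled = (
            100.0 * r["Number_of_reads"] / species_total
            if r["Rank"] == "species"
            else None  # assumption: left empty for other ranks
        )
        out.append({**r, "scaled": scaled, "sp_scaled": sp_scaled})
    return out


# Hypothetical sample with 100 classified reads, 80 of them at species level.
rows = [
    {"TaxID": "TaxID 1", "Rank": "species", "Number_of_reads": 60},
    {"TaxID": "TaxID 2", "Rank": "species", "Number_of_reads": 20},
    {"TaxID": "TaxID 3", "Rank": "genus", "Number_of_reads": 20},
]
result = add_scaled_columns(rows, classified_total=100)
for r in result:
    print(r["TaxID"], r["scaled"], r["sp_scaled"])
```

Here TaxID 1 is 60% of all classified reads but 75% of the species-level reads, since the genus-level reads are left out of the second denominator.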

Directory: source/protein/

These are accession number-based tables per sample with the suffix protacc.tsv. They have the following columns:

  1. ProteinAccession: Protein Accession Number. Note that there could be “identical” accession numbers but under different ranks.

  2. Number_of_reads: Number of reads that matched to the accession number. If a read has matched to multiple accession numbers, a count will be given to each accession number.

  3. Number_of_reads(scaled): Number of reads divided by total classified reads, multiplied by 100. This is equivalent to scaling each read first and then adding the scaled values up to the protein total. Since some reads match more than one protein, these values will not add up to 100 in total.

  4. Number_of_reads(sp_scaled): Refers to reads of species level only, divided by total number of species level reads, multiplied by 100.

  5. Proportional_Reads: Each read is divided by the number of protein accessions it matched. The resulting fraction is added to the “proportional” read count of each of those accession numbers.

  6. Proportional_Reads(scaled): Proportional Reads divided by total classified reads, multiplied by 100 (all Proportional Reads should add up to total number of classified reads).

  7. Proportional_Reads(sp_scaled): Refers to proportional reads of species level only, divided by total number of species level reads, multiplied by 100.

  8. Number_of_uniquely_matching_reads: Number of reads that matched to that accession number uniquely.

  9. Number_of_uniquely_matching_reads(scaled): Species level only. Sum of all uniquely matching reads under the species level, divided by the total number of uniquely matching reads for species level, multiplied by 100.

  10. Associated_TaxIDs: The taxonomy IDs the accession number is associated with.

  11. Associated_TaxNames: Taxonomy names of IDs in column (10).

  12. RootTaxon: Root Kingdom/Taxon of the Taxonomy ID.

  13. Rank: Rank of the tax ID classification of the read based on the accession numbers.
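The three counting schemes in the columns above (full counts, proportional counts, and uniquely matching reads) can be compared in a short sketch. The function name, the input layout, and the read and accession names are hypothetical. The counting rules follow the column descriptions.

```python
from collections import defaultdict


def count_reads(read_matches):
    """Return (Number_of_reads, Proportional_Reads, unique reads) per accession.

    read_matches: read ID -> list of protein accessions the read matched.
    """
    n_reads = defaultdict(int)
    proportional = defaultdict(float)
    unique = defaultdict(int)
    for accessions in read_matches.values():
        share = 1.0 / len(accessions)
        for acc in accessions:
            n_reads[acc] += 1           # every matched accession gets a full count
            proportional[acc] += share  # the read is split evenly among matches
        if len(accessions) == 1:
            unique[accessions[0]] += 1  # read matched exactly one accession
    return dict(n_reads), dict(proportional), dict(unique)


# Four hypothetical reads; two of them match two accessions each.
read_matches = {
    "read1": ["P1"],
    "read2": ["P1", "P2"],
    "read3": ["P2", "P3"],
    "read4": ["P3"],
}
n_reads, proportional, unique = count_reads(read_matches)
print(sum(n_reads.values()), sum(proportional.values()), unique)
```

The full counts add up to 6 for only 4 reads, which is why the scaled full counts exceed 100 in total, while the proportional counts add up to exactly 4.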

Directory: source/go/

Result files in this directory are obtained by parsing the species-level proteins from the tables in Directory: source/protein/. For each sample or group/condition, and separately for each taxon, two tables are generated: one with the Gene Ontology (GO) annotations of the proteins and one ‘none’ file with information about protein accession numbers that have no GO annotations. In addition, there is a directory that contains these tables for all taxa indicated in the kaiju-taxonlistEuk.tsv file combined.

Another directory designated as ALL contains all samples combined as one group. It contains the same tables as above for each taxon indicated in the config file (e.g. config.yaml), as well as for all taxa indicated in kaiju-taxonlistEuk.tsv combined.

Note

Only proteins of taxonomy IDs that passed the cutoff specified in the config file are included in these tables. Scaled reads are still scaled against the total classified reads, including filtered species (see items 6 and 7 in Directory: source/protein/).

GO-based tables

The GO-based tables per sample/group have the following columns:

  1. GO_ID: The Gene Ontology ID Number.

  2. Description: Description/Name of the GO_ID.

  3. GO_NameSpace: One of biological_process, molecular_function, and cellular_component.

  4. Proportional_reads: Sum of the proportional reads of all protein accessions annotated with the specific GO_ID. If within groups, the average of the proportional reads matching to a protein accession number among samples in a group is used.

  5. Proportional_reads(scaled): Sum of the scaled proportional reads of all protein accession numbers annotated with the GO_ID. If within groups, the average of the scaled proportional reads matching to a protein accession number among samples in a group is used.

  6. Percent: Percent of (scaled) proportional reads from (5) covering the GO_ID, calculated as described in Description of the Process to Associate Function.

  7. Proportional_reads(sp_scaled): Sum of the proportional reads (scaled at the species level) of all protein accession numbers annotated with the GO_ID. Note that this only covers reads classified at the species level. If within groups, the average of the species scaled reads matching to a protein accession number among samples in a group is used.

  8. sp_Percent: Percent of (species-scaled) proportional reads from (7) covering the GO_ID, calculated as described in Description of the Process to Associate Function.

  9. GO_Depth: Depth of the GO_ID in the GO directed acyclic graph (DAG).

  10. Accession_Associated: Protein accession numbers annotated with the GO_ID.

  11. Number_of_proteins: Number of protein accessions annotated with the GO_ID (count of accessions in 10).

  12. Associated_TaxIDs: TaxIDs of the proteins annotated with the GO_ID.

  13. Number_of_TaxIDs: Number of the TaxIDs of the proteins annotated with the GO_ID (count of TaxIDs in 12).

  14. group or sample: The source group or sample for the annotated proteins.

“none” tables

The none tables contain the following columns:

  1. Accession: The protein accession number without a GO annotation.

  2. Reason: The reason why there is no annotation. The value ‘Accession does not have GO annotations’ means that the accession number is not in the sqlite database and therefore has no GO annotations.

  3. Associated_TaxIDs: TaxIDs of the protein accession.

  4. Number_of_Associated_TaxIDs: Number of TaxIDs in (3).

  5. Proportional_reads: proportional read counts of the accession number (for groups, average of the read counts of the accession number among samples in the same group).

  6. Proportional_reads(scaled): scaled proportional read counts of the accession number (for groups, average of the scaled read counts of the accession number among samples in the same group).

  7. Proportional_reads(sp_scaled): proportional read counts of the accession number scaled to the number of reads classified at the species level (for groups, average of the scaled read counts of the accession number among samples in the same group). Note that this only refers to reads classified at species level.

  8. group or sample: The group or sample from which the proteins in (1) come.


Host

Results Directory Structure

analyses/results
    └── host
         ├── expression
         │    └── fc (feature counts summary directory)
         └── contrasts
              └── {contrasts}

Result files in host/expression

Files: all.counts.tsv or all.tpm.tsv

These files summarize the host gene expression on a count (all.counts.tsv) or tags-per-million (all.tpm.tsv) basis. The tab-separated tables have a simple structure:

gene   | Sample 1 [1] | Sample 2 | ... | Sample N
gene 1 | ### [2]      | ###      | ### | ###
gene 2 | ###          | ###      | ### | ###
gene N | ###          | ###      | ### | ###

[1] Samples: refers to individual samples.

[2] Read Counts: raw read counts (all.counts.tsv) or tags-per-million values (all.tpm.tsv).


Differential Gene Expression Analysis (DGEA)

File: contrasts/{contrasts}/diffexp.tsv.gz

DGEA results per host gene are found in file contrasts/{contrasts}/diffexp.tsv.gz, where {contrasts} is the contrast indicated in the config file.

This file contains the result of DGEA carried out using edgeR’s exact test with adjusted p-values using the FDR method.


Gene Set Enrichment Analysis (GSEA)

These are the outputs of GSEA carried out using clusterProfiler.

File: contrasts/{contrasts}/diffexp.gsea-up.tsv.gz

These are gene sets with positive enrichment scores and enriched in the test group (i.e. contrast 2 in the config file).

File: contrasts/{contrasts}/diffexp.gsea-dn.tsv.gz

These are gene sets with negative enrichment scores and enriched in the baseline group (i.e. contrast 1 in the config file).


Host - Microbiome

analyses/results
    └── host - microbiome
         └── correlation
              └── {contrasts}

Result files in host-microbiome/correlation

Note

These files are only included if DGEA is run for the host. The correlation is computed for the genes and species found to be differentially expressed or abundant, and the results are provided per contrast.

File: {contrasts}/cor.deg-tax.tsv.gz

Correlation table between the top DEGs and the top DA species for contrast {contrasts}, computed with the spearmanr function of Python’s SciPy.

geneid   | taxid   | species   | rho | pvalue | siglevel
geneid 1 | taxid 1 | species 1 | ### | ###    | significance [1]
geneid 1 | taxid 2 | species 2 | ### | ###    |
geneid 2 | taxid 1 | species 1 | ### | ###    |
geneid 2 | taxid 2 | species 2 | ### | ###    |
geneid N | taxid N | species N | ### | ###    |

[1] siglevel:

  • * is p-value < 0.05

  • ** is p-value < 0.01

  • *** is p-value < 0.001
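The star notation can be expressed as a small mapping function. This is a sketch, not the pipeline's code: the function name is hypothetical, and returning an empty string for p-values of 0.05 or more is an assumption, since the text does not say what the column holds in that case. In practice the p-value would be the one returned alongside rho by `scipy.stats.spearmanr`.

```python
def siglevel(pvalue):
    """Map a p-value to the star notation used in the siglevel column."""
    if pvalue < 0.001:
        return "***"
    if pvalue < 0.01:
        return "**"
    if pvalue < 0.05:
        return "*"
    return ""  # assumption: no marker when the p-value is 0.05 or more


for p in (0.0005, 0.005, 0.03, 0.2):
    print(p, repr(siglevel(p)))
```

The thresholds are checked from the strictest to the most lenient, so each p-value receives the highest level it qualifies for.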

File: {contrasts}/cor.deg-tax.matrix.tsv

Correlation matrix between the top DEGs and the top DA species for contrast {contrasts}. Note that all pairwise correlations are included, NOT only those with significant values.

ID     | Taxon 1 | Taxon 2 | ... | Taxon N
gene 1 | ### [1] | ###     | ### | ###
gene 2 | ###     | ###     | ### | ###
gene N | ###     | ###     | ### | ###

[1] Correlation: rho values from the cor.deg-tax.tsv.gz table.
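The relationship between the long table and the matrix can be sketched as a pivot. This is an illustration under assumptions: the function name, the record layout, the use of `taxid` for the column labels, and the rho values are all made up. The pipeline may label the taxon columns differently.

```python
def to_matrix(rows):
    """Pivot (geneid, taxid, rho) records into a gene-by-taxon matrix."""
    genes = sorted({r["geneid"] for r in rows})
    taxa = sorted({r["taxid"] for r in rows})
    rho = {(r["geneid"], r["taxid"]): r["rho"] for r in rows}
    header = ["ID"] + taxa
    # Missing gene/taxon pairs become None.
    body = [[g] + [rho.get((g, t)) for t in taxa] for g in genes]
    return [header] + body


# Hypothetical long-format records with made-up rho values.
rows = [
    {"geneid": "gene 1", "taxid": "taxid 1", "rho": 0.8},
    {"geneid": "gene 1", "taxid": "taxid 2", "rho": -0.3},
    {"geneid": "gene 2", "taxid": "taxid 1", "rho": 0.1},
    {"geneid": "gene 2", "taxid": "taxid 2", "rho": 0.5},
]
matrix = to_matrix(rows)
for line in matrix:
    print(line)
```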