The first step is to use iinit in the terminal to connect to the Data Store so you can access the fastq files. Directions here
In summary, use the following when prompted:
Host name (DNS) of the server to connect to: data.cyverse.org
Port number: 1247
iRODS user name: (your CyVerse username)
iRODS zone: iplant
iRODS password: (your CyVerse password)
No installations are needed in the VICE app; the packages come pre-installed. But the libraries still need to be loaded:
library(dada2)
library(DECIPHER)
(These calls print the usual startup output: dada2 loads Rcpp, and DECIPHER loads Biostrings, BiocGenerics, S4Vectors, IRanges, XVector, and RSQLite, along with the standard masking messages, all of which can be ignored.)
cd raw_data
ls *_1.fastq.gz | cut -f 1 -d "_" > ../samples
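The R objects samples, forward_reads, and reverse_reads are used below, but their creation isn't shown in the notebook. A minimal sketch of how they could be built, assuming the R working directory is the home directory (as in the setwd("~/") call further down); the paths are hypothetical:
# read in the sample names written out by the shell command above (hypothetical path)
samples <- scan("eukaryote_amplicons/samples", what = "character")
# build vectors of the raw fastq file paths (hypothetical paths)
forward_reads <- paste0("eukaryote_amplicons/raw_data/", samples, "_1.fastq.gz")
reverse_reads <- paste0("eukaryote_amplicons/raw_data/", samples, "_2.fastq.gz")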
plotQualityProfile(forward_reads[1:5])
(ggplot2 warns that a y scale is already present and will be replaced; this warning appears after every plotQualityProfile call and can be ignored.)
plotQualityProfile(reverse_reads[1:5])
From the above you can see the reads are ~300 bp and the quality is OK, with the reverse (R) reads being poorer than the forward (F) reads.
The sequencing facility told me that their sequencing protocol is set up so that primers do not get sequenced. Still, check for and remove possible primers.
Primers from Stoeck et al. 2010: TAReuk454FWD1 (5'- CCAGCASCYGCGGTAATTCC -3') and TAReukREV3 (5'- ACTTTCGTTCTTGATYRA -3').
(Also found a useful website with eukaryotic primers from the PR2 people.)
Cut the F primer and the reverse complement of the R primer from the F reads with the -a option; cut the R primer and the reverse complement of the F primer from the R reads with the -A option.
Set the minimum length. I tried setting it to about 10% smaller than the amplicon length (~260 bp), but even after going down to 200 bp it was throwing out most reads. Alternatively, with no minimum length set, cutadapt wrote empty reads to the fastq file, which led to errors downstream. So I compromised on a minimum length of 150.
cd ..
mkdir trimmed_fastq
cd raw_data
# Run in loop
for sample in $(cat ../samples)
do
echo "On sample: $sample"
cutadapt -a ^CCAGCASCYGCGGTAATTCC...TYRATCAAGAACGAAAGT \
-A ^ACTTTCGTTCTTGATYRA...GGAATTACCGCRGSTGCTGG \
--discard-untrimmed \
-m 150 \
-o ../trimmed_fastq/${sample}_1_trimmed.fastq.gz -p ../trimmed_fastq/${sample}_2_trimmed.fastq.gz \
${sample}_1.fastq.gz ${sample}_2.fastq.gz \
>> ../trimmed_fastq/cutadapt_primer_trimming_stats.txt 2>&1
done
Check output stats
paste ../samples <(grep "passing" ../trimmed_fastq/cutadapt_primer_trimming_stats.txt | cut -f3 -d "(" | tr -d ")") <(grep "filtered" ../trimmed_fastq/cutadapt_primer_trimming_stats.txt | cut -f3 -d "(" | tr -d ")")
Retained ~30-90% of reads in most cases
setwd("~/")
list.files() # make sure what we think is here is actually here
Now let’s take a look at the trimmed reads.
forward_reads_trimmed <- paste0("eukaryote_amplicons/trimmed_fastq/", samples, "_1_trimmed.fastq.gz")
reverse_reads_trimmed <- paste0("eukaryote_amplicons/trimmed_fastq/", samples, "_2_trimmed.fastq.gz")
# And plot
plotQualityProfile(forward_reads_trimmed[1:5])
plotQualityProfile(reverse_reads_trimmed[1:5])
Comparing the above to the pre-trimmed reads, they look very similar, but trimming got rid of the short messiness at the beginning of the reads (the primer). I can't see much change at the end of the reads, though I know they were trimmed based on the output file; the quality there is too low to tell.
Make a directory for filtered reads
cd ..
mkdir filtered_fastq
Make variables containing the file names for the filtered forward and reverse reads we are about to create.
filtered_forward_reads <- paste0("eukaryote_amplicons/filtered_fastq/",samples, "_1_filtered.fastq.gz")
filtered_reverse_reads <- paste0("eukaryote_amplicons/filtered_fastq/",samples, "_2_filtered.fastq.gz")
Based on how the quality plots look, determine how much to cut from each side: trim the F reads at 275 and the R reads at 200. This step also filters out low-quality reads (maxEE set to 2 for both F and R reads, as in the call below). The rm.phix=TRUE option removes any leftover PhiX genomic DNA (which gets added as a standard during sequencing). Pick a minimum length somewhat shorter than the shortest trimmed length (here 140, vs. the 200 bp R reads). I also set truncQ=2, which truncates each read at the first base with a quality score of 2 or lower. Multithreading for this function does not work well (even according to the documentation), so I skipped it. This takes a while to run.
filtered_out <- filterAndTrim(forward_reads_trimmed, filtered_forward_reads,
reverse_reads_trimmed, filtered_reverse_reads, maxEE=c(2,2),
rm.phix=TRUE, minLen=140, truncLen=c(275,200), truncQ = 2, maxN=0)
Check how many reads made it through the filtering.
filtered_out
reads.in reads.out
SRR3735256_1_trimmed.fastq.gz 1694740 672650
SRR3735257_1_trimmed.fastq.gz 939328 379858
SRR3735258_1_trimmed.fastq.gz 848296 357038
SRR3735259_1_trimmed.fastq.gz 912372 376890
SRR3735260_1_trimmed.fastq.gz 1268530 546022
SRR3735261_1_trimmed.fastq.gz 1211714 539356
SRR3735262_1_trimmed.fastq.gz 1502614 678796
SRR3735263_1_trimmed.fastq.gz 1618546 729742
SRR3735264_1_trimmed.fastq.gz 757200 339996
SRR3735265_1_trimmed.fastq.gz 64796 26770
SRR3735266_1_trimmed.fastq.gz 502478 213672
SRR3735268_1_trimmed.fastq.gz 292884 129708
SRR3735269_1_trimmed.fastq.gz 884870 312384
SRR3735270_1_trimmed.fastq.gz 494804 222178
SRR3735271_1_trimmed.fastq.gz 258638 106318
SRR3735272_1_trimmed.fastq.gz 96490 40056
SRR3735273_1_trimmed.fastq.gz 396728 155174
SRR3735274_1_trimmed.fastq.gz 299020 122274
SRR3735275_1_trimmed.fastq.gz 744 108
SRR3735276_1_trimmed.fastq.gz 717026 286730
SRR3735277_1_trimmed.fastq.gz 557174 242266
SRR3735278_1_trimmed.fastq.gz 255692 113588
SRR3735279_1_trimmed.fastq.gz 495828 233930
SRR3735280_1_trimmed.fastq.gz 930416 397086
SRR3735281_1_trimmed.fastq.gz 376996 180378
SRR3735283_1_trimmed.fastq.gz 69796 28740
SRR3735284_1_trimmed.fastq.gz 152264 60258
SRR3735285_1_trimmed.fastq.gz 54964 21634
SRR3735286_1_trimmed.fastq.gz 9232 3298
SRR3735287_1_trimmed.fastq.gz 2006172 846262
SRR3735288_1_trimmed.fastq.gz 477210 184942
SRR3735289_1_trimmed.fastq.gz 2982 102
SRR3735290_1_trimmed.fastq.gz 257834 116018
SRR3735292_1_trimmed.fastq.gz 1271286 589140
SRR3735293_1_trimmed.fastq.gz 1445290 604964
SRR3735294_1_trimmed.fastq.gz 393670 167378
SRR3735295_1_trimmed.fastq.gz 334082 133454
SRR3735296_1_trimmed.fastq.gz 669444 309080
SRR3735297_1_trimmed.fastq.gz 1030 76
SRR3735298_1_trimmed.fastq.gz 263486 110028
SRR3735299_1_trimmed.fastq.gz 2056 128
SRR3735300_1_trimmed.fastq.gz 1383836 557764
SRR3735301_1_trimmed.fastq.gz 1094722 498002
SRR3735302_1_trimmed.fastq.gz 1030404 474520
SRR3735303_1_trimmed.fastq.gz 303198 129620
SRR3735304_1_trimmed.fastq.gz 648 94
SRR3735305_1_trimmed.fastq.gz 824464 319248
Check out the quality profiles again.
plotQualityProfile(filtered_forward_reads[1:5])
plotQualityProfile(filtered_reverse_reads[1:5])
They look much better.
Save workspace up to this point.
save.image(file = "Cariaco_euk18S_dada2_upto_filterfastq.RData")
And back up in data store just in case
iput -f Cariaco_euk18S_dada2_upto_filterfastq.RData
iput -f Cariaco_Euk_PR2_DECIPHER_annotation.Rmd
iput -f samples
iput -rf filtered_fastq
iput -rf trimmed_fastq
# load("Cariaco_euk18S_dada2_upto_filterfastq.RData")
Next, DADA2 tries to learn the error signature of our dataset. This step takes a while
err_forward_reads <- learnErrors(filtered_forward_reads, multithread=TRUE)
err_reverse_reads <- learnErrors(filtered_reverse_reads, multithread=TRUE)
Plot the error profiles
plotErrors(err_forward_reads, nominalQ=TRUE)
plotErrors(err_reverse_reads, nominalQ=TRUE)
The creators of DADA2 describe this here. The profiles are the error rates for each possible transition in the read (A->C, A->G, etc.). Generally, in the above plots you want to see that the black dots (observed error rates for each quality score) match well with the black lines (the estimated error rates). The red line is what is expected based on the quality score alone.
Back up again since the step above takes a while. Save the workspace up to this point.
save.image(file = "Cariaco_euk18S_dada2_upto_errorprofile.RData")
And back up in data store just in case
iput -f Cariaco_euk18S_dada2_upto_errorprofile.RData
iput -f Cariaco_Euk_PR2_DECIPHER_annotation.Rmd
Use the dada command to infer ASVs. We are going to use the pooling option "pseudo", which is described here. This step also takes a while.
#setwd("20200710backup/") # only used this because had to resetup analysis
dada_forward <- dada(filtered_forward_reads, err=err_forward_reads, pool="pseudo", multithread=TRUE)
dada_reverse <- dada(filtered_reverse_reads, err=err_reverse_reads, pool="pseudo", multithread=TRUE)
# and save in this code block in case I get logged out again!
save.image(file = "Cariaco_euk18S_dada2_upto_inferasv.RData")
And back up in data store just in case
iput Cariaco_euk18S_dada2_upto_inferasv.RData
iput -f Cariaco_Euk_PR2_DECIPHER_annotation.Rmd
DADA2 will merge reads wherever the overlap is identical between the F and R reads. I trimmed the F reads to 275 and the R reads to 200. The full amplicon size (after removing primers) should be 963 - 583 = 380. So the F reads should cover positions ~583 to ~858 (583 + 275), and the R reads should cover positions ~963 back to ~763 (963 - 200). That leaves an overlap from position ~763 to ~858, or about 95 bp. Set the minimum overlap to about half of this.
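As a quick sanity check on that arithmetic (positions taken from the paragraph above):
f_end <- 583 + 275   # forward reads run from ~583 to ~858
r_start <- 963 - 200 # reverse reads run from ~963 back to ~763
f_end - r_start      # ~95 bp of expected overlap; minOverlap=50 is about half of that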
#setwd("20200710backup/") # only used this because had to resetup analysis
merged_amplicons <- mergePairs(dada_forward, filtered_forward_reads, dada_reverse, filtered_reverse_reads, trimOverhang=TRUE, minOverlap=50, verbose = TRUE)
Back up again
save.image(file = "Cariaco_euk18S_upto_merge.RData")
iput Cariaco_euk18S_upto_merge.RData
iput -f Cariaco_Euk_PR2_DECIPHER_annotation.Rmd
Load from Data Store
load("eukaryote_amplicons/Cariaco_euk18S_upto_merge.RData")
names(merged_amplicons)
[1] "SRR3735256" "SRR3735257" "SRR3735258" "SRR3735259" "SRR3735260" "SRR3735261" "SRR3735262"
[8] "SRR3735263" "SRR3735264" "SRR3735265" "SRR3735266" "SRR3735268" "SRR3735269" "SRR3735270"
[15] "SRR3735271" "SRR3735272" "SRR3735273" "SRR3735274" "SRR3735275" "SRR3735276" "SRR3735277"
[22] "SRR3735278" "SRR3735279" "SRR3735280" "SRR3735281" "SRR3735283" "SRR3735284" "SRR3735285"
[29] "SRR3735286" "SRR3735287" "SRR3735288" "SRR3735289" "SRR3735290" "SRR3735292" "SRR3735293"
[36] "SRR3735294" "SRR3735295" "SRR3735296" "SRR3735297" "SRR3735298" "SRR3735299" "SRR3735300"
[43] "SRR3735301" "SRR3735302" "SRR3735303" "SRR3735304" "SRR3735305"
# Initially these names are the full file names, with fastq.gz included. Change them to just the sample names
names(merged_amplicons) <- samples
# Check some other things
length(merged_amplicons) # 47 elements in this list, one for each of our samples
[1] 47
class(merged_amplicons$SRR3735256) # each element of the list is a dataframe that can be accessed and manipulated like any ordinary dataframe
[1] "data.frame"
names(merged_amplicons$SRR3735256) # the names() function on a dataframe gives you the column names
[1] "sequence" "abundance" "forward" "reverse" "nmatch" "nmismatch" "nindel" "prefer"
[9] "accept"
# "sequence" "abundance" "forward" "reverse" "nmatch" "nmismatch" "nindel" "prefer" "accept"
seqtab <- makeSequenceTable(merged_amplicons)
class(seqtab) # matrix
[1] "matrix"
dim(seqtab) # 47 samples, 78182 unique ASVs
[1] 47 78182
Backup notebook
iput -f Cariaco_Euk_PR2_DECIPHER_annotation_20200827.Rmd
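The chimera-removal step that creates seqtab.nochim isn't shown in the notebook; presumably it was something like the standard DADA2 call sketched below (the method and options are assumptions):
# remove chimeras by consensus across samples (assumed settings)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)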
# though we have a lot of unique sequences, we don't know how much of the total abundance they account for; this is one quick way to look at that
sum(seqtab.nochim)/sum(seqtab)
[1] 0.8092498
So chimeras accounted for roughly 19% of the merged reads in terms of abundance.
Back up again since the step above takes a while.
save.image(file = "Cariaco_euk18S_upto_chimera.RData")
iput -f Cariaco_Euk_PR2_DECIPHER_annotation_20200827_b.Rmd
iput Cariaco_euk18S_upto_chimera.RData
load("eukaryote_amplicons/Cariaco_euk18S_upto_chimera.RData")
# define a little function that totals the reads in a dada2 object
getN <- function(x) sum(getUniques(x))
# making a little table
summary_tab <- data.frame(row.names=samples, dada2_input=filtered_out[,1],
filtered=filtered_out[,2], dada_f=sapply(dada_forward, getN),
dada_r=sapply(dada_reverse, getN), merged=sapply(merged_amplicons, getN),
nonchim=rowSums(seqtab.nochim),
final_perc_reads_retained=round(rowSums(seqtab.nochim)/filtered_out[,1]*100, 1))
summary_tab
Retained about 20-40% of the input reads. Most were lost during the quality-filtering step.
Use the PR2 database (version 4.12.0) from here as the input fasta for the assignTaxonomy function. I put this in the Data Store (instead of downloading it here in the notebook, because git tools weren't installed here). The syntax needs some small changes compared to using the Silva databases; the DADA2 developers describe those here.
taxa <- assignTaxonomy(seqtab.nochim, "eukaryote_amplicons/pr2_version_4.12.0_18S_dada2.fasta.gz", taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"), multithread=TRUE)
Check the assignments.
taxa.print <- taxa # display copy of taxa (assumed; this assignment isn't shown in the notebook)
rownames(taxa.print) <- NULL
head(taxa.print)
Kingdom Supergroup Division Class Order
[1,] "Eukaryota" "Alveolata" "Dinoflagellata" "Dinophyceae" "Gymnodiniales"
[2,] "Eukaryota" "Rhizaria" "Cercozoa" "Filosa-Thecofilosea" "Cryomonadida"
[3,] "Eukaryota" "Alveolata" "Dinoflagellata" "Syndiniales" "Dino-Group-I"
[4,] "Eukaryota" "Alveolata" "Dinoflagellata" "Syndiniales" "Dino-Group-II"
[5,] "Eukaryota" "Alveolata" "Dinoflagellata" "Syndiniales" "Dino-Group-II"
[6,] "Eukaryota" "Stramenopiles" "Ochrophyta" "Bacillariophyta" "Bacillariophyta_X"
Family Genus Species
[1,] "Gymnodiniaceae" "Gymnodinium" "Gymnodinium_sp."
[2,] "Cryomonadida_X" "Cercozoa-01" "Cercozoa-01_sp."
[3,] "Dino-Group-I-Clade-1" "Dino-Group-I-Clade-1_X" "Dino-Group-I-Clade-1_X_sp."
[4,] "Dino-Group-II_X" "Dino-Group-II_XX" "Dino-Group-II_XX_sp."
[5,] "Dino-Group-II-Clade-14" "Dino-Group-II-Clade-14_X" "Dino-Group-II-Clade-14_X_sp."
[6,] "Polar-centric-Mediophyceae" "Chaetoceros" "Chaetoceros_socialis_debilis"
# giving our seq headers more manageable names (ASV_1, ASV_2...)
asv_seqs <- colnames(seqtab.nochim)
asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")
for (i in 1:dim(seqtab.nochim)[2]) {
asv_headers[i] <- paste(">ASV", i, sep="_")
}
# making and writing out a fasta of our final ASV seqs:
asv_fasta <- c(rbind(asv_headers, asv_seqs))
write(asv_fasta, "ASVs.fa")
# count table:
asv_tab <- t(seqtab.nochim)
row.names(asv_tab) <- sub(">", "", asv_headers)
write.table(asv_tab, "ASVs_counts.tsv", sep="\t", quote=F, col.names=NA)
# Taxonomy table
asv_tax <- taxa
row.names(asv_tax) <- sub(">", "", asv_headers)
write.table(asv_tax, "ASVs_taxonomy.tsv", sep="\t", quote=F, col.names=NA)
Back up and save to the Data Store. Save the workspace up to this point.
save.image(file = "Cariaco_euk18S_final.RData")
And back up in data store just in case
iput Cariaco_euk18S_final.RData
iput -f Cariaco_Euk_PR2_DECIPHER_annotation.Rmd
iput ASVs.fa
iput ASVs_counts.tsv
iput ASVs_taxonomy.tsv