Link to notebook
Link to GitHub repo
library(dada2)
library(DECIPHER)
library(phyloseq)
library(taxonomizr)
library(tidyverse)  # for read_table2, unite, filter, anti_join, str_split used below
(Run in bash Terminal)
cd Raw_data
ls *R1_001.fastq.gz | cut -f 1-2 -d "_" > ../samples
## import sample names as R variable
samples <- scan("samples", what="character")
Read 55 items
# make a variable holding the file names of all the forward read fastq files. These are in a subdirectory, so I am also adding the subdirectory name to each file name
forward_reads <- paste0("Raw_data/", samples, "_L001_R1_001.fastq.gz")
# and one with the reverse
reverse_reads <- paste0("Raw_data/", samples, "_L001_R2_001.fastq.gz")
# And plot using a tool from dada2 (checking only 5 samples for plotting purposes)
plotQualityProfile(forward_reads[1:5])
From the above you can see the reads are 150 bp and the quality is good, with the reverse (R) reads being poorer than the forward (F) reads.
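To see this, the reverse reads can be plotted the same way:
plotQualityProfile(reverse_reads[1:5])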
Primers are the universal MiFish primers from Miya et al. 2015. They amplify a subregion of the mitochondrial 12S rRNA gene that is 163-185 bp long.
MiFish-U-F, 5’-3’: GTCGGTAAAACTCGTGCCAGC [rev comp: GCTGGCACGAGTTTTACCGAC]
MiFish-U-R, 5’-3’: CATAGTGGGGTATCTAATCCCAGTTTG [rev comp: CAAACTGGGATTAGATACCCCACTATG]
- Cut the F primer from the F reads with the -g option
- Cut the rev comp of the R primer from the F reads with the -a option
- Cut the R primer from the R reads with the -G option
- Cut the rev comp of the F primer from the R reads with the -A option
Run cutadapt to remove primers, using a minimum length of 100 bp (bash):
cd ..
mkdir trimmed_fastq
cd Raw_data
# Run in loop
for sample in $(cat ../samples)
do
    echo "On sample: $sample"
    cutadapt -g GTCGGTAAAACTCGTGCCAGC \
        -a CAAACTGGGATTAGATACCCCACTATG \
        -G CATAGTGGGGTATCTAATCCCAGTTTG \
        -A GCTGGCACGAGTTTTACCGAC \
        -m 100 \
        --discard-untrimmed \
        -o ../trimmed_fastq/${sample}_L001_R1_001_trimmed.fastq.gz -p ../trimmed_fastq/${sample}_L001_R2_001_trimmed.fastq.gz \
        ${sample}_L001_R1_001.fastq.gz ${sample}_L001_R2_001.fastq.gz \
        >> ../trimmed_fastq/cutadapt_primer_trimming_stats.txt 2>&1
done
Check the output: how many reads and bases were retained after cutadapt (bash)
paste ../samples <(grep "passing" ../trimmed_fastq/cutadapt_primer_trimming_stats.txt | cut -f3 -d "(" | tr -d ")") <(grep "filtered" ../trimmed_fastq/cutadapt_primer_trimming_stats.txt | cut -f3 -d "(" | tr -d ")")
Output (sample, % reads retained, % bp retained):
T1PosCon_S11 89.4% 80.4%
T1S10_S9 81.4% 76.6%
T1S11_S10 97.5% 83.1%
T1S1_S1 97.9% 83.2%
T1S2_S2 68.4% 68.9%
T1S3_S3 85.3% 78.9%
T1S5_S4 91.6% 81.1%
T1S6_S5 85.5% 78.4%
T1S7_S6 83.3% 78.1%
T1S8_S7 98.4% 83.1%
T1S9_S8 94.6% 82.3%
T2PosCon_S21 99.4% 83.5%
T2S10_S19 69.5% 71.7%
T2S11_S20 90.5% 80.7%
T2S1_S12 68.5% 70.6%
T2S2_S13 97.8% 83.1%
T2S3_S14 98.1% 83.2%
T2S4_S15 87.3% 79.8%
T2S5_S16 78.9% 75.9%
T2S6_S17 92.2% 81.3%
T2S9_S18 98.9% 83.5%
T3PosCon_S33 61.0% 67.4%
T3S10_S31 90.7% 80.8%
T3S11_S32 64.6% 69.7%
T3S1_S22 70.2% 72.8%
T3S2_S23 97.5% 82.3%
T3S3_S24 33.2% 45.0%
T3S4_S25 82.3% 77.7%
T3S5_S26 74.4% 74.4%
T3S6_S27 94.1% 82.0%
T3S7_S28 88.1% 79.7%
T3S8_S29 97.0% 82.7%
T3S9_S30 95.8% 82.6%
T4S10_S43 89.3% 80.2%
T4S11_S44 90.3% 80.2%
T4S1_S34 97.4% 83.0%
T4S2_S35 77.8% 76.2%
T4S3_S36 96.4% 82.6%
T4S4_S37 79.2% 74.6%
T4S5_S38 93.5% 79.5%
T4S6_S39 88.9% 79.8%
T4S7_S40 92.9% 81.6%
T4S8_S41 91.1% 81.0%
T4S9_S42 77.7% 76.2%
T5S10_S54 92.7% 81.3%
T5S11_S55 84.4% 78.4%
T5S1_S45 95.3% 82.1%
T5S2_S46 96.3% 82.2%
T5S3_S47 97.8% 83.0%
T5S4_S48 96.4% 82.4%
T5S5_S49 83.8% 78.2%
T5S6_S50 89.2% 79.8%
T5S7_S51 78.3% 75.1%
T5S8_S52 96.3% 82.3%
T5S9_S53 95.3% 82.2%
So cutadapt retained roughly 61-99% of reads per sample, with one low outlier (T3S3_S24, at ~33%). Looking at the cutadapt_primer_trimming_stats.txt output file, you can see the F and R primers were trimmed in almost all cases, and the rev comp of each primer was trimmed in many cases too.
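As an aside, the same summary lines can be pulled from within R (a quick sketch using the stats file written above):
# read the cutadapt log and show the per-sample retention lines
stats <- readLines("trimmed_fastq/cutadapt_primer_trimming_stats.txt")
grep("passing|filtered", stats, value = TRUE)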
Go back to R console and take a look at the trimmed reads.
forward_reads_trimmed <- paste0("trimmed_fastq/", samples, "_L001_R1_001_trimmed.fastq.gz")
reverse_reads_trimmed <- paste0("trimmed_fastq/", samples, "_L001_R2_001_trimmed.fastq.gz")
# And plot
plotQualityProfile(forward_reads_trimmed[1:5])
plotQualityProfile(reverse_reads_trimmed[1:5])
Comparing the above to the pre-trimmed reads, the profiles look very similar, since trimming removed only the short primer stretches and relatively few whole reads.
Make a directory for filtered reads (bash)
cd ..
mkdir filtered_fastq
Make variables containing the file names for the filtered forward and reverse reads that we are about to create.
filtered_forward_reads <- paste0("filtered_fastq/",samples, "_L001_R1_001_filtered.fastq.gz")
filtered_reverse_reads <- paste0("filtered_fastq/",samples, "_L001_R2_001_filtered.fastq.gz")
Based on how the quality plots look, determine how much to cut from each side: truncate the F reads at 130 bp and the R reads at 120 bp. I also want this step to drop low-quality reads, so set maxEE to 2 for both F and R reads (maxEE=c(2,2), the maximum number of expected errors allowed per read). The rm.phix=TRUE option removes any leftover PhiX genomic DNA (which gets added as a standard during sequencing). Pick a minimum length somewhat shorter than the shortest truncation length (here minLen=100, versus 120 for the R reads). truncQ=2 truncates each read at the first base with a quality score of 2 or less. Multithreading for this function does not work well (even according to the documentation), so I skipped it. This takes a while to run.
filtered_out <- filterAndTrim(forward_reads_trimmed, filtered_forward_reads,
reverse_reads_trimmed, filtered_reverse_reads, maxEE=c(2,2),
rm.phix=TRUE, minLen=100, truncLen=c(130,120), truncQ = 2, maxN=0)
Check how many reads made it through the filter in each sample.
filtered_out
reads.in reads.out
T1PosCon_S11_L001_R1_001_trimmed.fastq.gz 116280 110631
T1S10_S9_L001_R1_001_trimmed.fastq.gz 38054 35699
T1S11_S10_L001_R1_001_trimmed.fastq.gz 114528 106959
T1S1_S1_L001_R1_001_trimmed.fastq.gz 92897 86350
T1S2_S2_L001_R1_001_trimmed.fastq.gz 3246 2886
T1S3_S3_L001_R1_001_trimmed.fastq.gz 100412 95273
T1S5_S4_L001_R1_001_trimmed.fastq.gz 54866 51337
T1S6_S5_L001_R1_001_trimmed.fastq.gz 56588 53872
T1S7_S6_L001_R1_001_trimmed.fastq.gz 30703 1120
T1S8_S7_L001_R1_001_trimmed.fastq.gz 39544 1423
T1S9_S8_L001_R1_001_trimmed.fastq.gz 122522 115634
T2PosCon_S21_L001_R1_001_trimmed.fastq.gz 30602 28533
T2S10_S19_L001_R1_001_trimmed.fastq.gz 71843 66868
T2S11_S20_L001_R1_001_trimmed.fastq.gz 89790 50910
T2S1_S12_L001_R1_001_trimmed.fastq.gz 31282 28923
T2S2_S13_L001_R1_001_trimmed.fastq.gz 125120 115914
T2S3_S14_L001_R1_001_trimmed.fastq.gz 69969 65473
T2S4_S15_L001_R1_001_trimmed.fastq.gz 103877 95461
T2S5_S16_L001_R1_001_trimmed.fastq.gz 49491 45322
T2S6_S17_L001_R1_001_trimmed.fastq.gz 71976 65004
T2S9_S18_L001_R1_001_trimmed.fastq.gz 94811 87795
T3PosCon_S33_L001_R1_001_trimmed.fastq.gz 58777 55669
T3S10_S31_L001_R1_001_trimmed.fastq.gz 61472 49297
T3S11_S32_L001_R1_001_trimmed.fastq.gz 26605 1017
T3S1_S22_L001_R1_001_trimmed.fastq.gz 47808 40812
T3S2_S23_L001_R1_001_trimmed.fastq.gz 8852 7454
T3S3_S24_L001_R1_001_trimmed.fastq.gz 15729 11088
T3S4_S25_L001_R1_001_trimmed.fastq.gz 66149 47754
T3S5_S26_L001_R1_001_trimmed.fastq.gz 40310 19760
T3S6_S27_L001_R1_001_trimmed.fastq.gz 82355 75931
T3S7_S28_L001_R1_001_trimmed.fastq.gz 95225 86068
T3S8_S29_L001_R1_001_trimmed.fastq.gz 19744 16024
T3S9_S30_L001_R1_001_trimmed.fastq.gz 69446 53578
T4S10_S43_L001_R1_001_trimmed.fastq.gz 106101 96830
T4S11_S44_L001_R1_001_trimmed.fastq.gz 55349 51615
T4S1_S34_L001_R1_001_trimmed.fastq.gz 54676 43731
T4S2_S35_L001_R1_001_trimmed.fastq.gz 73007 66851
T4S3_S36_L001_R1_001_trimmed.fastq.gz 37373 34194
T4S4_S37_L001_R1_001_trimmed.fastq.gz 61890 47600
T4S5_S38_L001_R1_001_trimmed.fastq.gz 1860 1425
T4S6_S39_L001_R1_001_trimmed.fastq.gz 57411 52148
T4S7_S40_L001_R1_001_trimmed.fastq.gz 64167 56900
T4S8_S41_L001_R1_001_trimmed.fastq.gz 90353 85431
T4S9_S42_L001_R1_001_trimmed.fastq.gz 86928 80022
T5S10_S54_L001_R1_001_trimmed.fastq.gz 52707 38072
T5S11_S55_L001_R1_001_trimmed.fastq.gz 55188 46556
T5S1_S45_L001_R1_001_trimmed.fastq.gz 77878 73943
T5S2_S46_L001_R1_001_trimmed.fastq.gz 38058 34351
T5S3_S47_L001_R1_001_trimmed.fastq.gz 136287 128690
T5S4_S48_L001_R1_001_trimmed.fastq.gz 62588 57790
T5S5_S49_L001_R1_001_trimmed.fastq.gz 64785 60236
T5S6_S50_L001_R1_001_trimmed.fastq.gz 32802 29323
T5S7_S51_L001_R1_001_trimmed.fastq.gz 59603 51304
T5S8_S52_L001_R1_001_trimmed.fastq.gz 58354 53312
T5S9_S53_L001_R1_001_trimmed.fastq.gz 109768 100374
plotQualityProfile(filtered_forward_reads[1:5])
plotQualityProfile(filtered_reverse_reads[1:5])
These look better. Most samples retained the majority of their reads through quality filtering, though a few (e.g., T1S7_S6, T1S8_S7, T3S11_S32) lost almost everything.
Save workspace up to this point.
save.image(file = "upto_filterfastq.RData")
Run if you come back and need to reload dataset
load("upto_filterfastq.RData")
Next, DADA2 tries to learn the error signature of our dataset. This step takes a while
err_forward_reads <- learnErrors(filtered_forward_reads, multithread=TRUE)
err_reverse_reads <- learnErrors(filtered_reverse_reads, multithread=TRUE)
Plot the error profiles
plotErrors(err_forward_reads, nominalQ=TRUE)
Warning messages (repeated 8 times):
In readChar(file, size, TRUE) : truncating string with embedded nuls
plotErrors(err_reverse_reads, nominalQ=TRUE)
The creators of DADA2 describe this here. The profiles are the error rates for each possible transition in the read (A->C, A->G, etc). Generally in the above plots, you want to see that the black dots (observed error rates for each quality score) match well with the black lines (the estimated error rate). The red line is what is expected based on the quality score.
Back up again, since the step above takes a while. Save the workspace up to this point.
save.image(file = "upto_errorprofile.RData")
Use the dada command to infer ASVs. This step also takes a while.
Stoeckle et al. use the loess error model for error estimation, which is the default, so I didn’t set it here. They also set selfConsist=TRUE so that “the error-model was independently built for each sample,” and they turn pooling off (the default), saying that they “build an error model using a subset of the total reads from a sequencing run and provide this model to DADA2.” However, in their code they set the error model as err=inflateErr(tperr1,3), which uses an error matrix, tperr1, from a mock community. That matrix is used as an example in the dada2 documentation, but I don’t think it should be used for real samples; really they should have generated the model from their own data (as above). So I am keeping my own modifications here, not theirs: the error model is generated from the samples themselves, with pseudo-pooling across samples. According to the developers, “pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples.” The dada2 package offers two types of pooling: complete sample pooling (which is computationally expensive) and pseudo-pooling, which gives much of the benefit of pooling without the cost. This is further described at the developers’ site.
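For reference, the other pooling modes would look like this (not run here; shown only for comparison):
# dada(filtered_forward_reads, err=err_forward_reads, pool=TRUE, multithread=TRUE)   # full pooling: most sensitive, most expensive
# dada(filtered_forward_reads, err=err_forward_reads, pool=FALSE, multithread=TRUE)  # per-sample inference (the default)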
dada_forward <- dada(filtered_forward_reads, err=err_forward_reads, pool="pseudo", multithread=TRUE)
dada_reverse <- dada(filtered_reverse_reads, err=err_reverse_reads, pool="pseudo", multithread=TRUE)
Back up again, since the step above takes a while. Save the workspace up to this point.
save.image(file = "upto_inferasv.RData")
DADA2 will merge read pairs wherever the overlap between the F and R reads is identical. I truncated the F reads to 130 bp and the R reads to 120 bp (from the original 150 bp). The full amplicon (based on the primers) should be 163-185 bp; being conservative, assume a length of 185. That puts the F read at positions 1-130 and the R read at positions 66-185, leaving an overlap from position 66 to 130, i.e. 65 bp.
Since this is an estimate, leave a little wiggle room and set the minimum overlap to 30 bp. Also set trimOverhang to TRUE, which makes sure a read doesn’t extend past its opposite primer (which probably wouldn’t happen anyway due to trimming).
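The overlap arithmetic as a quick sanity check (variable names are my own):
max_amplicon <- 185   # longest expected MiFish amplicon
f_trunc <- 130        # forward truncation length
r_trunc <- 120        # reverse truncation length
f_trunc + r_trunc - max_amplicon   # minimum expected overlap: 65 bp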
merged_amplicons <- mergePairs(dada_forward, filtered_forward_reads, dada_reverse, filtered_reverse_reads, trimOverhang=TRUE, minOverlap=30, verbose = TRUE)
names(merged_amplicons)
[1] "T1PosCon_S11" "T1S10_S9" "T1S11_S10" "T1S1_S1" "T1S2_S2" "T1S3_S3" "T1S5_S4"
[8] "T1S6_S5" "T1S7_S6" "T1S8_S7" "T1S9_S8" "T2PosCon_S21" "T2S10_S19" "T2S11_S20"
[15] "T2S1_S12" "T2S2_S13" "T2S3_S14" "T2S4_S15" "T2S5_S16" "T2S6_S17" "T2S9_S18"
[22] "T3PosCon_S33" "T3S10_S31" "T3S11_S32" "T3S1_S22" "T3S2_S23" "T3S3_S24" "T3S4_S25"
[29] "T3S5_S26" "T3S6_S27" "T3S7_S28" "T3S8_S29" "T3S9_S30" "T4S10_S43" "T4S11_S44"
[36] "T4S1_S34" "T4S2_S35" "T4S3_S36" "T4S4_S37" "T4S5_S38" "T4S6_S39" "T4S7_S40"
[43] "T4S8_S41" "T4S9_S42" "T5S10_S54" "T5S11_S55" "T5S1_S45" "T5S2_S46" "T5S3_S47"
[50] "T5S4_S48" "T5S5_S49" "T5S6_S50" "T5S7_S51" "T5S8_S52" "T5S9_S53"
# Initially these names are the full file names, with `fastq.gz` in the name. Change them to just the sample names
names(merged_amplicons) <- samples
# Check some other things
length(merged_amplicons) # one for each of our samples
[1] 55
class(merged_amplicons$T1PosCon_S11) # each element of the list is a dataframe that can be accessed and manipulated like any ordinary dataframe
[1] "data.frame"
names(merged_amplicons$T1PosCon_S11) # the names() function on a dataframe gives you the column names
[1] "sequence" "abundance" "forward" "reverse" "nmatch" "nmismatch" "nindel" "prefer" "accept"
# "sequence" "abundance" "forward" "reverse" "nmatch" "nmismatch" "nindel" "prefer" "accept"
Back up again
save.image(file = "upto_merge.RData")
# make the ASV count table (samples x sequences)
seqtab <- makeSequenceTable(merged_amplicons)
class(seqtab) # matrix
[1] "matrix" "array"
dim(seqtab) # 55 samples, 3512 unique ASVs
[1] 55 3512
seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=TRUE)
Identified 3121 bimeras out of 3512 input sequences.
# though a lot of unique sequences were flagged as bimeras, we don't yet know how much abundance they held; this is one quick way to look at that
sum(seqtab.nochim)/sum(seqtab)
[1] 0.9457455
# good, we barely lost any in terms of abundance. That means the chimeras were very low abundance ASVs
dim(seqtab.nochim) # 55 samples, 391 unique ASVs remain
[1] 55 391
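The same abundance check can be made per sample (a quick sketch), to confirm no single sample lost much to chimeras:
round(rowSums(seqtab.nochim) / rowSums(seqtab), 3)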
Back up again, since the step above takes a while. Save the workspace up to this point.
save.image(file = "upto_chimera.RData")
# getN helper function from Happy Belly Bioinformatics
getN <- function(x) sum(getUniques(x))
# making a little table
summary_tab <- data.frame(row.names=samples, dada2_input=filtered_out[,1],
filtered=filtered_out[,2], dada_f=sapply(dada_forward, getN),
dada_r=sapply(dada_reverse, getN), merged=sapply(merged_amplicons, getN),
nonchim=rowSums(seqtab.nochim),
final_perc_reads_retained=round(rowSums(seqtab.nochim)/filtered_out[,1]*100, 1))
summary_tab
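It can also be handy to keep this read-tracking table on disk (a sketch; the file name is my own choice, and this assumes the results/ directory used below exists):
write.table(summary_tab, "results/read_count_tracking.tsv", sep = "\t", quote = FALSE, col.names = NA)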
In the end, we retained about 80-95% of the input reads after the filtering steps (except 3 samples that didn’t merge at all).
Next, follow some code from Stoeckle et al. for making a writeFasta function and exporting the “OTU” table and fasta file. (Note this is not really an “OTU” table, because these are ASVs.)
# Check the length distribution of the ASV sequences
table(nchar(colnames(seqtab.nochim)))
114 116 118 122 137 138 139 140 162 164 167 168 169 170 171 172 173 175 180 181 215 217
1 1 1 1 2 1 2 3 2 2 4 32 169 25 9 43 85 1 2 2 2 1
length(unique(substr(colnames(seqtab.nochim), 1, 100)))
[1] 227
dim(seqtab.nochim)
[1] 55 391
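Since the expected amplicon is 163-185 bp, it also helps to see how much total read abundance sits at each length (a quick sketch; most abundance should fall in the expected range):
tapply(colSums(seqtab.nochim), nchar(colnames(seqtab.nochim)), sum)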
# Make "otu_table"
seqs <- colnames(seqtab.nochim)
otab <- otu_table(seqtab.nochim, taxa_are_rows=FALSE)
colnames(otab) <- paste0("Seq_", seq(ncol(otab)))
# Function to write the sequences out as a fasta file
writeFasta <- function(seqs, output) {
seqsout <- mapply( function(idx, sequence) paste0(">Seq_",idx,"\n",sequence,"\n"),
seq(length(seqs)),
seqs)
write(paste0(seqsout), file = output, sep = "")
}
seqs_for_blast <- DNAStringSet(seqs)
names(seqs_for_blast) <- sapply(seq(length(seqs)),function(x) {paste0("Seq_",x)})
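As an aside, since DECIPHER loads Biostrings, the named DNAStringSet built above could be written out directly instead of using the writeFasta helper:
# writeXStringSet(seqs_for_blast, "results/tax_sequences.fasta")  # equivalent alternative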
# Write the fasta sequences and the OTU table
writeFasta(seqs, "results/tax_sequences.fasta")
write.table(otab, file="results/otutable.csv", col.names = NA)
I am using the blastn program in the terminal, which can be installed with the BLAST package from NCBI using conda: conda install blast. I already had it installed but was having trouble because v2.9 does not work on the remote host (according to this), so I downgraded my BLAST to v2.6 with conda install -c bioconda/label/cf201901 blast.
Use -max_target_seqs 1 for now, which puts only the top hit for each query in the file.
blastn -query results/tax_sequences.fasta \
-db nt \
-out results/tax_sequences_blast.txt \
-remote \
-max_target_seqs 1 \
-outfmt "6 qseqid sseqid pident length mismatch evalue bitscore staxids stitle"
The headings of the table are the following parameters from NCBI:
- qseqid: query (e.g., unknown gene) sequence id
- sseqid: subject (e.g., reference genome) sequence id
- pident: percentage of identical matches
- length: alignment length (sequence overlap)
- mismatch: number of mismatches
- evalue: expect value
- bitscore: bit score
- staxids: Subject Taxonomy ID(s), separated by a ‘;’
- stitle: Subject Title
I found an R package for this called taxonomizr.
Download all taxonomy assignments from NCBI and build a SQLite database (this takes several hours). Store this file somewhere shared instead of downloading it each time (because it’s huge).
prepareDatabase('databases/accessionTaxa.sql')
Read the blast results table into R and give it headers.
blastResults<-read_table2('results/tax_sequences_blast.txt', col_names = FALSE)
── Column specification ────────────────────────────────────────────────────────────────
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_character(),
  X10 = col_character(),
  X11 = col_character(),
  X12 = col_character(),
  X13 = col_character(),
  X14 = col_character(),
  X15 = col_character()
)
274 parsing failures.
row col expected actual file
2 -- 15 columns 19 columns 'results/tax_sequences_blast.txt'
4 -- 15 columns 23 columns 'results/tax_sequences_blast.txt'
5 -- 15 columns 20 columns 'results/tax_sequences_blast.txt'
6 -- 15 columns 13 columns 'results/tax_sequences_blast.txt'
7 -- 15 columns 16 columns 'results/tax_sequences_blast.txt'
... ... .......... .......... .................................
See problems(...) for more details.
blastResults <- unite(blastResults, Ref_Seq_title, X9, X10, X11, X12, X13, X14, X15, sep='_')
colnames(blastResults) <- c("ASV_ID", "ref_seq_ID", "PID", "alnmt_len", "mismatch", "eval", "bscore", "RefSeq_Tax_ID", "Ref_Seq_title")
blastResults
NOTE: in the above, there were a few cases where the blastn program reported more than one top hit. I checked these and most are exact duplicates, which can basically be ignored (dropped with unique() below). There is one with the same hit but a lower PID; delete this one too. They are all contamination (Felis_catus!) anyway. Clean up the blastResults table:
blastResults <- unique(blastResults)
rowtoremove <- filter(blastResults, ASV_ID == 'Seq_343' & PID <= 99.9)
blastResults <- anti_join(blastResults,rowtoremove)
Joining, by = c("ASV_ID", "ref_seq_ID", "PID", "alnmt_len", "mismatch", "eval", "bscore", "RefSeq_Tax_ID", "Ref_Seq_title")
blastResults
Grab the accession numbers.
# grab the indices of the accession numbers
# acc numbers are 4th |-separated field from the reference name in the second column
acc_matrix <- unlist(str_split(as.character(blastResults[,2]),'\\|'))
ind <- seq(from = 4, to = length(acc_matrix), by = 4)  # every 4th element (assumes each sseqid splits into 4 fields)
accessions <- acc_matrix[ind]
accessions
[1] "MH538898.1" "KX686086.1" "MT410920.1" "LC104597.1" "MN883200.1"
[6] "KU053334.1" "NC_037016.1" "MT410920.1" "KX686094.1" "MT103925.1"
[11] "KX686079.1" "KX686071.1" "KX686078.1" "KX686086.1" "KX686090.1"
[16] "KX686073.1" "KX686097.1" "MH538871.1" "MH377822.1" "MT103924.1"
[21] "HM447585.1" "KX686078.1" "KX686086.1" "MH538898.1" "KU053334.1"
[26] "KX686086.1" "KX686086.1" "MT410920.1" "MH538898.1" "AF488502.1"
[31] "MN883200.1" "MH538898.1" "MH377810.1" "AY279649.1" "LC104597.1"
[36] "KJ564214.1" "KX686086.1" "KY400011.1" "MH538898.1" "AB445126.1"
[41] "KX686086.1" "MT410920.1" "KX686085.1" "KX686099.1" "KX686086.1"
[46] "MH538898.1" "MH538898.1" "MN883200.1" "LC104597.1" "MN883220.1"
[51] "LC104597.1" "LC091819.1" "MH538898.1" "MT410945.1" "KX686086.1"
[56] "AY279649.1" "MH377834.1" "MH538898.1" "KX686086.1" "MH538898.1"
[61] "KX686086.1" "NC_037016.1" "MW172501.1" "MT410920.1" "KU053334.1"
[66] "KX686071.1" "MT410869.1" "AY279649.1" "KX686086.1" "MT103925.1"
[71] "MH538898.1" "KU053334.1" "MH538898.1" "MH538898.1" "MT410920.1"
[76] "MH538898.1" "LC104597.1" "KX686086.1" "KX686086.1" "KX686086.1"
[81] "KX686086.1" "MH377768.1" "LC104597.1" "KP644368.1" "MH538898.1"
[86] "MH538898.1" "MH538898.1" "MT653633.1" "MH538898.1" "NC_037016.1"
[91] "KX686086.1" "KX686094.1" "MH538898.1" "KX686086.1" "KX686086.1"
[96] "MH538898.1" "KX686094.1" "MN883200.1" "MT410920.1" "MH538898.1"
[101] "KX686086.1" "NC_041097.1" "MT789711.1" "KX686086.1" "MT103925.1"
[106] "KX686086.1" "KX686086.1" "MH538898.1" "LC104597.1" "KX686094.1"
[111] "MN883200.1" "MT410920.1" "KX686086.1" "NC_037016.1" "KX686098.1"
[116] "KX686071.1" "MT410920.1" "MN883200.1" "LC091819.1" "MT410920.1"
[121] "KX686086.1" "KX686086.1" "MT410920.1" "MN883200.1" "MH538898.1"
[126] "KX686086.1" "KX686086.1" "KU053334.1" "KU053334.1" "MT410920.1"
[131] "MT103925.1" "MH538898.1" "MT410920.1" "MT103925.1" "KX686086.1"
[136] "MN883200.1" "KU053334.1" "KX686071.1" "LC021216.1" "MH538898.1"
[141] "KM605252.1" "MH538898.1" "MH538898.1" "MH538898.1" "MT410920.1"
[146] "NC_037016.1" "KX686079.1" "MH538898.1" "LC104597.1" "NC_041097.1"
[151] "MT410920.1" "LC104597.1" "LC104597.1" "KX686071.1" "KX686086.1"
[156] "NC_037016.1" "MW172501.1" "MH538898.1" "NC_037016.1" "LC104597.1"
[161] "KX686086.1" "KX686086.1" "KX686086.1" "MH538898.1" "MT410920.1"
[166] "MH538898.1" "KY400011.1" "MH377834.1" "KX686086.1" "MH538898.1"
[171] "KU053334.1" "KU053334.1" "KX686071.1" "MT410920.1" "MH538898.1"
[176] "MT410920.1" "MH377810.1" "KX686086.1" "MH538898.1" "MT103925.1"
[181] "KX686080.1" "LC104597.1" "MH538871.1" "MN883200.1" "MT410920.1"
[186] "MT410920.1" "MN883200.1" "MN883200.1" "MN883200.1" "KX686071.1"
[191] "KX686086.1" "MT410920.1" "MT410920.1" "MT410920.1" "MT103924.1"
[196] "NC_037016.1" "MT410920.1" "KX686078.1" "LC104597.1" "MT410920.1"
[201] "KX686071.1" "MT410920.1" "MH538898.1" "MT103924.1" "KX686086.1"
[206] "KX686094.1" "KU053334.1" "LC104597.1" "KX686086.1" "KX686078.1"
[211] "MT410920.1" "KX686086.1" "MH538898.1" "MH538898.1" "NC_037016.1"
[216] "KX686086.1" "NC_037016.1" "MN883200.1" "KX686086.1" "LC104597.1"
[221] "MH538898.1" "KX686071.1" "LC104597.1" "KX686086.1" "MN883200.1"
[226] "MT410920.1" "KU053334.1" "KX686086.1" "MH538898.1" "MH538898.1"
[231] "MH377833.1" "KX686086.1" "MT410920.1" "MT410920.1" "KX686086.1"
[236] "MH377822.1" "KX686094.1" "AF488502.1" "MH538898.1" "KU053334.1"
[241] "LC104597.1" "KX686097.1" "MH538898.1" "MH538898.1" "KU053334.1"
[246] "KX686094.1" "MH538898.1" "MH538898.1" "MT410920.1" "LC104597.1"
[251] "MH538898.1" "KX686086.1" "MH538898.1" "MT410920.1" "KX686086.1"
[256] "MT103925.1" "KX686086.1" "LC104597.1" "MH538898.1" "AY279649.1"
[261] "KX686079.1" "KX686094.1" "LC104597.1" "KX686086.1" "NC_041097.1"
[266] "KX686094.1" "MH538898.1" "MT410920.1" "MH538898.1" "MH538898.1"
[271] "KX686086.1" "KU053334.1" "KX686094.1" "NC_037016.1" "KX686086.1"
[276] "MT410920.1" "LC104597.1" "KX686086.1" "LC104597.1" "KX686094.1"
[281] "KX686086.1" "HM447585.1" "AY279649.1" "MH538898.1" "LC104597.1"
[286] "KX686086.1" "NC_037016.1" "MH538898.1" "KX686086.1" "HM447585.1"
[291] "MH538898.1" "KU053334.1" "AF488502.1" "KX686086.1" "KX686086.1"
[296] "NC_037016.1" "MH538898.1" "MN883200.1" "KX686086.1" "KX686086.1"
[301] "MH538898.1" "KX686086.1" "MH538898.1" "KX686079.1" "KX686086.1"
[306] "KX686086.1" "KU053334.1" "MH377822.1" "MH538898.1" "MH538898.1"
[311] "LC104597.1" "MH538898.1" "KX686097.1" "KY400011.1" "LC104597.1"
[316] "KX686079.1" "KJ564214.1" "AY279649.1" "KX686086.1" "KY400011.1"
[321] "KY400011.1" "KX686086.1" "MH377810.1" "KX686090.1" "KX686086.1"
[326] "KX686086.1" "KX686079.1" "NC_037016.1" "KU053334.1" "KX686086.1"
[331] "MT903916.1" "MN883200.1" "KU053334.1" "KP644368.1" "KX686086.1"
[336] "LC104597.1" "MH538898.1" "MH538898.1" "MH538871.1" "MH538898.1"
[341] "MH377822.1" "AP023162.1" "KX686079.1" "MT410920.1" "LC104597.1"
[346] "MH538898.1" "KY400011.1" "MT410945.1" "KX686094.1" "KX686086.1"
[351] "MH538898.1" "LC104597.1" "LC104597.1" "KX686086.1" "HM447585.1"
[356] "AY279649.1" "MH538898.1" "KX686081.1" "MH377810.1" "LC104597.1"
[361] "KX151652.1" "MH538898.1" "MT410920.1" "MH538898.1" "MN883200.1"
[366] "XM_012362088.1" "LC580279.1" "MH377794.1" "KX686073.1" "MH538898.1"
[371] "KT981806.1" "MT588226.1" "NC_037016.1" "XM_012362088.1" "GU733826.1"
[376] "MT789711.1" "MT410920.1" "KX686086.1" "GU733826.1" "MT410920.1"
[381] "MT410920.1"
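As an aside, the fixed-stride indexing above assumes every sseqid splits into exactly four fields. A stride-free sketch would take the 4th pipe-delimited field from each row instead:
# accessions <- sapply(str_split(as.character(blastResults$ref_seq_ID), '\\|'), `[`, 4)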
Get the taxonomy ID for each accession number.
taxaId<-accessionToTaxa(accessions,"databases/accessionTaxa.sql")
taxaId[1:10]
[1] 238744 224708 184585 75034 40503 66718 469659 184585 203315 183654
And get taxonomy for those tax IDs
taxonomy_table <- getTaxonomy(taxaId,'databases/accessionTaxa.sql')
# the row names of the getTaxonomy output are the tax IDs; move them into a column
taxonomy_table <- cbind(taxonomy_table, rownames(taxonomy_table))
colnames(taxonomy_table)[dim(taxonomy_table)[2]] <- "RefSeq_Tax_ID"
rownames(taxonomy_table) <- NULL
taxonomy_table <- as.data.frame(taxonomy_table)
taxonomy_table
Append taxonomy table to blast results table and export as one file
tax_sequences_blast_taxonomy <- cbind(blastResults, taxonomy_table)
write.table(tax_sequences_blast_taxonomy, file = "results/tax_sequences_blast_taxonomy.csv", sep = ",", col.names=NA)