Why paired end reads




















In terms of data usage, more joined reads than merged reads were used for inferring OTU abundance. For example, 2,, of the directly joined reads could be mapped to the corresponding OTUs, while only 1,, merged reads could be mapped. These results indicate that read joining made better use of the real PE data for taxonomy annotation. In the following analyses, directly joined reads were used. To investigate whether asthma status affected the microbiota, we compared the asthma attack and recovery samples using UniFrac [20] (see Methods).

Principal coordinate analysis revealed that asthma status was not a major factor shaping the community structure (Fig.). In addition, the weighted UniFrac distances between two samples of the same individual were significantly smaller than the distances between samples of different individuals (t-test p-value 0.). This suggests substantial differences between individuals, so the asthma attack and recovery samples of the same individual should be compared with each other.

The two OTUs showed a higher proportion in the noses of four and three patients, respectively, and a lower proportion in none. Interestingly, both genera have been associated with childhood asthma (see Discussion) and are therefore promising candidates for further experimental investigation.

Repeating the above analysis for the trimmed first reads failed to identify any differential OTU. For the merged reads, only Moraxella but not Sphingomonas was identified. This demonstrates the benefit of joining PE reads when their merges are limited, as in 16S studies.

Our simulation demonstrated that joining unmergeable PE reads could improve taxonomy annotation. The estimated benefit is conservative because we did not consider the possibility of correcting sequencing errors. In 16S studies, sequencing errors can be corrected by referring to other sequences in the data [21, 22].

For example, clustering sequences into OTUs is also an act of error correction, and the OTU representative sequences are usually highly accurate, which should enhance the benefit of read joining (S5 and Table S4). Although error correction is possible, we emphasize the importance of trimming low-quality bases when analyzing real data.

In our data, for example, if the whole first and second reads were joined directly, almost all joined reads would fail the filtering step of OTU clustering, which keeps only reads with less than one expected error. We tried increasing the threshold to ten, but almost all reads passing the filter were singletons, which seriously degraded the OTU clustering (only ten OTUs were obtained).
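To make the filtering criterion concrete, the following minimal sketch computes the number of expected errors of a read as the sum of its per-base error probabilities. It assumes Phred+33 quality encoding; the function names and the `max_ee` parameter are illustrative and are not taken from the pipeline used in the study.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for a FASTQ quality string.

    Assumes Phred scores encoded with the given ASCII offset (33 here),
    so a score Q corresponds to an error probability of 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_filter(quality: str, max_ee: float = 1.0) -> bool:
    """Keep a read only if it has fewer than max_ee expected errors."""
    return expected_errors(quality) < max_ee


if __name__ == "__main__":
    good = "I" * 10   # ten bases at Q40
    poor = "+" * 12   # twelve bases at Q10
    print(expected_errors(good), passes_filter(good))
    print(expected_errors(poor), passes_filter(poor))
```

Because low-quality tails contribute most of the expected errors, joining untrimmed reads quickly pushes the sum past the threshold, which is consistent with the behaviour described above.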

Therefore, trimming low quality bases is necessary to ensure an appropriate OTU clustering. Note that we suggest trimming reads to a fixed length instead of quality trimming for OTU clustering.

Quality trimming usually results in trimmed reads of different lengths, which biases the clustering procedure [21]. Confining reference sequences to the amplicon region has been shown to improve taxonomy annotation [23]. For our real data, limiting the references to the amplicon region indeed gave more confident OTUs at the genus level, e.

Although the improvement is not large, using amplicons as the reference is usually favored. When extracting amplicons, identifying primer sites by aligning the primers to the references will fail if the reference sequences do not extend to the primer site. For example, among the 13, training sequences in the RDP 16S database, only covered the 27F primer site. JTax addresses this issue by selecting a long sequence that covers both primer sites as the main reference and extracting each amplicon based on a pairwise alignment between the reference sequence and the main reference.

For the V1-V3 primer pair, JTax output 13, amplicons and missed only six sequences because the bases did not make up at least half of the amplicons.

We identified Moraxella and Sphingomonas as candidate bacterial genera associated with asthma exacerbation in children. Consistent with this finding, both genera have previously been implicated in childhood asthma.

For example, in acute respiratory illness, which is mainly caused by viral infection, Moraxella was also found to be more abundant in the nasopharynx of patients [24]. In fact, a causal effect of Moraxella in asthma exacerbation has recently been shown in animal experiments [25]. This suggests that the Moraxella species in the noses of some patients likely triggered the asthma exacerbation.

The genus Sphingomonas has been reported to be enriched in the house dust of children with asthma [26]. This may explain the enrichment of Sphingomonas in the noses of some asthmatic children during asthma attacks.

In bronchial microbiome studies of asthmatic patients, the family Sphingomonadaceae has also been shown to be enriched [27] and highly correlated with the degree of bronchial hyper-responsiveness [28]. These corroborating reports support the validity of our experimental and analytical procedures.

In metagenomic studies involving a marker gene, Illumina PE reads sometimes cannot be merged for taxonomy annotation.

Faced with this problem, it is often unclear how to use the PE data effectively because a detailed evaluation of different approaches has been missing. Here, we rigorously evaluated procedures for utilizing unmergeable PE data for classification by various top classifiers. Based on our results, we make several suggestions.

First, joining PE reads into single reads is always recommended, as read joining improved the classification accuracy in most of our investigations with simulated sequencing errors. Second, trimming reads to a fixed length before joining is suggested to optimize OTU clustering and classification. Third, the joining method (direct joining or inside-out) can affect the performance of alignment-based classifiers, but not that of word-counting classifiers.
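The first two suggestions can be illustrated with a minimal sketch. It assumes that direct joining means concatenating the trimmed first read with the reverse complement of the trimmed second read, and that pairs with a read shorter than its target length are discarded; both are assumptions made for illustration, as are the function names and the length parameters. The inside-out variant is not shown.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_and_join(read1: str, read2: str, len1: int, len2: int):
    """Trim both reads to fixed lengths, then join them into one read.

    Returns None when either read is shorter than its target length,
    so that every joined read has exactly len1 + len2 bases.
    Assumption: the joined read is read1 followed by the reverse
    complement of read2, i.e. both parts are on the same strand.
    """
    if len(read1) < len1 or len(read2) < len2:
        return None
    return read1[:len1] + reverse_complement(read2[:len2])


if __name__ == "__main__":
    # Toy reads; real target lengths would be chosen from quality profiles.
    print(trim_and_join("ACGTAC", "TTGGCC", 4, 3))
```

Trimming to fixed lengths keeps all joined reads the same length, which matches the reasoning given earlier for preferring fixed-length trimming over quality trimming before OTU clustering.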

For alignment-based classifiers, rearranging reference is recommended to avoid problems caused by gaps between or the inverse order of paired reads. In general, a classifier based on global alignment is favored over one based on local alignment because the whole joined reads i.

For word-counting classifiers, rearranging the reference sequences did not make a difference in classification accuracy.

Therefore, joined reads can be directly compared to the original reference database. To further improve classification, amplicons instead of full-length sequences can be used as reference, although the improvement may be minor.

Amplicon extraction will fail when reference sequences do not extend to the primer site; in that case, JTax can be used. These recommendations should be useful for properly utilizing unmergeable PE data of a marker gene in metagenomic studies.

Full-length 16S sequences with known taxonomy i. This alleviated the concern of an unbalanced reference for performance evaluation. To implement the idea of cross-validation by identity, we designed a greedy algorithm to partition the TAXXI reference into training and testing datasets such that the alignment identity between each testing sequence and its best hit in the training data was within a certain range, e.

Readers are advised to consult Fig. Given reference sequences and a primer pair, the corresponding amplicons were first extracted using JTax to serve as confined references. As optimizing S and A was similar to minimizing Z, a reference r should be assigned to S or A earlier if it excluded fewer sequences.

Therefore, it was better to increase A slowly. Based on these ideas, for each reference r we defined tz(r) as the union of z(t), where t ranges over the top hits of r within the identity range. We then sorted the references by z(r) and tz(r) from small to large.

Starting with empty S, A, and Z, the first reference r was assigned to S, and t(r) and z(r) were assigned to A and Z, respectively. For the next reference r, if at least one member of t(r) had not been assigned to S or Z, r was assigned to S and the non-assigned members of t(r) were assigned to A. Otherwise, r was assigned to A if it had not been assigned to Z. To increase A slowly, the number of references assigned to A was limited to no more than three in each run.

This procedure was repeated for all references. The resulting A and S served as the training and testing datasets for evaluating taxonomy prediction.

Note that for the V1-V3 primer pair, some references did not extend to the primer site 27F, thus could only serve as training data but not testing data. For each testing sequence, three PEs were simulated.

Given the hierarchical nature of taxonomy annotation, we calculated the rates of three types of errors at different taxonomy levels: over-classification (OC), under-classification (UC), and misclassification (MC). At a given level, an OC error occurred when the predicted rank did not exist in the training data.
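The three error types can be sketched in code for a single taxonomy level. Only the OC definition is stated above; the UC and MC definitions used here follow the usual convention (a missing prediction for a taxon present in the training data, and a wrong name for a taxon present in the training data, respectively) and are assumptions. The rates are computed as simple fractions of all queries, which may differ from the normalization used in the study. The function names and the toy records are hypothetical.

```python
from collections import Counter


def classify_outcome(true_name, predicted_name, training_names):
    """Label one prediction at one taxonomy level.

    predicted_name is None when the classifier made no call at this level.
    """
    known = true_name in training_names
    if predicted_name is None:
        # Staying silent is an error only if the taxon could have been named.
        return "UC" if known else "correct"
    if not known:
        # A name was predicted although the true taxon is absent from training.
        return "OC"
    return "correct" if predicted_name == true_name else "MC"


def error_rates(records, training_names):
    """Fraction of queries falling into each error type.

    records is an iterable of (true_name, predicted_name) pairs.
    """
    counts = Counter(classify_outcome(t, p, training_names) for t, p in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("OC", "UC", "MC")}


if __name__ == "__main__":
    training = {"Moraxella", "Sphingomonas"}
    toy_records = [
        ("Moraxella", "Moraxella"),     # correct
        ("Moraxella", None),            # UC
        ("Sphingomonas", "Moraxella"),  # MC
        ("Neisseria", "Moraxella"),     # OC: true genus not in training data
    ]
    print(error_rates(toy_records, training))
```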

For cross-validation by identity, the mean accuracy of the five top-hit identities was also calculated.

Exacerbated asthma without fever was defined as self-reported and physician-diagnosed current asthma presenting with a chief complaint of shortness of breath, with an encounter diagnosis of asthma exacerbation and a need for acute reliever treatment.

Non-exacerbated asthma was defined as self-reported and physician-diagnosed current asthma presenting for routine, non-urgent asthma follow-up care. We collected samples in duplicate using sterile cotton swabs from the anterior nares of the nasal cavities and the retropharyngeal space of 12 asthmatic children, at both the acute asthma exacerbation phase and the recovery phase, two weeks apart.

Swabbed samples were kept in 1. All of the swab samples were transported to the core facility with ice packs within 1 h after collection.

The quantity and quality of the extracted DNA were analyzed by spectrophotometry using a NanoDrop spectrophotometer (Thermo Scientific) and by agarose gel electrophoresis. Index codes were added to Illumina sequencing adapters and dual-index barcodes to the amplicon target.

The more times a base is sequenced, the better the quality of the data. For RNA-seq, we generally recommend a minimum of 20 million reads per sample. For sequencing projects that require higher accuracy, such as studies of alternative splicing, 40 million to 60 million paired-end reads will provide better results.

For more detailed analyses to determine, for example, allele-specific expression or expression of low-abundance transcripts, 60 million to million reads may be required. During sequencing, it is possible to specify the number of base pairs that are read at a time (the read length).

When you align the two reads of a pair to the genome, one read should align to the forward strand and the other to the reverse strand, at a higher base-pair position than the first one, so that they point towards one another.
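The orientation rule can be sketched as a small function. This is a simplified illustration that compares only the leftmost alignment positions of the two mates; the function name and the "RF" and "TANDEM" labels for the outward-facing and same-strand cases are naming assumptions, not taken from the text above.

```python
def pair_orientation(pos1: int, is_reverse1: bool,
                     pos2: int, is_reverse2: bool) -> str:
    """Classify the orientation of an aligned read pair.

    pos1 and pos2 are leftmost alignment positions; is_reverse1 and
    is_reverse2 say whether each mate aligned to the reverse strand.
    """
    if is_reverse1 == is_reverse2:
        # Both mates on the same strand: neither inward nor outward facing.
        return "TANDEM"
    forward_pos = pos2 if is_reverse1 else pos1
    reverse_pos = pos1 if is_reverse1 else pos2
    # Forward mate upstream of the reverse mate: reads point at each other.
    return "FR" if forward_pos <= reverse_pos else "RF"


if __name__ == "__main__":
    print(pair_orientation(100, False, 300, True))   # inward facing
    print(pair_orientation(300, False, 100, True))   # outward facing
    print(pair_orientation(100, False, 200, False))  # same strand
```

A production check would also take read lengths and alignment end coordinates into account, for example for mates that overlap almost completely.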

All of this holds for conventional paired-end sequencing, where the two reads point towards one another (FR). A different arrangement arises when the reverse read aligns at a lower base-pair position than the forward read, so that the two reads point away from one another. When I go back and pull out a sampling of the reads with flag value


