Dear Trinity developers,

I have been using Trinity to assemble the transcriptome of an organism that lacks a good-quality genome. The resulting transcriptome will be used for differential expression analysis.

Since I wanted the transcriptome to be as complete as possible, I relied on high coverage. For my first assembly I used ~1 billion single-end reads from several libraries, with read lengths ranging from 25 to 75 bp. After running Trinity I obtained 320,276 contigs longer than 100 bp. As some of the contigs were redundant, I used minimus2 to merge very similar sequences, which left me with slightly more than 100,000 contigs. When I mapped the libraries that I want to analyze for differential expression, more than 89% of the reads in each library were mappable, and only around 1-2% of the reads aligned to more than one place, so I was quite happy with these results.
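For reference, the pipeline was roughly the following (file names, memory, and CPU settings are placeholders, and the minimus2 parameters shown are just the commonly used values for merging assemblies, not necessarily the exact ones I used):

    # Trinity on the pooled single-end libraries (placeholder settings)
    Trinity.pl --seqType fq --single all_SE_reads.fq \
        --JM 100G --CPU 16 --min_contig_length 100 --output trinity_SE

    # Merge near-identical contigs with minimus2 (from the AMOS package):
    # convert the assembly to AMOS message format, then overlap-merge
    toAmos -s trinity_SE/Trinity.fasta -o trinity_SE.afg
    minimus2 trinity_SE -D OVERLAP=40 -D MINID=95
    # merged contigs end up in trinity_SE.fasta, unmerged ones in
    # trinity_SE.singletons.seq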

I understand that with very high coverage I could be assembling transcripts that originate from pervasive transcription, and I think this could be one of the reasons, besides fragmentation, why I am getting over 100,000 contigs. This might not be a major problem, though, since I could apply filters before proceeding with the differential expression analysis, for example keeping only contigs supported by more than x reads (see the sketch below).
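A minimal sketch of such a filter, assuming the reads have already been aligned and the BAM has been sorted and indexed with samtools (the threshold of 10 reads is only an example):

    # samtools idxstats prints: contig, length, #mapped, #unmapped reads
    samtools idxstats library1.sorted.bam > counts.txt

    # keep the IDs of contigs with more than 10 mapped reads
    awk '$3 > 10 {print $1}' counts.txt > keep_ids.txt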

However, I now have access to paired-end libraries and to other single-end libraries with longer reads (75-100 bp), so I assembled these new libraries to check whether the paired-end information could solve the fragmentation problem. Altogether I had 2,682,537,009 reads, and I used in silico normalization to reduce them to 107,846,820. For the normalization I did not use the --PARALLEL_STATS parameter (because of memory limitations). I followed the same pipeline as with my first assembly, and after Trinity and minimus2 I have 200,655 contigs. Nevertheless, I now have more redundant contigs: for instance, some components contain more than 100 sequences that share at least 50 bp. Moreover, I do not see an evident gain in completeness when I compare my two transcriptomes, at least with the assays I carried out, such as checking the number and coverage of orthologous sequences included in each assembly, using the transcriptome of the phylogenetically closest organism as a reference.
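The normalization step was roughly the following (paths, memory, and the --max_cov value are placeholders; the script ships in the util/ directory of the Trinity distribution):

    # in silico normalization of the pooled libraries
    util/insilico_read_normalization.pl --seqType fq --JM 100G \
        --max_cov 30 \
        --left all_left.fq --right all_right.fq --pairs_together \
        --CPU 8
    # --PARALLEL_STATS omitted to keep peak memory usage down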

At this point I do not know whether it would be better to use my first assembly for the downstream analysis, or to try to assemble the new dataset again using options like --REDUCE (as previously suggested in http://sourceforge.net/mailarchive/forum.php?thread_name=CAJCu8qPrOE1BjFPk%2BYA_Gkk_s8wr29afWEBPPk01DAeaxiE8wA%40mail.gmail.com&forum_name=trinityrnaseq-users) or --max_number_of_paths_per_node to address the redundancy, and --min_kmer_cov to exclude very lowly expressed transcripts. I would appreciate any comments or suggestions in this regard.
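If re-assembling is the way to go, is something along these lines what you would suggest? The values below are only examples, and I am assuming --REDUCE can be passed to Butterfly through --bfly_opts; please correct me if it needs to go elsewhere.

    # hypothetical re-run with the options mentioned above
    Trinity.pl --seqType fq \
        --left all_left.fq --right all_right.fq \
        --JM 100G --CPU 16 \
        --min_kmer_cov 2 \
        --max_number_of_paths_per_node 10 \
        --bfly_opts "--REDUCE" \
        --output trinity_PE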

Regards,
Jose Trejo Uribe