Long Read Assembly with proprietary ONT-cgMLST-Polisher

Requirements

Important:

If Ridom Typer is installed on Windows, Long Read Assembly requires the Windows Subsystem for Linux (WSL).
If Ridom Typer is installed on Linux, Long Read Assembly must be installed once by running the installation of Bioinformatic Tools on Linux.
Long Read Assembly requires at least 32 GB RAM.

Assembling Oxford Nanopore (ONT) data

Oxford Nanopore Technologies (ONT) FASTQ-files can be assembled and polished. An assembly pipeline can also directly monitor and process MinKNOW run data.

The following pipeline script options are available:

Trimming
- Chopper: Applies a headcrop (trims the start of each read) and a tailcrop (trims the end of each read). Reads are filtered by average read quality and by minimum or maximum read length. Turned on by default with quality 10 and minimum length 500.
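The cropping and filtering described above can be sketched in a few lines of Python. This is a minimal illustration and not Chopper's implementation: the function names, the order of cropping before filtering, and the choice to average quality over error probabilities (rather than over raw Phred scores) are assumptions made for the example.

```python
import math


def mean_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over error probabilities."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def trim_and_filter(seq, qual, headcrop=0, tailcrop=0,
                    min_quality=10, min_length=500, max_length=None):
    """Crop a read, then return (seq, qual) if it passes the filters, else None.

    Assumption: cropping happens before the length and quality checks.
    """
    end = max(headcrop, len(seq) - tailcrop)
    seq, qual = seq[headcrop:end], qual[headcrop:end]
    if not seq or len(seq) < min_length:
        return None
    if max_length is not None and len(seq) > max_length:
        return None
    if mean_quality(qual) < min_quality:
        return None
    return seq, qual
```

With the defaults named above (quality 10, minimum length 500), a 600-base read of uniform Q20 passes, while the same read at Q7 is dropped.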

Subsampling (and filtering)
- Rasusa: In contrast to Filtlong, randomly subsamples reads of different lengths to a specified coverage (by default turned on with coverage 100).
- Filtlong: Filters long reads by quality (longer is better) and subsamples. Might be beneficial if subsampling is applied with RBK and especially RPBK data.
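Random subsampling to a target coverage can be illustrated as follows. This is a sketch of the general idea (shuffle the reads, then keep reads until the summed read length reaches coverage × genome size), not Rasusa's actual code; the function name and the `genome_size` and `seed` parameters are assumptions introduced for the example.

```python
import random


def subsample_to_coverage(read_lengths, genome_size, coverage=100, seed=None):
    """Return sorted indices of randomly chosen reads reaching the target coverage.

    Reads are drawn in random order, independent of length or quality,
    until their total length reaches coverage * genome_size.
    """
    target = coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)
    kept, total = [], 0
    for idx in order:
        if total >= target:
            break
        kept.append(idx)
        total += read_lengths[idx]
    return sorted(kept)
```

If the input holds less data than the target, all reads are kept. A quality-aware tool such as Filtlong would instead rank reads before discarding them, which is the contrast drawn in the list above.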

De novo assembly
- Flye (default ONT assembler): Uses a repeat graph as the core data structure. Compared to de Bruijn graphs, which require exact k-mer matches, repeat graphs are built from approximate sequence matches and can therefore tolerate noisier reads. Does not correct the raw reads (in contrast to the Canu assembler). States circularity and assembled coverage of contigs. Runs with the --nano-hq option by default. Flye is not fully deterministic, i.e., re-analyzing the same dataset does not always yield exactly the same results.
- Raven: Overlap-layout-consensus assembler that accelerates the overlap step, builds an assembly graph from reads that were pre-processed with pile-o-grams, and polishes the unambiguous graph paths with Racon. Does not correct the raw reads. States circularity and assembled coverage of contigs and is deterministic.

Polishing
- Medaka 2.0: Creates consensus sequences from nanopore sequencing data. This task is performed using neural networks applied to a pileup of individual sequencing reads against a draft assembly. Corrects only the FASTA consensus and not the FASTQ raw reads. If Rasusa or Filtlong was applied, medaka uses the subsampled reads only (by default turned on with model r1041_e82_400bps_bacterial_methylation).
- ONT-cgMLST-Polisher: The proprietary ONT-cgMLST-Polisher is part of the MBioSEQ Ridom Typer. First, it uses minimap2 to map the Dorado-basecalled FASTQ reads (SUP model 4.2 or later) to the assembly consensus FASTA sequence derived from Medaka 2.0. Next, it scans the alignment for positions in the core and accessory genome MLST genes that might indicate methylation-related sequencing errors, e.g., differing strand-specific majority consensus calls. These ‘ambiguous’ positions are then compared against a sequence with a closely related cgMLST allelic profile. Finally, based on this comparison, the consensus call at each ambiguous position is either confirmed or masked with an ‘N’ (turned on by default).
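The ONT-cgMLST-Polisher is proprietary and its rules are not published here, so the following is only a toy illustration of the idea of "differing strand-specific majority consensus calls" followed by confirm-or-mask. Everything in it is hypothetical: the pileup data layout, the `min_strand_depth` threshold, and the decision rule (a flagged base is confirmed only if it equals the base of the related sequence at the same coordinate).

```python
def majority_base(counts):
    """Most frequent base in a {base: count} dict (ties broken alphabetically)."""
    if not counts:
        return None
    return max(sorted(counts), key=lambda base: counts[base])


def ambiguous_positions(pileup, min_strand_depth=5):
    """Flag positions whose forward and reverse majority calls disagree.

    pileup: {position: (forward_counts, reverse_counts)}, hypothetical layout.
    Positions with too few reads on either strand are skipped.
    """
    flagged = []
    for pos in sorted(pileup):
        fwd, rev = pileup[pos]
        if min(sum(fwd.values()), sum(rev.values())) < min_strand_depth:
            continue
        if majority_base(fwd) != majority_base(rev):
            flagged.append(pos)
    return flagged


def mask_unconfirmed(consensus, flagged, related):
    """Keep flagged bases that match the related sequence, mask the rest with N.

    Assumption: consensus and related sequence share the same coordinates.
    """
    bases = list(consensus)
    for pos in flagged:
        if bases[pos] != related[pos]:
            bases[pos] = "N"
    return "".join(bases)
```

In this sketch a position is only ever confirmed or masked, never corrected, which mirrors the behaviour described in the list above.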

Hybrid assemblies are not supported. For further information see our long-read de novo assembler evaluation.

Accuracy and Contiguity

For accuracy and contiguity evaluations, see the links. Furthermore, the ONT-cgMLST-Polisher was tested (including Dorado model 5.0) in a recent ring trial involving six different laboratories (Prior et al., 2025).

Assembling PacBio HiFi data

Only HiFi FASTQ files are supported as input files. The menu function Tools | Genome Utilities | Convert BAM to FASTQ can be used to convert HiFi reads from BAM format into FASTQ. When assembling PacBio HiFi FASTQ-files, trimming and polishing are disabled. The following options are available:

Subsampling (and filtering)
- Rasusa: In contrast to Filtlong, randomly subsamples reads of different lengths to a specified coverage (by default turned on with coverage 100).
- Filtlong: Filters long reads by quality (longer is better) and subsamples.

De novo assembly
- Flye: Uses a repeat graph as the core data structure. States circularity and assembled coverage of contigs. Runs with the --pacbio-hifi option. Flye is not fully deterministic, i.e., re-analyzing the same dataset does not always yield exactly the same results.
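For orientation, the two Flye read-type options mentioned in this document (--nano-hq for ONT, --pacbio-hifi for HiFi) could be selected as in the sketch below. How Ridom Typer actually invokes Flye is not documented here; the helper name, the `platform` keys, and the default thread count are assumptions, and only standard Flye options (--out-dir, --threads) are used.

```python
def flye_command(reads, out_dir, platform, threads=8):
    """Build a Flye argument list for ONT ("ont") or PacBio HiFi ("hifi") reads."""
    read_type_option = {"ont": "--nano-hq", "hifi": "--pacbio-hifi"}[platform]
    return ["flye", read_type_option, reads,
            "--out-dir", out_dir, "--threads", str(threads)]


# Example: print the command line without running it.
print(" ".join(flye_command("reads.fastq", "flye_out", "hifi")))
```

The returned list can be passed to `subprocess.run(..., check=True)` on a system where Flye is installed.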

Processing Pre-Assembled FASTA contigs of Long-Read Data

When processing pre-assembled FASTA files of long-read data (e.g., a PacBio HiFi assembly), the topology and coverage information is extracted from the contig headers of the FASTA files.

The following terms are recognized if stated in the FASTA header line of a contig:

- topology=circular: Marks a contig that contains a complete circular plasmid or chromosome. This term is used to define circular plasmids for Chromosome and Plasmids Overview Task Template processing. Knowing whether a contig is circular might improve the MOB-suite plasmid reconstruction process.
- coverage=100 (or another number): Specifies the coverage value of the contig. Contigs with coverage=0 are excluded from the coverage calculation. Every contig in the FASTA file must carry coverage information; otherwise no average assembled coverage is calculated. The average coverage is useful for QC.
- [no-recon]: The Chromosome and Plasmids Overview Task Template processing skips the plasmid reconstruction with the tool MOB-recon for this contig, so the contig is treated as a complete plasmid (but not marked as circular).

These terms can also be combined with NCBI-conformant naming of the contigs, e.g., for a circular chromosome:
>contig1_1710375900 [topology=circular][completeness=complete][chromosome];5261576;coverage=29
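The header terms described above can be checked with a short parser before importing a FASTA file. This is an illustrative sketch, not the parser used by Ridom Typer; the function names are hypothetical, and the unweighted mean in `average_coverage` is an assumption, since the document does not state how the average is computed.

```python
import re


def parse_contig_header(header):
    """Extract topology, coverage, and [no-recon] from a FASTA header line."""
    coverage = re.search(r"coverage=(\d+(?:\.\d+)?)", header)
    return {
        "name": header.lstrip(">").split()[0],
        "circular": "topology=circular" in header,
        "coverage": float(coverage.group(1)) if coverage else None,
        "no_recon": "[no-recon]" in header,
    }


def average_coverage(headers):
    """Mean coverage over contigs, or None if any contig lacks coverage.

    Contigs with coverage=0 are excluded. Assumption: unweighted mean.
    """
    values = [parse_contig_header(h)["coverage"] for h in headers]
    if not values or any(v is None for v in values):
        return None
    nonzero = [v for v in values if v > 0]
    return sum(nonzero) / len(nonzero) if nonzero else None


example = (">contig1_1710375900 [topology=circular][completeness=complete]"
           "[chromosome];5261576;coverage=29")
print(parse_contig_header(example))
```

For the example header above, the parser reports a circular contig with coverage 29 and no [no-recon] flag.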

Using a tool like Circlator in the pipeline to fix the start (fixstart) and orientation of contigs helps tremendously with downstream visualization and comparison of chromosomes and plasmids. For chromosomes, Circlator by default uses matches to the dnaA gene for this function. To define the start and orientation of most plasmids, the CGE PlasmidFinder replicon database that is used for rep-typing could be utilized.
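What fixing the start and orientation means for a circular contig can be shown with a small sketch. It is not Circlator's implementation: it assumes the anchor gene (e.g., dnaA) has already been located, that its coordinates are given as a 0-based half-open interval on the forward strand, and that the gene does not span the current contig ends. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fix_start(seq, gene_start, gene_end, strand="+"):
    """Rotate a circular sequence so it starts with the anchor gene, forward-oriented.

    gene_start/gene_end: 0-based half-open footprint of the gene on seq.
    strand: "+" if the gene is encoded on seq as given, "-" otherwise.
    """
    if strand == "-":
        # Flip the contig; the gene's first base then lies at len(seq) - gene_end.
        seq = reverse_complement(seq)
        gene_start = len(seq) - gene_end
    return seq[gene_start:] + seq[:gene_start]


print(fix_start("TTTATGCCC", 3, 7))        # gene ATGC on the forward strand
print(fix_start("AGGGCATAA", 3, 7, "-"))   # same gene on the reverse strand
```

Both calls yield the same rotated sequence, which is the point of the normalization: identical replicons become directly comparable regardless of where the assembler happened to break the circle.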
