7.1.2. Mikado prepare

This is the first executive step of the Mikado pipeline. It will accomplish the following goals:

  1. Collect annotations from disparate annotation files.
  2. Remove redundant assemblies, ie, assemblies that are identical across the various input files.
  3. Determine the strand of the transcript junctions.
  4. Ensure uniqueness of the transcript names.
  5. Order the transcript by locus.
  6. Extract the transcript sequences.

7.1.2.1. Usage

Mikado prepare allows to override some of the parameters present in the configuration file through command line options, eg. the input files. Notwithstanding, in the interest of reproducibility we advise to configure everything through the configuration file and supply it to Mikado prepare without further modifications.

Available parameters:

  • configuration: the most important parameter. This is the configuration file created through Mikado configure.
  • fasta: reference genome. Required, either through the command line or through the configuration file.
  • out: Output GTF file, with the collapsed transcripts.
  • out_fasta: Output FASTA file of the collapsed transcripts.
  • start-method: multiprocessing start method.
  • verbose, quiet: flags to set the verbosity of Mikado prepare. It is generally not advised to turn the verbose mode on, unless there is a problem to debug, given the verbosity of the output.
  • strand-specific: If set, all assemblies will be treated as strand-specific.
  • strand-specific-assemblies: comma-separated list of strand specific assemblies.
  • strip-cds: some aligners (eg GMAP) will try calculate a CDS on the fly for alignments. Use this flag to discard all CDS information from all input transcripts.
  • exclude-redundant: if set, this flag instructs Mikado to look for and simplify redundant intron chains. By default, this option is disabled, or enabled on a per-sample basis. See this section for an explanation of redundancy removal in Mikado, and the section on the list input files below for an explanation on how to set this value on a per-sample basis (recommended).
  • codon-table: Mikado prepare will check the validity of the ORFs of input models. This value indicates which codon table Mikado should use for this purpose. See the section on the checks on CDSs.
  • strip-faulty-cds: when encountering a transcript with an invalid ORF due to e.g. in-frame stop codons, Mikado will usually discard the whole transcript. If this flag is set, Mikado will instead remove the CDS information and leave the transcript in place.
  • list: in alternative to specifying all the information on the command line, it is possible to give to Mikado a tab-separated file with details of the files to use. See this section for details.
  • log: log file. Optional, by default Mikado will print to standard error.
  • lenient: flag. If set, multiexonic transcripts without any canonical splice site will be output as well. By default, they would be discarded.
  • minimum-cdna-length: minimum length of the transcripts to be kept, default 200 bps.
  • max-intron-size: maximum length of introns for non-reference transcripts, default 1,000,000 bps. Transcripts with introns longer than this will be split into multiple pieces. Note: transcripts marked as “reference” will never be split in this way; Mikado will simply emit a warning in the log.
  • seed: integer seed to use for reproducibility. By default, Mikado will use the seed set in the configuration file.
  • random-seed: boolean switch. If selected, Mikado will use a random seed selected at runtime (and reported in the log)
  • single: flag that disables multiprocessing. Mostly useful for debug purposes.

Command line usage:

$ mikado prepare --help
usage: Mikado prepare [-h] [--fasta REFERENCE] [--verbose | --quiet | -lv {DEBUG,INFO,WARN,ERROR}] [--start-method {fork,spawn,forkserver}] [-s | -sa STRAND_SPECIFIC_ASSEMBLIES] [--list LIST] [-l LOG]
                  [--lenient] [-m MINIMUM_CDNA_LENGTH] [-MI MAX_INTRON_LENGTH] [-p PROCS] [-scds] [--labels LABELS] [--codon-table CODON_TABLE] [--single] [-od OUTPUT_DIR] [-o OUT] [-of OUT_FASTA]
                  [--configuration CONFIGURATION] [-er] [--strip-faulty-cds] [--seed SEED | --random-seed]
                  [gff [gff ...]]

positional arguments:
  gff                   Input GFF/GTF file(s).

optional arguments:
  -h, --help            show this help message and exit
  --fasta REFERENCE, --reference REFERENCE
                        Genome FASTA file. Required.
  --verbose
  --quiet
  -lv {DEBUG,INFO,WARN,ERROR}, --log-level {DEBUG,INFO,WARN,ERROR}
                        Log level. Default: derived from the configuration; if absent, INFO
  --start-method {fork,spawn,forkserver}
                        Multiprocessing start method.
  -s, --strand-specific
                        Flag. If set, monoexonic transcripts will be left on their strand rather than being moved to the unknown strand.
  -sa STRAND_SPECIFIC_ASSEMBLIES, --strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
                        Comma-delimited list of strand specific assemblies.
  --list LIST           Tab-delimited file containing rows with the following format: <file> <label> <strandedness(def. False)> <score(optional, def. 0)> <is_reference(optional, def. False)>
                        <exclude_redundant(optional, def. True)> <strip_cds(optional, def. False)> <skip_split(optional, def. False)> "strandedness", "is_reference", "exclude_redundant", "strip_cds" and
                        "skip_split" must be boolean values (True, False) "score" must be a valid floating number.
  -l LOG, --log LOG     Log file. Optional.
  --lenient             Flag. If set, transcripts with only non-canonical splices will be output as well.
  -m MINIMUM_CDNA_LENGTH, --minimum-cdna-length MINIMUM_CDNA_LENGTH
                        Minimum length for transcripts. Default: 200 bps.
  -MI MAX_INTRON_LENGTH, --max-intron-size MAX_INTRON_LENGTH
                        Maximum intron length for transcripts. Default: 1,000,000 bps.
  -p PROCS, --procs PROCS
                        Number of processors to use (default None)
  -scds, --strip_cds    Boolean flag. If set, ignores any CDS/UTR segment.
  --labels LABELS       Labels to attach to the IDs of the transcripts of the input files, separated by comma.
  --codon-table CODON_TABLE
                        Codon table to use. Default: 0 (ie Standard, NCBI #1, but only ATG is considered a valid start codon.
  --single, --single-thread
                        Disable multi-threading. Useful for debugging.
  -od OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
  -o OUT, --out OUT     Output file. Default: mikado_prepared.gtf.
  -of OUT_FASTA, --out_fasta OUT_FASTA
                        Output file. Default: mikado_prepared.fasta.
  --configuration CONFIGURATION, --json-conf CONFIGURATION
                        Configuration file.
  -er, --exclude-redundant
                        Boolean flag. If invoked, Mikado prepare will exclude redundant models,ignoring the per-sample instructions.
  --strip-faulty-cds    Flag. If set, transcripts with an incorrect CDS will be retained but with their CDS stripped. Default behaviour: the whole transcript will be considered invalid and discarded.
  --seed SEED           Random seed number. Default: 0.
  --random-seed         Generate a new random seed number (instead of the default of 0)

7.1.2.2. Collection of transcripts from the annotation files

Different assemblers will produce data in different formats, typically in GFF or GTF format, and not necessarily in the same order (if any is present). Mikado will serialise the transcripts from these files and port them all into a standard GTF format. Moreover, it will ensure that each transcript ID appears only once across the input files. The optional labels provided for each file will be attached to the transcript names as prefixes, and used as the source field in the output GTF, to ensure the uniqueness of each transcript name. If two or more transcripts are found to be identical, only one will be retained, chosen at random among all the possibilities. In addition to this, Mikado prepare will also sort the transcripts by coordinate, irrespective of strand, so that they are suitably displayed for the divide-et-impera algorithm of Mikado pick.

When two or more identical transcripts are present in a locus, Mikado will use the (optionally provided) source score to select the a priori best assembly amongst the choices. For example, if a mikado prepare run comprises both PacBio reads and Illumina assemblies and the experimenter has given a score of 1 or more to the former dataset but not the latter, if a PacBio read is present together with a stringtie assembly, the PacBio read will always be selected over the StringTie. Please note that this “score-based” selection *only operates for transcripts that are identical. No other selection is performed at this stage.

Warning

To be considered identical, two transcripts must match down to the last base pair. A simple match or containment of the intron chain will not suffice. This is because using the cDNA data alone it is difficult to understand whether the longer form(s) is the correct assembly rather than a chimera or a trans-splice event.

Note

From version 1.3 onwards, Mikado considers the CDS as well when performing the redundancy check. So, two transcripts having the same coordinates but different CDS (because of non-overlapping ORFs or disagrement on the frame and/or start codon position) will be kept as non-redundant.

Note

Transcripts that are considered to come from a “reference” assembly are never going to be excluded, and will always be prioritised over other assemblies.

7.1.2.2.1. Removal of redundant transcripts

Many third-party tools, e.g. gffread [GffRead], try to simplify transcript assemblies by grouping transcripts according to their intron chains and then keeping only one transcript per group, usually the longest. This removes transcripts with identical intron chains as well as transcripts whose intron chain is completely contained within another one in the group. In most cases, Mikado explicitly does not take this approach because, especially with RNASeq assemblies, longer transcripts might not necessarily be the most correct; rather, in a non-negligible portion of cases, longer transcripts might have originated by an artefactual fusion of two different, neighbouring transcripts. The implicit assumption made by e.g. gffread (that shorter transcripts are the result of fragmentation of the longer transcripts) would therefore lead to incorrect assemblies. The default approach taken by Mikado, therefore, is to identify cases where transcripts are completely identical (both in terms of cDNA and CDS, if kept), and only remove redundancies in those rare, specific cases.

In certain situations, however, a strategy based on intron chain redundancy like in gffread might be warranted. Specifically:

  • in long read datasets (e.g. PacBio or ONT alignments) the implicit assumption made by gffread is valid: in these cases it is safe to assume that fragmentation during RNA extraction and library preparation would constitute the main origin of redundancy.
  • when dealing with massive transcript datasets (>=5-10 million transcripts), removing excess transcripts might be necessary to keep the analysis manageable, at the cost of slightly reduced accuracy.

Mikado allows to perform this more extensive redundancy removal either on a per-analysis or per-sample (recommended) basis. When scanning the transcript assemblies, Mikado will look for intron chains that are completely contained within another. When such an occurence arises, if and only if Mikado has been instructed to remove redundant cases, Mikado will do the following:

  • if one of the two transcripts comes from a sample for which the redundancy removal is disabled (including, automatically, all “reference” samples), it will always be kept.
  • if the transcript marked for redundancy check and removal has a lower baseline score or is contained within the other transcript, it will be marked for removal.

7.1.2.2.2. Check on strand correctness

During its run, Mikado prepare will also check the correctness of the transcripts. In particular:

  • Unless the assembly is marked as strand-specific, any monoexonic transcript will have its strand removed.
  • If a transcript contains canonical splice junctions on both strands, it will be completely removed
  • If a transcript contains only non-canonical splice junctions, it will be removed unless the “lenient” option is specified either at the command line or in the configuration file.

The couples of splice acceptors and donors which are considered as canonical can be specified in the configuration file. By default, Mikado will consider as canonical both properly canonical splicing event (GT-AG) as well as the semi-canonical events (GC-AG, AT-AC). Any other couple will be considered as non-canonical.

Warning

Mikado will check the strand of each junction inside a transcript independently. Therefore, if a transcript with 9 junctions on the plus strand is found to have a non-canonical splicing junction which happens to be the reverse of a canonical one (eg. CT-AC), it will deem this junction as misassigned to the wrong strand and flip it to the minus strand. In this example, the transcript will therefore be considered as an error as it contains both + and - junctions. If Mikado prepare is running in lenient mode, these transcripts will be marked as having a “suspicious_splicing” and kept in the output. Otherwise, they will be discarded outright.

Warning

The same considerations as above apply to transcripts that have no recognisable strand for their junctions. Such transcripts will be removed outright if lenient is False.

Note

Starting from Mikado version 1.3, transcripts can be tagged as being from an assembly of “reference” quality.

This implies that:

  • A transcript which is marked as “reference” will never have its CDS stripped
  • A transcript which is marked as “reference” will never be marked for removal due to redundancy, even if there are multiple copies of it, or if other assemblies with a higher score have identical transcripts (normally only one transcript would be retained, and that would be chosen amongst the highest scoring assemblies)
  • A transcript which is marked as reference will never have its strand removed or flipped.

Please see the configuration help page for details.

7.1.2.2.3. Check on ORF correctness

Mikado will check that the input transcripts have a formally valid ORF both in terms of the structure as well as of its sequence.

By “formally correct ORF structure”, Mikado means that:

  • CDS segments must be contained within declared exons
  • CDS segments should not overlap each other or any declared UTR segment
  • there should be no gap between CDS segments on a transcript’s cDNA
  • including the initial phase, the length of the CDS should be a multiple of three.

By “formally correct ORF sequence”, Mikado means that:

  • there are no internal in-frame stop codons

To perform the latter check, Mikado will use the specified codon table (by default “0”, ie the NCBI Standard codon table but only considering “ATG” as a valid start codon).

Finally, Mikado will infer whether the transcript has a start and/or stop codon (again, using the selected codon table) and tag it appropriately.

7.1.2.2.4. Output files

Mikado prepare will produce two files:

  • a sorted GTF file, containing all the transcripts surviving the checks
  • a FASTA file of the transcripts, in the proper cDNA orientation.

Warning

contrary to other tools such as eg gffread [GffRead], Mikado prepare will not try to calculate the loci for the transcripts. This task will be performed later in the pipeline. As such, the GTF file is formally incorrect, as multiple transcripts in the same locus but coming from different assemblies will not have the same gene_id but rather will have kept their original one. Moreover, if two gene_ids were identical but discrete in the input files (ie located on different sections of the genome), this error will not be corrected. If you desire to use this GTF file for any purpose, please use a tool like gffread to calculate the loci appropriately.