Mikado: pick your transcript


Authors: Venturini Luca, Yanes Luis, Caim Shabhonam, Mapleson Daniel, Kaithakottil Gemy George, Swarbreck David
Version: Mikado v2.2.4 (06 April 2021)

Mikado is a lightweight Python3 pipeline to identify the most useful or “best” set of transcripts from multiple transcript assemblies. Our approach leverages transcript assemblies generated by multiple methods to define expressed loci, assign a representative transcript and return a set of gene models, selecting against transcripts that are chimeric, fragmented, or have short or disrupted CDS. Loci are first defined based on overlap criteria and each transcript therein is scored based on up to 50 available metrics relating to ORF and cDNA size, relative position of the ORF within the transcript, UTR length and presence of multiple ORFs. Mikado can also utilize BLAST data to score transcripts based on protein similarity and to identify and split chimeric transcripts. Optionally, junction confidence data as provided by Portcullis [Portcullis] can be used to improve the assessment. The best-scoring transcripts are selected as the primary transcripts of their respective gene loci; additionally, Mikado can bring back other valid splice variants that are compatible with the primary isoform.

Mikado uses GTF or GFF files as mandatory input. Non-mandatory but highly recommended input data can be generated by obtaining a set of reliable splicing junctions with Portcullis, by locating coding ORFs on the transcripts using Transdecoder, and by obtaining homology information through BLASTX [Blastplus].

Our approach can also incorporate sequences generated by de novo Illumina assemblers or reads generated by long-read technologies such as PacBio.

Our tool was presented at Genome Science 2016, both as a poster and in a talk during the Bioinformatics showcase session.

Mikado was published in GigaScience in August 2018 [Mikado]. We provide a PDF copy of the open access paper on this website, for reference.

Development is currently active, and Mikado is tightly integrated in an upcoming pipeline for genome annotation refinement, Minos.

Mikado version 2: integrating multiple gene predictions

During the summer of 2019, we finished work on the new version of Mikado. The focus of this work was to make Mikado a software product capable of integrating the results of multiple gene annotations, similarly to [PASA] or [Maker2]. Unlike Maker2, Mikado is not in itself a full annotation pipeline; we are currently working on one, which will use Mikado first to clean the transcript assemblies, and then to create a final gene annotation by comparing multiple ab initio annotations together with protein alignments and transcript assemblies or cDNA alignments.

Starting from this version, Mikado is therefore capable of considering arbitrary measures of transcript quality (e.g. transcript quantification or similarity of the predicted ORF against a known protein database); moreover, it is capable of reconciling the structures of the transcripts present in a single locus. This makes it possible, for example, to add an inferred UTR to ab initio predictions using complementary RNA-Seq data. Through this mechanism, Mikado is also capable of reconstructing the correct ORF of transcripts present only in fragmentary form, as long as there is at least one other transcript in the locus that can provide the missing data. This mechanism is similar to the one implemented in [PASA]. Please see the relevant section in Algorithms for details.

Citing

If you use Mikado in your work, please consider citing:

Venturini L., Caim S., Kaithakottil G., Mapleson D.L., Swarbreck D. Leveraging multiple transcriptome assembly methods for improved gene structure annotation. GigaScience, Volume 7, Issue 8, 1 August 2018, giy093, doi:10.1093/gigascience/giy093

If you also use Portcullis to provide reliable junctions to Mikado, either independently or as part of the Daijin pipeline, please consider citing:

Mapleson D.L., Venturini L., Kaithakottil G., Swarbreck D. Efficient and accurate detection of splice junctions from RNAseq with Portcullis. GigaScience, Volume 7, Issue 12, 12 December 2018, giy131, doi:10.1093/gigascience/giy131

Availability and License

Open source code available on github: https://github.com/EI-CoreBioinformatics/mikado

For Linux and OSX (the latter only since v2.2.3) we also provide installation through Conda: https://anaconda.org/bioconda/mikado.

Please report any issue you might encounter to the EI-CoreBioinformatics issue tracker.

This documentation is hosted publicly on Read the Docs: https://mikado.readthedocs.org/en/latest/

Mikado is available under the GNU LGPL v3.

Acknowledgements

Mikado has greatly benefited from several publicly available libraries, in particular [Cython], the [NetworkX] library, Scipy, Numpy and Pandas ([Scipy], [Numpy], [Pandas]), BioPython [BioPython], Intervaltree [PYinterval], and the BX library for a Cython implementation of interval trees [BXPython]. Moreover, Mikado makes liberal use of the PySAM [PySAM] library for analysing SAM/BAM files as well as for working with FASTA files. Mikado has also been constantly optimised using Snakeviz [Snakeviz], a tool which proved invaluable during the development process.

Credits

  • Luca Venturini (The software architect and developer)
  • Shabhonam Caim (Primary tester and analytic consultancy)
  • Daniel Mapleson (Developer of PortCullis and of the Daijin pipeline)
  • Luis Yanes (Software developer)
  • Gemy Kaithakottil (Tester and analytic consultancy)
  • David Swarbreck (Annotation guru and ideator of the pipeline)

Contents

Introduction

Numerous algorithms have been proposed to analyse RNA-Seq data, both to align the reads to a reference genome ([TopHat2], [STAR], [Hisat]) and to assemble them to infer the sequence and structure of the original molecules present in the sample. The latter phase can be performed either by using alignment data against the reference genome ([Cufflinks], [StringTie], [Class2], [Trinity]) or in the absence of such information ([Trinity], [Oases], [Bridger]). Each of these methods has to contend with numerous sources of variability in the RNA-Seq data:

  • alternative splicing events at the same locus
  • extremely variable expression and therefore sequencing depth at different loci
  • presence of orthologous genes with very similar cDNAs.
  • tandem duplicated genes which can be artefactually reconstructed as a single, fused entity

Multiple assessments of these methods have been performed on real and simulated data. In particular, the RGASP consortium promoted a competition among some of the most popular methods in 2012 [RGASP]; the results showed that the accuracy of the methods depended on the input data, the expression levels of the genes, and the species under analysis. A different concern regards the completeness of each of the assembly methods, i.e. whether methods that are not necessarily the best across all gene loci might nonetheless outperform other competitors for specific contigs. Recently, approaches have been proposed to integrate multiple RNA-Seq assemblies by inferring their protein-coding content and collapsing them accordingly [EviGeneTobacco], or to determine the best proposed assembly using multiple measures related to how the transcript compares to the input RNA-Seq reads ([Transrate], [RSEMeval]).

A different but related line of research has focused on how to integrate data coming from multiple samples. While some researchers advocate merging together the input data and assembling it [Trinity], others have developed methods to integrate multiple assemblies into a single coherent annotation ([CuffMerge], Stringtie-merge). Another common approach relies on a third-party tool to integrate multiple assemblies together with other data, typically protein alignments and ab initio predictions, to select the best model. The latter approach has been chosen by some of the most popular genome annotation pipelines of recent years ([EVM], [Maker]).

Our tool, Mikado, contributes to this area of research by proposing a novel algorithm to integrate multiple transcript assemblies, leveraging additional data regarding the position of ORFs (generally derived with Transdecoder [Trinity]), sequence similarity (derived with BLAST [Blastplus]) and reliable splicing junctions (generally derived using Portcullis [Portcullis]). Through the combination of these input data sources, we are capable of identifying artefacts - especially gene fusions and small fragments - and retrieve the original transcripts in each locus with high accuracy and sensitivity.

An important caveat is that our approach does not look for the best candidate in terms of the input data, as Transrate [Transrate] or RSEM-Eval [RSEMeval] do. Rather, our approach is more similar to EvidentialGene [EviGeneTobacco] as Mikado will try to find the candidates most likely to represent the isoforms which would be annotated as canonical by a human annotator. Biological events such as intron retentions or trans-splicing that might be present in the sample are explicitly selected against, in order to provide a clean annotation of the genome.

Overview

Mikado analyses an ensemble of RNA-Seq assemblies by looking for overlapping transcripts based on their genomic position. Such clusters of genes, or superloci, are then analysed to locate the gene loci that will form the final annotation. In order, the steps of the pipeline are as follows:

  1. In the first step, mikado prepare will combine assemblies from multiple sources into a single coherent, filtered dataset.
  2. In the second step, mikado serialise will collect data from multiple sources (Portcullis, Transdecoder, BLASTX) and load it into a database for fast integration with the input data.
  3. In the final step, mikado pick will integrate the prepared transcripts with the serialised additional data to choose the best transcripts.
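The three stages above can be chained from Python via subprocess. This is an illustrative sketch (the full command lines, with all flags, are covered in the Tutorial); the run_pipeline() wrapper is our own, not part of the Mikado API.

```python
# Illustrative sketch: chaining the three Mikado stages from Python.
# The command lines mirror the tutorial in this documentation; the
# run_pipeline() wrapper is our own, not part of Mikado.
import subprocess

stages = [
    ["mikado", "prepare", "--json-conf", "configuration.yaml"],
    ["mikado", "serialise", "--json-conf", "configuration.yaml",
     "--xml", "mikado_prepared.blast.tsv", "--orfs", "mikado.orfs.gff3"],
    ["mikado", "pick", "--configuration", "configuration.yaml"],
]

def run_pipeline(dry_run=True):
    """Run each stage in order, stopping at the first failure."""
    for cmd in stages:
        if dry_run:
            print(" ".join(cmd))             # show what would be executed
        else:
            subprocess.run(cmd, check=True)  # raise on non-zero exit status

run_pipeline(dry_run=True)
```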

By far, the most important source of data for Mikado is the ORF calling, which can be executed with Prodigal or Transdecoder. This is because many of the algorithms for finding and splitting chimeras depend on the detection of the ORFs on the transcript sequences. Moreover, the ORF features are a very important part of the scoring mechanism. Please refer to the Algorithms section for more details on how Mikado operates.

Using these multiple data sources, and its distinctive iterative method to identify and disentangle gene loci, Mikado is capable of bringing together methods with very different results, and of interpreting the composition of a locus similarly to how a manual annotator would. This leads to cleaned RNA-Seq annotations that can be used as the basis of the downstream analysis of choice, such as a hint-based ab initio annotation with Augustus [Augustus] or Maker2 [Maker].

Mikado in action: recovering fused gene loci

_images/locus_example.jpeg

Using data from the ORFs, Mikado (in red) is capable of identifying and breaking artefactual gene fusions found by the assemblers (in green) in this locus. This allows Mikado to report the correct gene loci, even though no single method retrieved all the loci in isolation, and some of the correct genes were not present at all. Coding sections of Mikado models are in dark red, UTR segments in pale red. The reference annotation is coloured in blue, with the same colour scheme: dark for coding regions, pale for UTRs.

The command-line interface

The Mikado suite provides two commands: mikado and daijin. The former provides access to the main functionalities of the suite:

$ mikado --help
usage: Mikado [-h] {configure,prepare,serialise,pick,compare,util} ...

Mikado is a program to analyse RNA-Seq data and determine the best transcript
for each locus in accordance to user-specified criteria.

optional arguments:
  -h, --help            show this help message and exit

Components:
  {configure,prepare,serialise,pick,compare,util}
                        These are the various components of Mikado:
    configure           This utility guides the user through the process of
                        creating a configuration file for Mikado.
    prepare             Mikado prepare analyses an input GTF file and prepares
                        it for the picking analysis by sorting its transcripts
                        and performing some simple consistency checks.
    serialise           Mikado serialise creates the database used by the pick
                        program. It handles Junction and ORF BED12 files as
                        well as BLAST XML results.
    pick                Mikado pick analyses a sorted GTF/GFF files in order
                        to identify its loci and choose the best transcripts
                        according to user-specified criteria. It is dependent
                        on files produced by the "prepare" and "serialise"
                        components.
    compare             Mikado compare produces a detailed comparison of
                        reference and prediction files. It has been directly
                        inspired by Cufflinks's cuffcompare and ParsEval.
    util                Miscellaneous utilities

Each of these subcommands is explained in detail in the Usage section.

daijin instead provides the interface to the Daijin pipeline manager, which manages the task of going from a dataset of multiple reads to the Mikado final picking. This is its interface:

$ daijin --help

usage: A Directed Acyclic pipeline for gene model reconstruction from RNA seq data.
        Basically, a pipeline for driving Mikado. It will first align RNAseq reads against
        a genome using multiple tools, then creates transcript assemblies using multiple tools,
        and find junctions in the alignments using Portcullis.
        This input is then passed into Mikado.
       [-h] {configure,assemble,mikado} ...

optional arguments:
  -h, --help            show this help message and exit

Pipelines:
  {configure,assemble,mikado}
                        These are the pipelines that can be executed via
                        daijin.
    configure           Creates the configuration files for Daijin execution.
    assemble            A pipeline that generates a variety of transcript
                        assemblies using various aligners and assemblers, as
                        well a producing a configuration file suitable for
                        driving Mikado.
    mikado              Using a supplied configuration file that describes all
                        input assemblies to use, it runs the Mikado pipeline,
                        including prepare, BLAST, transdecoder, serialise and
                        pick.

Installation

System requirements

Mikado requires CPython 3.6 or later to function (Python 2 is not supported). Additionally, it requires a functioning installation of one among SQLite, PostgreSQL and MySQL. Mikado has the following Python dependencies:

biopython>=1.78
datrie>=0.8
docutils
drmaa
hypothesis
msgpack>=1.0.0
networkx>=2.3
numpy>=1.17.2
pandas>=1.0
pip
pysam>=0.15.3
pyyaml>=5.1.2
scipy>=1.3.1
snakemake>=5.7.0
sqlalchemy>=1.3.9,<1.4.0
sqlalchemy-utils>=0.34.1
tabulate>=0.8.5
pytest>=5.4.1
python-rapidjson>=1.0.0
toml>=0.10.0
pyfaidx>=0.5.8
dataclasses; python_version < '3.7'
marshmallow
marshmallow-dataclass
typeguard  # Necessary for mashmallow apparently

Mikado has modest system requirements, and is capable of analysing human data with less than 4GB of RAM in all stages. It benefits from the availability of multiple processors, as many of the steps of the pipeline are parallelised.

Mikado is at its core a data integrator. The Daijin pipeline has been written to facilitate the creation of a functioning workflow. To function, it requires Snakemake [Snake] and the presence of at least one supported RNA-Seq aligner and one supported transcript assembler. If you are planning to execute it on a cluster, we do support job submission on SLURM, LSF and PBS clusters, either in the presence or absence of DRMAA.

Download

Mikado is available through BioConda; to install it, select or configure a python3 Conda environment, add the bioconda channel to your environment, and then install it with:

conda install -c bioconda mikado

This will also take care of installing companion tools such as PortCullis. Even with conda, BLAST+, Prodigal, Diamond and TransDecoder have to be installed separately. This can be achieved with:

conda install -c bioconda prodigal blast transdecoder diamond

Mikado is available on PyPI, so it is possible to install it with

pip3 install mikado

The source for the latest release on PyPI can be obtained with

pip3 download mikado

As the package contains some core Cython components, it might be necessary to download and compile the source code instead of relying on the wheel.

Alternatively, Mikado can be installed from source by obtaining it from our GitHub repository. Either download the latest release, or clone the latest development snapshot with

git clone https://github.com/EI-CoreBioinformatics/mikado.git; cd mikado

Install using containers

We support both Docker and Singularity as container technologies. On GitHub, we currently provide:

  • A Docker file tracking the “master” GitHub branch, with an Ubuntu 20.04 guest
  • A Singularity recipe tracking the “master” GitHub branch, with an Ubuntu 20.04 guest
  • A Singularity recipe tracking the “master” GitHub branch, with a CentOS 7 guest

We plan to release them in the Docker and Singularity hubs.

Building and installing from source

If you desire to install Mikado from source, this can be achieved with

pip install -U .

Testing the installed module

It is possible to test whether Mikado has been built successfully by opening a python3 interactive session and typing:

>>> import Mikado
>>> Mikado.test()

Alternatively, you can use pytest:

$ pytest --pyargs Mikado

This will run all the tests included in the suite.

Tutorial

This tutorial will guide you through a simple analysis with Mikado, using a small amount of data from an experiment on Arabidopsis thaliana. RNA-Seq data was obtained from study PRJEB7093 on ENA, aligned with STAR [STAR] against the TAIR10 reference genome, and assembled with four different programs. For this small example, we are going to focus on a small genomic region: Chr5, from 26,575,364 to 26,614,205.

During this analysis, you will require the following files:

All of this data can also be found in the sample_data directory of the Mikado source.

You will also require the following software:

  • a functioning installation of SQLite.
  • a functioning version of BLAST+ [Blastplus].
  • a functioning version of Prodigal [Prodigal].

Creating the configuration file for Mikado

In the first step, we need to create a configuration file to drive Mikado. To do so, we will first create a tab-delimited file describing our assemblies (class.gtf, cufflinks.gtf, stringtie.gtf, trinity.gff3):

class.gtf       cl      True            False   False   True
cufflinks.gtf   cuff    True            False   False   True
stringtie.gtf   st      True    1       False   True    True
trinity.gff3    tr      False   -0.5    False   False   True
reference.gff3  at      True    5       True    False   False
pacbio.bam      pb      True    1       False   False   False

In this file, the first three fields define the following:

  1. The file location and name (if no folder is specified, Mikado will look for each file in the current working directory)
  2. An alias associated with the file, which has to be unique
  3. A binary flag (True / False) indicating whether the assembly is strand-specific or not

These fields are then followed by a series of optional fields:

  1. A score associated with that sample. All transcripts associated with the label will have their score corrected by the value in this field. So, e.g., in this example all StringTie models will receive an additional point, and all Trinity models will be penalised by half a point. CLASS and Cufflinks models have no bonus or malus associated.
  2. A binary flag (True / False) defining whether the sample is a reference or not.
  3. A binary flag (True / False) defining whether to exclude redundant models or not.
  4. A binary flag (True / False) indicating whether Mikado prepare should strip the CDS of faulty models, but otherwise keep their cDNA structure in the final output (True) or whether instead it should completely discard such models (False).
  5. A binary flag (True / False) instructing Mikado about whether the chimera split routine should be skipped for these models (True) or if instead it should proceed normally (False).
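As a sanity check before running mikado configure, the list file can be parsed with a few lines of Python. This is an illustrative parser based on the column order described above, not Mikado's own reader; the field names are our own.

```python
# Illustrative parser for the tab-delimited input list described above.
# Field names are ours; missing optional columns get default values.
import csv
from io import StringIO

FIELDS = ["path", "alias", "stranded", "score", "is_reference",
          "exclude_redundant", "strip_cds", "skip_split"]

def parse_list(handle):
    rows = []
    for raw in csv.reader(handle, delimiter="\t"):
        raw = [field.strip() for field in raw]
        # pad short rows so every record has all eight fields
        row = dict(zip(FIELDS, raw + [""] * (len(FIELDS) - len(raw))))
        row["stranded"] = row["stranded"] == "True"
        row["score"] = float(row["score"]) if row["score"] else 0.0
        rows.append(row)
    return rows

rows = parse_list(StringIO("stringtie.gtf\tst\tTrue\t1\tFalse\tTrue\tTrue\n"))
print(rows[0]["alias"], rows[0]["score"])  # st 1.0
```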

Finally, we will create the configuration file itself using mikado configure:

mikado configure --list list.txt --reference chr5.fas.gz --mode permissive --scoring plants.yaml --copy-scoring plants.yaml --junctions junctions.bed -bt uniprot_sprot_plants.fasta configuration.yaml

This will create a configuration.yaml file with the parameters that were specified on the command line. This is a simplified configuration file, containing all the necessary parameters for the Mikado run. It will also copy the plants.yaml file from the Mikado installation to your current working directory.

Hint

Mikado can accept compressed genome FASTA files, like in this example, as long as they have been compressed with BGZip rather than the vanilla UNIX GZip.
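BGZF (the BGZip format) is regular gzip with an extra “BC” subfield in each block header, which is what enables random access. A quick way to check whether a compressed genome is BGZip rather than plain gzip (an illustrative check, not part of Mikado):

```python
# Check whether a file is BGZF (BGZip) rather than plain gzip.
# BGZF blocks are gzip members with the FEXTRA flag set (4th byte == 0x04)
# and a "BC" extra subfield at offset 12. Illustrative, not Mikado code.
def is_bgzf(path):
    with open(path, "rb") as fh:
        head = fh.read(16)
    return (len(head) == 16
            and head[:4] == b"\x1f\x8b\x08\x04"   # gzip magic + FEXTRA flag
            and head[12:14] == b"BC")             # BGZF subfield identifier
```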

  • --list list.txt: instructs Mikado to read the file we just created, to understand where the input files are and how to treat them.
  • --scoring plants.yaml: the scoring file to use. Mikado ships with two pre-calculated scoring files, plants.yaml and mammalian.yaml.
  • --copy-scoring: instructs Mikado to copy the scoring file from the installation directory to the current directory, so that the experimenter can modify it as needed.
  • --reference chr5.fas.gz: the location of the genome file.
  • --mode permissive: the mode in which Mikado will treat cases of chimeras. See the documentation on the chimera splitting algorithm for details.
  • --junctions junctions.bed: instructs Mikado to consider this file as the source of reliable splicing junctions.
  • -bt uniprot_sprot_plants.fasta: instructs Mikado to consider this file as the BLAST database which will be used for deriving homology information.

Hint

The --copy-scoring argument is usually not necessary; however, it allows you to easily inspect the scoring file we are going to use during this run.

Hint

Mikado provides a handful of pre-configured scoring files for different species. However, we do recommend inspecting and tweaking your scoring file to cater to your species. We provide a guide on how to create your own configuration files here.

Mikado prepare

The subsequent step involves running mikado prepare to create a sorted, non-redundant GTF with all the input assemblies. As we have already created a configuration file with all the details regarding the input files, this will require us only to issue the command:

mikado prepare --json-conf configuration.yaml

This command will create three files:

  1. mikado_prepared.gtf: one of the two main output files. This is a sorted, non-redundant GTF containing the transcripts from the four input GTFs

  2. mikado_prepared.fasta: a FASTA file of the transcripts present in mikado_prepared.gtf.

  3. prepare.log: the log of this step. This should look like the following, minus the timestamps:

    2016-08-10 13:53:58,443 - prepare - prepare.py:67 - INFO - setup - MainProcess - Command line: /usr/users/ga002/venturil/py351/bin/mikado prepare --json-conf configuration.yaml
    2016-08-10 13:53:58,967 - prepare - prepare.py:206 - INFO - perform_check - MainProcess - Finished to analyse 95 transcripts (93 retained)
    2016-08-10 13:53:58,967 - prepare - prepare.py:405 - INFO - prepare - MainProcess - Finished
    

At the end of this phase, you should have 93 candidate transcripts, as 2 were redundant.

BLAST of the candidate transcripts

Although it is not strictly necessary, Mikado benefits from integrating homology data from BLAST. Mikado accepts this data either as a custom tabular format (shown below), as XML, or in ASN format (in the latter case, blast_formatter will be used to convert it in-memory to XML).

To create this file, we will proceed as follows:

  1. Uncompress the SwissProt database:

    gzip -dc uniprot_sprot_plants.fasta.gz > uniprot_sprot_plants.fasta
    
  2. Prepare the database for the BLAST:

    makeblastdb -in uniprot_sprot_plants.fasta -dbtype prot -parse_seqids > blast_prepare.log
    
  3. Execute the BLAST, asking for tabular output with the required custom fields:

    blastx -max_target_seqs 5 -num_threads 10 -query mikado_prepared.fasta -db uniprot_sprot_plants.fasta -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop" -out mikado_prepared.blast.tsv

This will produce the mikado_prepared.blast.tsv file, which contains the homology information for the run.

Warning

Mikado requires a custom tabular file from BLAST, as it relies on information in extra fields such as btop. Therefore, the custom fields following -outfmt 6 are not optional.
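The 14 columns requested above can be read back with standard tools. This is an illustrative reader, not Mikado's own parser; the column names follow the -outfmt string of the blastx command, and the example line is made up.

```python
# Illustrative reader for the custom BLAST tabular output requested above.
# Column names follow the -outfmt string of the blastx command.
import csv
from io import StringIO

BLAST_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                "gapopen", "qstart", "qend", "sstart", "send",
                "evalue", "bitscore", "ppos", "btop"]

def read_blast_tsv(handle):
    for row in csv.reader(handle, delimiter="\t"):
        hit = dict(zip(BLAST_FIELDS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        yield hit

# A made-up example line, for illustration only:
line = "tr_1\tsp|P12345\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-50\t400\t99.0\t200\n"
hit = next(read_blast_tsv(StringIO(line)))
print(hit["sseqid"], hit["evalue"])
```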

ORF calculation for the transcripts

Many of the metrics used by Mikado to evaluate and rank transcripts rely on the definition of their coding regions (CDS). It is therefore highly recommended to use an ORF predictor to define the coding regions for each transcript identified by mikado prepare. We directly support two different programs:

  • Prodigal, a fast ORF predictor, capable of calculating thousands of ORFs in seconds. However, as it was originally developed for ORF calling in bacterial genomes, it may occasionally not provide the best possible answer.
  • TransDecoder, a slower ORF predictor that is however more specialised for eukaryotes.

For this tutorial we are going to use Prodigal. Using it is very straightforward:

prodigal -i mikado_prepared.fasta -g 1 -o mikado.orfs.gff3 -f gff

Warning

Prodigal by default uses the ‘Bacterial’ codon translation table, which is not appropriate for our eukaryotic genome. Therefore, it is essential to set -g 1 on the command line. By the same token, as Prodigal would normally output the CDS prediction in GenBank format (currently not supported by Mikado), we have to instruct it to emit its CDS predictions in GFF format.

Mikado serialise

This step involves running mikado serialise to create a SQLite database with all the information that mikado needs to perform its analysis. As most of the parameters are already specified inside the configuration file, the command line is quite simple:

mikado serialise --json-conf configuration.yaml --xml mikado_prepared.blast.tsv --orfs mikado.orfs.gff3 --blast_targets uniprot_sprot_plants.fasta --junctions junctions.bed

After mikado serialise has run, it will have created two files:

  1. mikado.db, the SQLite database that will be used by pick to perform its analysis.
  2. serialise.log, the log of the run.

If you inspect the SQLite database mikado.db, you will see it contains nine different tables:

$ sqlite3 mikado.db ".tables"
chrom             hit               orf
external          hsp               query
external_sources  junctions         target

These tables contain the information coming from the genome FAI, the BLAST XML, the junctions BED file, the ORFs BED file, and finally the input transcripts and the proteins. There are two additional tables (external and external_sources) which in other runs would contain information on additional data, provided as tabular files.
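The same inspection can be done programmatically with Python's built-in sqlite3 module; this is a minimal sketch, equivalent to the ".tables" command above.

```python
# List the tables of a `mikado serialise` database (e.g. mikado.db),
# equivalent to the sqlite3 ".tables" command shown above.
import sqlite3
from contextlib import closing

def list_tables(db_path):
    with closing(sqlite3.connect(db_path)) as conn:
        cur = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        return [name for (name,) in cur]

# Example: list_tables("mikado.db") should include "hit", "orf", "junctions"
```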

For more details on the database structure, please refer to the section on this step in this documentation.

Mikado pick

Finally, during this step mikado pick will integrate the data present in the database with the positional and structural data present in the GTF file to select the best transcript models. The command line to be issued is the following:

mikado pick --configuration configuration.yaml --subloci_out mikado.subloci.gff3

At this step, we have to specify only some parameters for pick to function:

  • –configuration: the configuration file. This is the only compulsory option.
  • –subloci_out: the partial results concerning the subloci step during the selection process will be written to

mikado.subloci.gff3.

mikado pick will produce the following output files:

  • mikado.loci.gff3, mikado.loci.metrics.tsv, mikado.loci.scores.tsv: the proper output files. These contain the location of the selected transcripts, their metrics, and their scores. Please see this section for details.
  • mikado.subloci.gff3, mikado.subloci.metrics.tsv, mikado.subloci.scores.tsv: these files contain the same type of information as those above, but for the subloci stage. As such, all the transcripts in the input files are represented, not just those that are going to be selected as the best.
  • mikado_pick.log: the log file for this operation.

Comparing files with the reference

Finally, we can compare our files to the original reference annotation, and see how our results compare to it. To do so, we will use Mikado compare. The first step is to index the reference annotation to make the comparisons faster:

mikado compare -r reference.gff3 --index

This will create a new file, reference.gff3.midx. If you inspect it with e.g. zless, you will notice it is a SQLite database, describing the locations and components of each gene in the annotation. Now that we have indexed the reference, we can perform the comparisons we are interested in:

  1. Reference vs. the input transcripts:
mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log
  2. Reference vs. the subloci stage:
mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log
  3. Reference vs. the final output:
mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log

Each of these comparisons will produce three files:

  • a tmap file, detailing the best match in the reference for each of the query transcripts;
  • a refmap file, detailing the best match among the query transcripts for each of the reference transcripts;
  • a stats file, summarising the comparisons.
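In these stats files, each F1 value is the harmonic mean of the sensitivity (Sn) and precision (Pr) on its row, both expressed as percentages:

```python
# F1 as reported in Mikado compare's stats files: the harmonic mean of
# sensitivity (Sn) and precision (Pr), both given as percentages.
def f1(sn, pr):
    return 2 * sn * pr / (sn + pr) if (sn + pr) else 0.0

# Base level of the first stats file below: Sn 95.97, Pr 29.39
print(round(f1(95.97, 29.39), 2))  # → 45.0
```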

The stats file for the input GTF should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log
7 reference RNAs in 5 genes
93 predicted RNAs in  64 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 95.97  29.39  45.00
            Exon level (stringent): 68.09  18.60  29.22
              Exon level (lenient): 90.91  31.25  46.51
                      Intron level: 94.74  45.57  61.54
                Intron chain level: 16.67  1.59  2.90
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 14.29  1.08  2.00
  Transcript level (>=80% base F1): 14.29  1.08  2.00
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 20.00  1.56  2.90
        Gene level (>=80% base F1): 20.00  1.56  2.90

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 1
             Matched intron chains: 1
   Matching monoexonic transcripts: 0
    Matched monoexonic transcripts: 0
        Total matching transcripts: 1
         Total matched transcripts: 1

          Missed exons (stringent): 15/47  (31.91%)
           Novel exons (stringent): 140/172  (81.40%)
            Missed exons (lenient): 4/44  (9.09%)
             Novel exons (lenient): 88/128  (68.75%)
                    Missed introns: 2/38  (5.26%)
                     Novel introns: 43/79  (54.43%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 24/93  (25.81%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 21/64  (32.81%)
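The Sn (sensitivity), Pr (precision) and F1 columns are related: F1 is the harmonic mean of sensitivity and precision. As a quick sanity check, the base-level F1 of the report above can be re-derived from the printed Sn and Pr values (small discrepancies are due to the rounding of the printed numbers):

```shell
# F1 = 2 * Sn * Pr / (Sn + Pr), using the base-level values above.
python3 -c "sn, pr = 95.97, 29.39; print(round(2 * sn * pr / (sn + pr), 2))"
# prints 45.0
```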

For the subloci file - where all the transcripts are still present, but obvious chimeras have been split - the statistics should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log
7 reference RNAs in 5 genes
105 predicted RNAs in  26 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 95.96  29.24  44.83
            Exon level (stringent): 70.21  19.08  30.00
              Exon level (lenient): 88.89  32.00  47.06
                      Intron level: 94.74  46.75  62.61
                Intron chain level: 33.33  3.17  5.80
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 28.57  9.52  14.29
  Transcript level (>=80% base F1): 42.86  11.43  18.05
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 40.00  7.69  12.90
        Gene level (>=80% base F1): 60.00  11.54  19.35

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 3
             Matched intron chains: 2
   Matching monoexonic transcripts: 9
    Matched monoexonic transcripts: 1
        Total matching transcripts: 12
         Total matched transcripts: 3

          Missed exons (stringent): 14/47  (29.79%)
           Novel exons (stringent): 140/173  (80.92%)
            Missed exons (lenient): 5/45  (11.11%)
             Novel exons (lenient): 85/125  (68.00%)
                    Missed introns: 2/38  (5.26%)
                     Novel introns: 41/77  (53.25%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 24/105  (22.86%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 13/26  (50.00%)

A marked improvement can already be seen: we now have 105 transcripts instead of 93, and the total number of matching transcripts has increased from 1 to 3. Precision is still poor, however, as we have not discarded any transcript yet. Moreover, there is redundancy: 9 transcripts match the same monoexonic gene, and 3 transcripts match 2 intron chains in the reference. Finally, the comparison against the final output (mikado.loci.gff3) should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log
7 reference RNAs in 5 genes
15 predicted RNAs in  8 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 85.74  64.73  73.77
            Exon level (stringent): 63.83  42.86  51.28
              Exon level (lenient): 80.00  52.94  63.72
                      Intron level: 89.47  59.65  71.58
                Intron chain level: 33.33  14.29  20.00
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 28.57  13.33  18.18
  Transcript level (>=80% base F1): 42.86  20.00  27.27
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 40.00  25.00  30.77
        Gene level (>=80% base F1): 60.00  37.50  46.15

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 2
             Matched intron chains: 2
   Matching monoexonic transcripts: 1
    Matched monoexonic transcripts: 1
        Total matching transcripts: 3
         Total matched transcripts: 3

          Missed exons (stringent): 17/47  (36.17%)
           Novel exons (stringent): 40/70  (57.14%)
            Missed exons (lenient): 9/45  (20.00%)
             Novel exons (lenient): 32/68  (47.06%)
                    Missed introns: 4/38  (10.53%)
                     Novel introns: 23/57  (40.35%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 6/15  (40.00%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 2/8  (25.00%)

After selecting the best transcripts in each locus, Mikado has discarded most of the incorrect transcripts while retaining most of the correct information; this can be seen in the increase in precision at e.g. the base level (from ~29% to ~65%). The number of genes has also decreased, as Mikado has discarded many loci whose transcripts are just UTR fragments of neighbouring correct genes.

Analysing the tutorial data with Snakemake

The workflow described in this tutorial can be executed automatically using Snakemake [Snake] with this Snakefile. Just execute:

snakemake

in the directory where you have downloaded all of the tutorial files. In graph representation, this is how the pipeline looks:

_images/snakemake_dag.svg
Tutorial for Daijin

This tutorial will guide you through the task of configuring and running the whole Daijin pipeline on a Drosophila melanogaster dataset comprising two different samples, using one aligner (HISAT2) and two assemblers (StringTie and CLASS2) as the chosen methods. A modern desktop computer with a multicore processor and 4GB or more of RAM should suffice to execute the pipeline within two hours.

Warning

Please note that development of Daijin Assemble has now been discontinued. Daijin will be superseded by a different pipeline manager, which is currently in the works. We will continue to actively maintain the “mikado” part of the pipeline, which runs the steps from a set of input transcript assemblies and/or cDNA alignments to the final Mikado output.

Overview

The tutorial will guide you through the following tasks:

  1. Configuring Daijin to analyse the chosen data
  2. Running the alignment and assemblies using daijin assemble
  3. Running the Mikado analysis using daijin mikado
  4. Comparing the results against the reference transcriptome
Required software

Mikado should be installed and configured properly (see our installation instructions). Additionally, you should have the following software tools at your disposal (the version used at the time of writing is indicated in brackets):

  • DIAMOND (v0.8.22 or later)
  • Prodigal (v2.6.3 or later)
  • Portcullis (v1.0.2 or later)
  • HISAT2 (v2.0.4)
  • Stringtie (v1.2.4)
  • CLASS2 (v2.12)
  • SAMtools (v1.1 or later)
Input data

Throughout this tutorial, we will use data coming from EnsEMBL v89, and from the PRJEB15540 experiment on ENA. In particular, we will need:

  • the genome of D. melanogaster (BDGP6) in FASTA format;
  • the corresponding Ensembl release 89 annotation, in GTF format;
  • protein sequences from a related species, Aedes aegypti;
  • paired-end RNA-Seq reads for two samples of the PRJEB15540 study, ERR1662533 and ERR1662534.

Preparation of the input data

First of all, let us set up a folder with the reference data:

mkdir -p Reference;
cd Reference;
wget ftp://ftp.ensembl.org/pub/release-89/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.89.gtf.gz;
wget ftp://ftp.ensembl.org/pub/release-89/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.dna.toplevel.fa.gz;
wget "http://www.uniprot.org/uniprot/?sort=score&desc=&compress=yes&query=taxonomy:diptera%20NOT%20taxonomy:%22Drosophila%20(fruit%20flies)%20[7215]%22%20AND%20taxonomy:%22Aedes%20aegypti%22&fil=&format=fasta&force=yes" -O Aedes_aegypti.fasta.gz;
gunzip *gz;
cd ../;

The bash snippet above will create a “Reference” directory, then download the genome of D. melanogaster in FASTA format, the corresponding GTF annotation, and the protein sequences for Aedes aegypti. It will also decompress all the files.

It is possible to get a feel for the annotation of this species - the size of its genes and transcripts, the average number of exons per transcript, etc. - by using mikado util stats; just issue the following command:

mikado util stats Reference/Drosophila_melanogaster.BDGP6.89.gtf Reference/Drosophila_melanogaster.BDGP6.89.stats.txt 2> Reference/stats.err

These are the results:

Stat Total Average Mode Min 1% 5% 10% 25% Median 75% 90% 95% 99% Max
Number of genes 17559 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of genes (coding) 13898 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of monoexonic genes 4772 NA NA NA NA NA NA NA NA NA NA NA NA NA
Transcripts per gene 34740 1.98 1 1 1 1 1 1 1 2 4 5 10 75
Coding transcripts per gene 30308 1.73 1 0 0 0 0 1 1 2 4 5 10 69
CDNA lengths 90521553 2,605.69 22 20 58 322 526 988 1,888 3,373 5,361 7,025 11,955 71,382
CDNA lengths (mRNAs) 87158361 2,875.75 1023 108 379 571 748 1,247 2,147 3,628 5,636 7,312 12,444 71,382
CDS lengths 59968005 1,726.19 0 0 0 0 0 525 1,212 2,142 3,702 5,106 9,648 68,847
CDS lengths (mRNAs) NA 1,978.62 372 33 177 315 435 780 1,404 2,361 3,915 5,480 10,149 68,847
CDS/cDNA ratio NA 67.88 75.0 1 14 29 40 56 71 83 91 94 99 100
Monoexonic transcripts 5899 882.90 22 20 21 25 73 288 640 1,154 1,898 2,470 4,517 21,216
MonoCDS transcripts 4269 912.26 372 33 114 186 249 402 723 1,227 1,789 2,153 3,721 9,405
Exons per transcript 186414 5.37 1 1 1 1 1 2 4 7 11 15 24 82
Exons per transcript (mRNAs) 3715 5.93 2 1 1 1 2 3 4 8 12 15 26 82
Exon lengths NA 485.59 156 1 33 69 93 145 251 554 1,098 1,606 3,303 28,074
Exon lengths (mRNAs) NA 485.18 156 1 36 70 95 146 250 554 1,102 1,612 3,277 28,074
Intron lengths NA 1,608.99 61 2 52 55 58 63 102 742 3,482 7,679 25,852 257,022
Intron lengths (mRNAs) NA 1,597.85 61 23 52 55 58 63 102 744 3,472 7,648 25,529 257,022
CDS exons per transcript 2761 4.60 2 0 0 0 0 1 3 6 10 13 23 81
CDS exons per transcript (mRNAs) 2761 5.27 2 1 1 1 1 2 4 7 11 14 24 81
CDS exon lengths 59968005 375.21 156 1 12 48 73 123 197 407 848 1,259 2,523 27,705
CDS Intron lengths 165622620 1,278.75 60 22 51 54 56 61 81 525 2,503 5,629 21,400 257,021
5’UTR exon number 30308 1.52 1 0 0 1 1 1 1 2 2 3 4 13
3’UTR exon number 30308 1.11 1 0 1 1 1 1 1 1 1 2 4 29
5’UTR length 9330312 307.85 0 0 0 21 39 87 185 407 702 960 1,684 5,754
3’UTR length 17860044 589.28 3 0 3 45 68 126 299 724 1,398 2,079 4,048 18,497
Stop distance from junction NA 31.30 0 0 0 0 0 0 0 0 0 23 1,027 10,763
Intergenic distances NA 1,991.60 235 -1,472,736 -42,934 -6,222 -1,257 37 335 1,894 9,428 18,101 47,477 1,125,562
Intergenic distances (coding) NA 2,842.58 1 -351,626 -38,336 -4,836 -316 58 347 1,798 10,218 21,258 56,471 932,526

From this summary it is quite apparent that the D. melanogaster genome preferentially encodes multiexonic transcripts, which on average have ~30% of their sequence in UTRs. Introns are on average ~1.6 kbps long, with a very long maximum value of approximately 257 kbps. Genes are on average separated by 2-3 kbps stretches of intergenic sequence, although there is considerable variation, with over 25% of the genes overlapping one another.

Next, we download the reads that we will use for this example:

mkdir -p Reads;
cd Reads;
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/003/ERR1662533/ERR1662533_1.fastq.gz;
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/003/ERR1662533/ERR1662533_2.fastq.gz;
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/004/ERR1662534/ERR1662534_1.fastq.gz;
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/004/ERR1662534/ERR1662534_2.fastq.gz;
cd ../;

These files have a total file size of approximately 4GB, so they might take five to ten minutes to download, depending on your connection speed.

Step 1: configuring Daijin

The first task is to create a configuration file for Daijin using daijin configure. On the command line, we have to configure the following:

  • name of the configuration file (daijin.toml)
  • number of threads per process (2)
  • reference genome and the name of the species (without spaces or non-ASCII/special characters)
  • reads to use, with their strandedness and the name of the sample
  • aligners (HISAT2) and assemblers (Stringtie, CLASS2)
  • output directory (Dmelanogaster)
  • the scoring file to use for Mikado pick; we will ask for it to be copied in place so that we can have a look at it (dmelanogaster_scoring.yaml)
  • the protein database for homology searches for Mikado (Aedes_aegypti.fasta)
  • flank: as D. melanogaster has a relatively compact genome, we should decrease the maximum distance for grouping together transcripts. We will decrease from 1kbps (default) to 500.
  • (optional) the scheduler to use for the cluster (we will presume that the job is being executed locally)
  • (optional) name of the cluster configuration file, which will have to be edited manually.

First, we will create a sample sheet containing the information on the samples that we are going to use. This is a tab-delimited text file where each line defines a single sample, with up to 5 columns per line. The first three columns are mandatory, while the last two are optional. The columns are as follows:

  • Read1: required. Location of the left reads for the sample.

  • Read2: optional, location of the right reads for the sample if it is paired.

  • Sample: name of the sample. Required.

  • Strandedness: strandedness of the sample. It can be one of:

    • fr-unstranded (Unstranded data)
    • fr-firststrand (Stranded data, first read forward, second read reverse)
    • fr-secondstrand (Stranded data, second read forward, first read reverse)
    • f (Forward, single read only)
    • r (Reverse, single read only)
  • Long read sample: Boolean flag. If set to “True”, the sample will be treated as coming from a non-second-generation sequencing platform (e.g. Sanger ESTs or PacBio IsoSeq); its reads will therefore only be aligned, without performing any assembly.

For our example, therefore, the sample sheet will look like this:

Reads/ERR1662533_1.fastq.gz Reads/ERR1662533_2.fastq.gz     ERR1662533      fr-unstranded   False
Reads/ERR1662534_1.fastq.gz Reads/ERR1662534_2.fastq.gz     ERR1662534      fr-unstranded   False
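The columns must be separated by actual tab characters, which are easy to lose when copy-pasting. One reliable way to generate the file from the shell is with printf, which reuses its format string for each group of five arguments (a sketch, assuming the reads sit in the Reads/ directory created above):

```shell
# Each group of five arguments becomes one tab-separated line.
printf '%s\t%s\t%s\t%s\t%s\n' \
    Reads/ERR1662533_1.fastq.gz Reads/ERR1662533_2.fastq.gz ERR1662533 fr-unstranded False \
    Reads/ERR1662534_1.fastq.gz Reads/ERR1662534_2.fastq.gz ERR1662534 fr-unstranded False \
    > sample_sheet.tsv
```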

Write this into a text file called “sample_sheet.tsv”. Now we will configure Daijin for the run:

daijin configure --scheduler "local" \
     --scoring HISTORIC/dmelanogaster_scoring.yaml \
     --copy-scoring dmelanogaster_scoring.yaml \
     -m permissive --sample-sheet sample_sheet.tsv \
     --flank 500 -i 50 26000 --threads 2 \
     --genome Reference/Drosophila_melanogaster.BDGP6.dna.toplevel.fa \
     -al hisat -as class stringtie -od Dmelanogaster --name Dmelanogaster \
     -o daijin.toml --prot-db Reference/Aedes_aegypti.fasta;

This will create three files in the working directory:

  • daijin.toml, the main configuration file. This file is in TOML format.
  • dmelanogaster_scoring.yaml: this file is a copy of the scoring configuration file. Please refer to the dedicated section for details.
  • daijin_exe.yaml: this small configuration file contains the instructions to load the necessary software into the working environment. Ignore it if you are working on a local machine. If you are working within a cluster environment, please modify this file with the commands you would normally use in a cluster script to load the necessary software. For example, this is how the file looks on our own cluster system:
blast: 'source blast-2.3.0'
class: 'source class-2.12'
cufflinks: ''
gmap: ''
hisat: 'source HISAT-2.0.4'
mikado: 'source mikado-1.1'
portcullis: 'source portcullis-0.17.2'
samtools: 'source samtools-1.2'
star: ''
stringtie: 'source stringtie-1.2.4'
tophat: ''
transdecoder: 'source transdecoder-3.0.0'
trinity: ''

Important

If you are operating on a cluster rather than a local machine, you will need to specify the scheduler type. Currently we support SLURM, PBS and LSF. Add the following switch to the configure command above:

--scheduler <One of local, SLURM, PBS or LSF>

Adding this switch will also create a default cluster configuration file, daijin_hpc.yaml, specifying the resources per job and the submission queue. This is an example of how it appears on our system:

__default__:
    threads: 4
    memory: 10000
asm_trinitygg:
    memory: 20000
bam_sort:
    memory: 20000
Step 2: running the assemble part

Now that we have created a proper configuration file, it is time to launch Daijin assemble and inspect the results. Issue the command:

daijin assemble --cores <Number of maximum cores> daijin.toml

After checking that the configuration file is valid, Daijin will start the alignment and assembly of the dataset. On a normal desktop computer, this should take less than 2 hours. Before launching the pipeline, you can obtain a graphical representation of the steps with:

daijin assemble --dag daijin.toml | dot -Tsvg > assemble.svg
schematic diagram of the assembling pipeline

You can also ask Daijin to display the steps to be executed, inclusive of their command lines, by issuing the following command:

daijin assemble --dryrun daijin.toml

When Daijin is finished, have a look inside the folder Dmelanogaster/3-assemblies/output/; you will find the following GTF files:

  • Dmelanogaster/3-assemblies/output/class-0-hisat-ERR1662533-0.gtf
  • Dmelanogaster/3-assemblies/output/class-0-hisat-ERR1662534-0.gtf
  • Dmelanogaster/3-assemblies/output/stringtie-0-hisat-ERR1662533-0.gtf
  • Dmelanogaster/3-assemblies/output/stringtie-0-hisat-ERR1662534-0.gtf

These are standard GTF files reporting the assembled transcripts for each method. We can get a feel for how they compare with our reference annotation by, again, using mikado util stats. Conveniently, Daijin has already performed this analysis for us, and the files will be present in the same folder:

  • Dmelanogaster/3-assemblies/output/class-0-hisat-ERR1662533-0.gtf.stats
  • Dmelanogaster/3-assemblies/output/class-0-hisat-ERR1662534-0.gtf.stats
  • Dmelanogaster/3-assemblies/output/stringtie-0-hisat-ERR1662533-0.gtf.stats
  • Dmelanogaster/3-assemblies/output/stringtie-0-hisat-ERR1662534-0.gtf.stats

Daijin has also created a summary of these statistics in Dmelanogaster/3-assemblies/assembly.stats:

File genes monoexonic_genes transcripts transcripts_per_gene transcript_len_mean monoexonic_transcripts exons exons_per_transcript exon_len_mean
class-0-hisat-ERR1662533-0.gtf.stats 11535 852 13958 1.2 1429.86 852 57554 4.12 346.77
class-0-hisat-ERR1662534-0.gtf.stats 11113 884 13341 1.19 1448.07 884 55123 4.13 350.47
stringtie-0-hisat-ERR1662533-0.gtf.stats 15490 6252 18995 1.22 1668.17 6446 63798 3.36 496.67
stringtie-0-hisat-ERR1662534-0.gtf.stats 13420 4823 16604 1.23 1775.57 4985 58841 3.54 501.04

From this quick analysis, it looks like StringTie assembled many more transcripts than CLASS2, with a marked prevalence of monoexonic transcripts. StringTie transcripts also tend to be longer than those created by CLASS2, but with a slightly lower number of exons. We can assess the similarity of these assemblies to the reference annotation by using Mikado compare:

# First index the reference GFF3
mkdir -p Comparisons;
mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf --index -l Reference/index.log;
# Compare the CLASS2 assemblies against the reference
mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -l Comparisons/class_ERR1662533.log -o Comparisons/class_ERR1662533 -p Dmelanogaster/3-assemblies/output/class-0-hisat-ERR1662533-0.gtf;
mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -l Comparisons/class_ERR1662534.log -o Comparisons/class_ERR1662534 -p Dmelanogaster/3-assemblies/output/class-0-hisat-ERR1662534-0.gtf;
# Compare the StringTie assemblies against the reference:
mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -l Comparisons/stringtie_ERR1662534.log -o Comparisons/stringtie_ERR1662534 -p Dmelanogaster/3-assemblies/output/stringtie-0-hisat-ERR1662534-0.gtf;
mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -l Comparisons/stringtie_ERR1662533.log -o Comparisons/stringtie_ERR1662533 -p Dmelanogaster/3-assemblies/output/stringtie-0-hisat-ERR1662533-0.gtf;

The analysis will produce TMAP, REFMAP and STATS files for each of the assemblies. As an example, this is the statistics file for the StringTie assembly of the ERR1662533 sample:

Command line:
/usr/users/ga002/venturil/miniconda3/envs/py360/bin/mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -l Comparisons/stringtie_ERR1662533.log -o Comparisons/stringtie_ERR1662533 -p Dmelanogaster/3-assemblies/output/stringtie-0-hisat-ERR1662533-0.gtf
34740 reference RNAs in 17559 genes
18995 predicted RNAs in  15490 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 50.64  78.55  61.58
            Exon level (stringent): 26.12  43.71  32.70
              Exon level (lenient): 50.36  84.47  63.10
                      Intron level: 52.58  96.15  67.98
                Intron chain level: 25.85  53.17  34.79
      Transcript level (stringent): 0.01  0.01  0.01
  Transcript level (>=95% base F1): 13.28  22.47  16.69
  Transcript level (>=80% base F1): 22.03  33.58  26.61
         Gene level (100% base F1): 0.01  0.01  0.01
        Gene level (>=95% base F1): 22.01  24.51  23.19
        Gene level (>=80% base F1): 32.35  36.02  34.09

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 6675
             Matched intron chains: 8285
   Matching monoexonic transcripts: 583
    Matched monoexonic transcripts: 722
        Total matching transcripts: 7258
         Total matched transcripts: 9007

          Missed exons (stringent): 61670/83470  (73.88%)
           Novel exons (stringent): 28072/49872  (56.29%)
            Missed exons (lenient): 36316/73155  (49.64%)
             Novel exons (lenient): 6773/43612  (15.53%)
                    Missed introns: 28532/60166  (47.42%)
                     Novel introns: 1266/32900  (3.85%)

                Missed transcripts: 9917/34740  (28.55%)
                 Novel transcripts: 2058/18995  (10.83%)
                      Missed genes: 7560/17559  (43.05%)
                       Novel genes: 2023/15490  (13.06%)
Step 3: running the Mikado steps

Now that we have created the input assemblies, it is time to run Mikado. First of all, let us have a look at the dmelanogaster_scoring.yaml file, which we have conveniently copied to our current directory:

With this file, we are telling Mikado that we are looking for transcripts with three or more exons, a CDS/cDNA ratio of 80%, a complete ORF (“is_complete: true”), two 5’ UTR exons and one 3’ UTR exon, and the minimum possible amount of sequence contained within putative retained introns.
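In Mikado's scoring syntax, preferences like these are expressed through rescaling directives on transcript metrics. The fragment below is purely illustrative - a sketch of how the stated choices might be encoded, using metric names from the Mikado metrics documentation; the actual tutorial file contains more entries and tuned weights:

```yaml
scoring:
  exon_num: {rescaling: target, value: 3}                 # prefer around three exons
  selected_cds_fraction: {rescaling: target, value: 0.8}  # CDS/cDNA ratio of 80%
  is_complete: {rescaling: target, value: true}           # complete ORF
  five_utr_num: {rescaling: target, value: 2}             # two 5' UTR exons
  three_utr_num: {rescaling: target, value: 1}            # one 3' UTR exon
  retained_fraction: {rescaling: min}                     # minimise retained-intron sequence
```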

Using this file, and the configuration file created by the assemble step, it is now possible to run the Mikado part of the pipeline. Please note that we are now going to use the file Dmelanogaster/mikado.yaml.

It is possible to visualise the steps in this part of the pipeline in the following way:

daijin mikado --dag Dmelanogaster/mikado.yaml | dot -Tsvg > mikado.svg
_images/mikado_pipeline.png

Now issue the command:

daijin mikado Dmelanogaster/mikado.yaml

This part of the pipeline should be quicker than the previous stage. After the pipeline is finished, Daijin will have created the final output files in Dmelanogaster/5-mikado/pick/. As we requested only the permissive mode, there is a single output - Dmelanogaster/5-mikado/pick/mikado-permissive.loci.gff3. These are the basic statistics for this annotation:

File genes monoexonic_genes transcripts transcripts_per_gene transcript_len_mean monoexonic_transcripts exons exons_per_transcript exon_len_mean
mikado-permissive.loci.stats 11114 1608 14796 1.33 1968.19 1642 67262 4.55 432.95

Mikado has created an annotation that is in between those produced by StringTie and CLASS2. Compared to CLASS2, the Mikado annotation has approximately the same number of genes, but a higher number of transcripts. While the average cDNA length is similar to that of StringTie (~2 kbps), the average number of exons is higher, reflecting the sharp drop in the number of monoexonic transcripts and genes.

We can verify the correspondence of the Mikado annotation with the reference with the following command:

mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -l Comparisons/mikado.log -o Comparisons/mikado -p Dmelanogaster/5-mikado/pick/permissive/mikado-permissive.loci.gff3

A cursory look at the STATS file shows that Mikado was able to improve on the recall of either input method, while also improving the overall precision:

Command line:
/usr/users/ga002/venturil/miniconda3/envs/py360/bin/mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -l Comparisons/mikado.log -o Comparisons/mikado -p Dmelanogaster/5-mikado/pick/permissive/mikado-permissive.loci.gff3
34740 reference RNAs in 17559 genes
14796 predicted RNAs in  11114 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 50.98  90.99  65.34
            Exon level (stringent): 28.78  48.16  36.03
              Exon level (lenient): 54.03  86.24  66.44
                      Intron level: 56.96  95.42  71.34
                Intron chain level: 29.06  56.99  38.49
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 12.47  27.45  17.15
  Transcript level (>=80% base F1): 23.57  47.32  31.46
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 20.86  32.86  25.52
        Gene level (>=80% base F1): 34.35  54.10  42.02

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 7497
             Matched intron chains: 9137
   Matching monoexonic transcripts: 551
    Matched monoexonic transcripts: 657
        Total matching transcripts: 8048
         Total matched transcripts: 9794

          Missed exons (stringent): 59450/83470  (71.22%)
           Novel exons (stringent): 25860/49880  (51.84%)
            Missed exons (lenient): 33906/73755  (45.97%)
             Novel exons (lenient): 6357/46206  (13.76%)
                    Missed introns: 25893/60166  (43.04%)
                     Novel introns: 1646/35919  (4.58%)

                Missed transcripts: 10239/34740  (29.47%)
                 Novel transcripts: 200/14796  (1.35%)
                      Missed genes: 7904/17559  (45.01%)
                       Novel genes: 185/11114  (1.66%)

Moreover, Mikado models have an ORF assigned to them. We can ask Mikado compare to consider only the coding component of transcripts with the following command line:

mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -eu -l Comparisons/mikado_eu.log -o Comparisons/mikado_eu -p Dmelanogaster/5-mikado/pick/permissive/mikado-permissive.loci.gff3

The statistics file looks as follows:

Command line:
/usr/users/ga002/venturil/miniconda3/envs/py360/bin/mikado compare -r Reference/Drosophila_melanogaster.BDGP6.89.gtf -eu -l Comparisons/mikado_eu.log -o Comparisons/mikado_eu -p Dmelanogaster/5-mikado/pick/permissive/mikado-permissive.loci.gff3
34740 reference RNAs in 17559 genes
14796 predicted RNAs in  11114 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 58.05  95.34  72.16
            Exon level (stringent): 39.69  62.20  48.46
              Exon level (lenient): 58.02  87.76  69.86
                      Intron level: 61.73  96.41  75.27
                Intron chain level: 36.60  63.36  46.40
      Transcript level (stringent): 0.01  0.03  0.02
  Transcript level (>=95% base F1): 36.28  58.18  44.69
  Transcript level (>=80% base F1): 38.61  61.66  47.48
         Gene level (100% base F1): 0.02  0.04  0.03
        Gene level (>=95% base F1): 38.12  60.26  46.70
        Gene level (>=80% base F1): 40.56  64.12  49.69

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 8278
             Matched intron chains: 12123
   Matching monoexonic transcripts: 1011
    Matched monoexonic transcripts: 1551
        Total matching transcripts: 9289
         Total matched transcripts: 13674

          Missed exons (stringent): 41370/68594  (60.31%)
           Novel exons (stringent): 16541/43765  (37.80%)
            Missed exons (lenient): 26319/62698  (41.98%)
             Novel exons (lenient): 5075/41454  (12.24%)
                    Missed introns: 18890/49364  (38.27%)
                     Novel introns: 1136/31610  (3.59%)

                Missed transcripts: 10558/34740  (30.39%)
                 Novel transcripts: 325/14796  (2.20%)
                      Missed genes: 8109/17559  (46.18%)
                       Novel genes: 280/11114  (2.52%)

The similarity is markedly higher, suggesting that for many models the difference between the Mikado annotation and the reference lies in the UTR component.

When plotted, the advantage of using Mikado to refine the transcript assemblies is quite evident:

_images/daijin_result.png

We suggest visualising the assemblies with one of the many tools currently available, such as e.g. WebApollo [Apollo]. Mikado files are GFF3-compliant and can be loaded directly into Apollo or similar tools. GTF files can be converted into proper GFF3 files using the convert utility included in Mikado:

mikado util convert <input gtf> <output GFF3>
Tutorial: how to create a scoring configuration file

The current version of Mikado relies upon the experimenter to determine what desirable transcripts should look like, and how to prioritise among different possible isoforms. These instructions are provided through scoring configuration files, whose detailed description can be found in the general documentation. In this section, we will describe a general way to create your own configuration file.

The purpose of scoring files: introductory note

Mikado is an annotation pipeline, and as such, it does not have, at its core, the mission to represent every possible expressed transcript. Some sequenced transcripts may originate from transcriptional errors, exist only transiently or may have no functional role. When performing a sequencing experiment, we create a “snapshot” of the transient transcriptional state of the cell: inclusive of erroneous events, along with immature transcripts and artifacts arising from misassemblies, fragmentation during library construction, genomic DNA contamination, etc.

In Mikado, we make a choice regarding what transcripts we want to retain in our annotation (as is the case with any genome annotation). For example, as you will see in this tutorial, the standard configuration penalises transcripts with retained CDS introns, and transcripts that are NMD targets. This choice does not mean that we think that these transcripts are artifacts; rather, it signifies that we prioritise those transcripts that are more likely to represent functionally active genes. Annotators will differ on this point: for example, the human reference annotation retains and marks these events, rather than discarding them.

Mikado allows the experimenter to make these choices simply and transparently. Our pre-defined configuration files strive to replicate the choices made by annotators over the years, and thus allow you to replicate - as much as possible - the reference annotations of various species starting from partial sequencing data. However, if as an experimenter you are interested in a more stringent approach - say, only coding transcripts with a complete ORF and a UTR - or you would like instead to perform only a minimal clean-up - say, just discarding obvious NMD targets - Mikado will allow you to do so. This tutorial will show you how.

Obtaining pre-defined scoring files

Mikado comes with two pre-packaged scoring files: one for plant species, and a generic one for mammalian species. These can be found in the installation directory, under “configuration/scoring_files”; alternatively, when launching mikado configure, you can request to copy a template file to the working directory (the --copy-scoring command flag). In the rest of the tutorial, however, we will presume that no suitable scoring file exists and that a new one has to be created from scratch.

First step: defining transcript requirements

The first step in the process is to define the minimum attributes that a transcript must have in order to be retained in your annotation. For example, if the experimenter is interested only in coding transcripts and would like to ignore any non-coding RNA, this strict requirement would have to be encoded in this section. Transcripts that do not pass this preliminary filter are completely excluded from any subsequent analysis.

In general, this filter should be very gentle - being too stringent at this stage risks completely throwing out all the transcripts present in a given locus. It is rather preferable to strongly penalise dubious transcripts later at the scoring stage, so that they will be kept in the final annotation if no alternative is present, but discarded in most cases as there will be better alternatives.

The requirements block is composed of two sub-sections:

  • a parameters section, which details which metrics will have to be evaluated, and according to which operators.
  • an expression section, detailing how the various parameters have to be considered together.

As an example, let us imagine a quite stringent experiment in which we are interested only in transcripts that respect the following conditions:

  • minimum transcript length of 300 bps:

    • cdna_length: {operator: ge, value: 300}

  • if they are multiexonic, at least one of their junctions must be validated by a junction checker such as Portcullis:

    • exon_num: {operator: ge, value: 2}
    • verified_introns_num: {operator: gt, value: 0}

  • if they are multiexonic, their introns must be between 5 and 2000 bps long:

    • exon_num: {operator: ge, value: 2}
    • min_intron_length: {operator: ge, value: 5}
    • max_intron_length: {operator: le, value: 2000}

  • if they are monoexonic, they must be coding:

    • exon_num: {operator: eq, value: 1}
    • combined_cds_length: {operator: gt, value: 0}

Having defined the parameters, we can now put them together in an expression:

expression:
    cdna_length and ((exon_num and verified_introns_num and min_intron_length and max_intron_length) or (exon_num and combined_cds_length))

Notice that we have used a property twice (exon_num). This would confuse Mikado and cause an error. In order to tell the program that these are actually two different checks, we can append a suffix, separated by a mandatory dot sign:

  • exon_num.multi: {operator: ge, value: 2}
  • exon_num.mono: {operator: eq, value: 1}

Note

The suffix must be composed of ASCII alphanumeric characters and underscores. Therefore, the following suffixes are valid:

  • .mono
  • .mono1
  • .mono_first

The expression now becomes:

    cdna_length and ((exon_num.multi and verified_introns_num and min_intron_length and max_intron_length) or (exon_num.mono and combined_cds_length))

Warning

If no expression is provided, Mikado will create a default one by connecting all the parameters with an and. This makes life simpler in trivial cases (e.g. when we only have a couple of parameters to check). In complex, conditional scenarios like this one, however, it might well lead to discarding all transcripts!

Putting it all together, this is how the section in the configuration file would look like:

requirements:
    expression:
    - cdna_length and ((exon_num.multi and verified_introns_num and min_intron_length
    - and max_intron_length) or (exon_num.mono and combined_cds_length))
    parameters:
        cdna_length: {operator: ge, value: 300}
        exon_num.multi: {operator: ge, value: 2}
        verified_introns_num: {operator: gt, value: 0}
        min_intron_length: {operator: ge, value: 5}
        max_intron_length: {operator: le, value: 2000}
        exon_num.mono: {operator: eq, value: 1}
        combined_cds_length: {operator: gt, value: 0}

Warning

The example in this section is more stringent than the standard selection provided by the included scoring files.
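To make the mechanics of this section concrete, here is a toy Python re-implementation of how such a requirements block could be evaluated against a transcript's metrics. This is an illustrative sketch only: the function names, the suffix handling and the eval-based approach are our own invention for this example, not Mikado's actual code.

```python
import operator
import re

# Hypothetical toy re-implementation of the requirements filter, for
# illustration only: names and logic here are our own, not Mikado's code.
OPS = {"ge": operator.ge, "gt": operator.gt, "le": operator.le,
       "lt": operator.lt, "eq": operator.eq, "ne": operator.ne}

def passes_requirements(metrics, parameters, expression):
    """Check each parameter against the transcript's metrics, then combine
    the boolean results according to the expression string."""
    results = {}
    for name, check in parameters.items():
        # A dotted suffix (e.g. "exon_num.multi") marks a reuse of the same
        # metric; strip it to recover the real metric name.
        metric = name.split(".")[0]
        results[name.replace(".", "_")] = OPS[check["operator"]](
            metrics[metric], check["value"])
    # Dots are not valid in Python names, so rewrite them before evaluating.
    safe_expression = re.sub(r"\b(\w+)\.(\w+)\b", r"\1_\2", expression)
    return eval(safe_expression, {"__builtins__": {}}, results)

# The stringent first-step example above, expressed in this toy form:
parameters = {
    "cdna_length": {"operator": "ge", "value": 300},
    "exon_num.multi": {"operator": "ge", "value": 2},
    "verified_introns_num": {"operator": "gt", "value": 0},
    "min_intron_length": {"operator": "ge", "value": 5},
    "max_intron_length": {"operator": "le", "value": 2000},
    "exon_num.mono": {"operator": "eq", "value": 1},
    "combined_cds_length": {"operator": "gt", "value": 0},
}
expression = ("cdna_length and ((exon_num.multi and verified_introns_num and "
              "min_intron_length and max_intron_length) or "
              "(exon_num.mono and combined_cds_length))")

# A 500 bps coding monoexonic transcript passes the filter:
mono = {"cdna_length": 500, "exon_num": 1, "verified_introns_num": 0,
        "min_intron_length": 0, "max_intron_length": 0,
        "combined_cds_length": 300}
print(passes_requirements(mono, parameters, expression))  # True
```

Note how the same mono dictionary fails the multiexonic branch (1 exon is not >= 2) but passes the monoexonic one, which is exactly why the two branches are joined with an or.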

Second step: prioritising transcripts

After removing transcripts which are not good enough for our annotation, Mikado will analyse the remaining models and assign each a score. How to score models is explicitly left to the experimenter, to allow specific tailoring for each species. In our own experiments, we have abided by the following principles:

  1. Good transcripts should preferentially be protein coding, with homology to known proteins in other species, and sport both a start and a stop codon.
  2. Good coding transcripts should contain only one ORF, not multiple; if they have multiple, most of the CDS should be within the primary.
  3. The total length of the CDS should be within 60 and 80% of the transcript length, ideally (with the value changing by species, on the basis of available data).
  4. All else equal, good coding transcripts should have a long ORF, contain most of the coding bases in the locus, and have most of their introns between coding exons.
  5. All else equal, good transcripts should be longer and have more exons; however, there should be no preference between mono- and di-exonic transcripts.
  6. Good coding transcripts should have a defined UTR, on both sides; however, if the UTR goes beyond a certain limit, the transcript should be penalised instead. For the 5’UTR, we preferentially look at transcripts with at most three UTR exons, and preferentially two, for a total length of ideally 100 bps and at most 2500. For the 3’UTR, based on the literature and the phenomenon of nonsense-mediated decay (NMD), we look for transcripts with at most two UTR exons and ideally one; the total length of this UTR should ideally be 200 bps, and at most 2500.
  7. Multiexonic transcripts should have at least some of their junctions confirmed by Portcullis, ideally all of them. Ideally and all else equal, they should contain all of the verified junctions in the locus.
  8. The distance between the stop codon and the last junction in the transcript should be the least possible, and in any case, not exceed 55 bps (as discovered by studies on NMD).

The first step is to associate each of these requirements with the proper metric. In order:

  1. Good transcripts should preferentially be protein coding, with a good BLAST coverage of homologous proteins, and sport both a start and a stop codon:

    • snowy_blast_score: look for the maximum value
    • is_complete: look for true
    • has_start_codon: look for true
    • has_stop_codon: look for true

Looking at the documentation on scoring files, we can write it down thus:

- snowy_blast_score: {rescaling: max}
- is_complete: {rescaling: target, value: true}
- has_start_codon: {rescaling: target, value: true}
- has_stop_codon: {rescaling: target, value: true}

Applying the same procedure to the rest of the conditions:

  2. Good coding transcripts should contain only one ORF, not multiple; if they have multiple, most of the CDS should be within the primary.

    • number_internal_orfs: look for a target of 1
    • cds_not_maximal: look for the minimum value
    • cds_not_maximal_fraction: look for the minimum value
- number_internal_orfs: {rescaling: target, value: 1}
- cds_not_maximal: {rescaling: min}
- cds_not_maximal_fraction: {rescaling: min}
  3. The total length of the CDS should be within 60 and 80% of the transcript length, ideally (with the value changing by species, on the basis of available data).

    • selected_cds_fraction: look for a target of x (where x depends on the species and is between 0 and 1), for example, let us set it to 0.7
- selected_cds_fraction: {rescaling: target, value: 0.7}
  4. All else equal, good coding transcripts should have a long ORF, contain most of the coding bases in the locus, and have most of their introns between coding exons.

    • cdna_length: look for the maximum value
    • selected_cds_length: look for the maximum value
    • selected_cds_intron_fraction: look for the maximum value
- selected_cds_length: {rescaling: max}
- selected_cds_intron_fraction: {rescaling: max}
  5. All else equal, good transcripts should be longer and have more exons; however, there should be no preference between mono- and di-exonic transcripts.

    • cdna_length: look for the maximum value
    • exon_num: look for the maximum value, ignoring any transcript with one or two exons.
- cdna_length: {rescaling: max}
- exon_num: {rescaling: max, filter: {operator: ge, value: 3}}
  6. Good coding transcripts should have a defined UTR, on both sides; however, if the UTR goes beyond a certain limit, the transcript should be penalised instead.

    • For 5’UTR, we preferentially look at transcripts with at most three UTR exons, and preferentially two, for a total length of ideally 100 bps and maximally of 2500.

      • five_utr_num: look for a target of 2, ignore anything with four or more 5’ UTR exons
      • five_utr_length: look for a target of 100, ignore anything longer than 2500 bps
    • For 3’UTR, based on literature and the phenomenon of nonsense mediated decay (NMD), we look for transcripts with at most two UTR exons and ideally one; the total length of this UTR should be ideally of 200 bps, and at most of 2500.

      • three_utr_num: look for a target of 1, ignore anything with three or more 3’UTR exons
      • three_utr_length: look for a target of 200, ignore anything with 2500 bps or more
- five_utr_num: {rescaling: target, value: 2, filter: {operator: lt, value: 4}}
- five_utr_length: {rescaling: target, value: 100, filter: {operator: le, value: 2500}}
- three_utr_num: {rescaling: target, value: 1, filter: {operator: lt, value: 3}}
- three_utr_length: {rescaling: target, value: 200, filter: {operator: lt, value: 2500}}
  7. Multiexonic transcripts should have at least some of their junctions confirmed by Portcullis, ideally all of them. Ideally and all else equal, they should contain most of the verified junctions in the locus.

    • proportion_verified_introns_inlocus: look for the maximum value
    • non_verified_introns_num: look for the minimum value
- proportion_verified_introns_inlocus: {rescaling: max}
- non_verified_introns_num: {rescaling: min}
  8. The distance between the stop codon and the last junction in the transcript should be the least possible, and in any case, not exceed 55 bps (as discovered by studies on NMD).

    • end_distance_from_junction: look for the minimum value, discard anything over 55
- end_distance_from_junction: {rescaling: min, filter: {operator: lt, value: 55}}

Putting everything together:

scoring:
    - snowy_blast_score: {rescaling: max}
    - is_complete: {rescaling: target, value: true}
    - has_start_codon: {rescaling: target, value: true}
    - has_stop_codon: {rescaling: target, value: true}
    - number_internal_orfs: {rescaling: target, value: 1}
    - cds_not_maximal: {rescaling: min}
    - cds_not_maximal_fraction: {rescaling: min}
    - selected_cds_fraction: {rescaling: target, value: 0.7}
    - selected_cds_length: {rescaling: max}
    - selected_cds_intron_fraction: {rescaling: max}
    - cdna_length: {rescaling: max}
    - exon_num: {rescaling: max, filter: {operator: ge, value: 3}}
    - five_utr_num: {rescaling: target, value: 2, filter: {operator: lt, value: 4}}
    - five_utr_length: {rescaling: target, value: 100, filter: {operator: le, value: 2500}}
    - three_utr_num: {rescaling: target, value: 1, filter: {operator: lt, value: 3}}
    - three_utr_length: {rescaling: target, value: 200, filter: {operator: lt, value: 2500}}
    - proportion_verified_introns_inlocus: {rescaling: max}
    - non_verified_introns_num: {rescaling: min}
    - end_distance_from_junction: {rescaling: min, filter: {operator: lt, value: 55}}
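As a mental model for the rescaling directives used above, consider the following simplified sketch. It is our own illustration, not Mikado's implementation (which also applies filters and multipliers, as described in the scoring-file documentation); it only shows how the raw values of one metric, across the transcripts of a locus, could be turned into scores between 0 and 1.

```python
def rescale(values, rescaling, target=None):
    """Turn the raw values of one metric, across the transcripts of a
    locus, into scores between 0 and 1 (toy version)."""
    if rescaling == "target":
        # best score for the value closest to the target
        distances = [abs(v - target) for v in values]
        worst = max(distances) or 1          # guard against all-equal values
        return [1 - d / worst for d in distances]
    lo, hi = min(values), max(values)
    if hi == lo:                             # all transcripts tie
        return [1.0] * len(values)
    if rescaling == "max":                   # reward the highest value
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]  # "min": reward the lowest

# cDNA lengths of three transcripts, with "rescaling: max":
print(rescale([300, 600, 900], "max"))                 # [0.0, 0.5, 1.0]
# 5'UTR lengths with "rescaling: target" and a target of 200:
print(rescale([100, 200, 400], "target", target=200))  # [0.5, 1.0, 0.0]
```

The key point is that scores are relative to the other transcripts in the same locus: the same cDNA length can score 1.0 in one locus and 0.2 in another.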
Third step: defining acceptable alternative splicing events

After selecting a primary transcript for the locus, we have to define what would make a transcript inherently unacceptable as an alternative splicing event. This is done in a similar way to how we defined the minimal requirements for all transcripts.

Warning

Keep in mind that this section defines the inherent requirements. Relative requirements, such as acceptable class codes, percentage of the score of the primary transcript, etc., are defined in the general configuration file, under the “alternative_splicing” section. By default, we also control whether to accept or refuse retained intron events there, rather than here.

Throughout our experiments, we have kept this section quite gentle; potential candidates are discarded more for their relationship to the primary transcript (class code, score percentage, etc.) than for some inherent defect. These are the requirements we generally used:

  • Minimum cDNA length of 200
  • Combined UTR length less than 2500 bps
  • No suspicious splicing events (i.e. junctions that would be canonical if transferred to the opposite strand)
as_requirements:
  expression: [cdna_length and three_utr_length and five_utr_length and utr_length and suspicious_splicing]
  parameters:
    cdna_length: {operator: ge, value: 200}
    utr_length: {operator: le, value: 2500}
    five_utr_length: {operator: le, value: 2500}
    three_utr_length: {operator: le, value: 2500}
    suspicious_splicing: {operator: ne, value: true}
Fourth step: defining potential fragments

The final step in the selection process is to detect and discard potential transcript fragments present in the neighbourhood of good loci. Usually these originate from mis-mappings or polymerase run-ons, and can be easily identified “by eye” as short, non- or minimally-coding transcripts near better-looking loci. Mikado will use the requirements defined in this section to identify such spurious loci, and discard them.

Note

The maximum distance between loci, for them to be considered for this step, is defined in the general configuration file by the “flank” parameter. Any locus beyond this distance will not be evaluated as a potential fragment.
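The role of the flank can be pictured with a small sketch. This is assumed, simplified logic based on the description above, not Mikado's internal code; the function name and interval representation are ours.

```python
def considered_as_fragment_candidate(good_locus, neighbour, flank):
    """Both loci are (start, end) pairs on the same chromosome. A
    neighbouring locus is only evaluated as a potential fragment when it
    lies within `flank` bps of the good locus (toy illustration)."""
    g_start, g_end = good_locus
    n_start, n_end = neighbour
    if n_end >= g_start and n_start <= g_end:
        return True                          # overlapping: distance is zero
    distance = max(g_start - n_end, n_start - g_end)
    return distance <= flank

# A short locus 600 bps downstream of a good gene, with a flank of 1000:
print(considered_as_fragment_candidate((10000, 15000), (15600, 15900), 1000))  # True
```

A locus further away than the flank is never evaluated against the not_fragmentary rules, no matter how fragment-like it looks.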

For our experiments, in general, this is how we defined potential fragments:

  • If multiexonic:

    • Shorter than 300 bps
    • Or with an ORF shorter than 300 bps
  • If monoexonic:

    • Non-coding and without any BLAST homology
    • Coding with an ORF lower than 600 bps

In the format understood by Mikado:

not_fragmentary:
    expression: [((exon_num.multi and (cdna_length.multi or selected_cds_length.multi)), or, (exon_num.mono and ((snowy_blast_score and selected_cds_length.zero)  or selected_cds_length.mono)))]
    parameters:
        selected_cds_length.zero: {operator: gt, value: 300} # 600
        exon_num.multi: {operator: gt, value: 1}
        cdna_length.multi: {operator: ge, value: 300}
        selected_cds_length.multi: {operator: gt, value: 250}
        exon_num.mono: {operator: eq, value: 1}
        snowy_blast_score: {operator: gt, value: 0}  # 0.3
        selected_cds_length.mono: {operator: gt, value: 600} # 900
Pointing Mikado at the new configuration file

When the new scoring file is complete, we can point Mikado pick at it in two ways:

  • Either transiently, with the “--scoring-file” switch, followed by the file name.
  • Or in the configuration file for the project, by putting the file name under the pick/scoring_file field.

When Mikado pick is launched, it will check - before starting - that the scoring file is valid. Common mistakes include:

  • Using a metric which does not exist.
  • Using an invalid combination of “operator”, “value” and “rescaling” parameters; for example, using a value of “true” with “gt”, i.e. “greater than” (see the section on operators).
  • Using an invalid connector in the “expression” statements: only “and”, “or”, “xor”, “not” and brackets are accepted (see the requirements section)

Mikado should emit an error that will help you understand how to correct the issue.

Adapting Mikado to specific use cases

Although Mikado provides generally sane defaults for most projects and species, one of its key advantages is its flexibility, which allows it to be tailored to the needs of various projects. In this section we provide an overview of how to adapt the workflow to specific use cases.

In general, this tailoring can be performed in these sections of the workflow:

  • when launching mikado configure, to set up influential parameters such as the flank clustering distance or the expected intron size range.
  • by modifying directly the resulting configuration file.
  • by modifying the scoring file.
Case study 1: adapting Mikado to your genome of interest

When adapting Mikado to a new species, some of the most important factors to be considered are:

  • how compact is the species’ genome - i.e., what is the expected genomic distance between two neighbouring genes?
  • what is the expected intron size range?
  • what is the expected UTR/CDS ratio, i.e., will transcripts generally have short UTR sections (e.g. Arabidopsis thaliana, with its average ~80% coding section) or will they instead have long, multiexonic UTRs (as is the case in e.g. Homo sapiens)?
  • will the organism mostly possess multi-exonic, long genes with many splicing variants (as is the case for mammals, e.g. our own species) or does it instead harbour mostly short transcripts with a low number of exons - potentially, even, mostly monoexonic (as is the case for many fungi, e.g., Saccharomyces cerevisiae)?

A good starting point for understanding how to answer these questions is to head over to EnsEMBL and look for an annotated species sufficiently similar to the one under examination. For example, let us say that we want to annotate a yeast in the same family as S. pombe. We can fetch the S. pombe GFF3 annotation from the EnsEMBL FTP download page and start analysing it. To this end, we can use mikado util stats:

mikado util stats Schizosaccharomyces_pombe.ASM294v2.38.gff3.gz Schizosaccharomyces_pombe.ASM294v2.38.stats

which returns the following:

Stat Total Average Mode Min 1% 5% 10% 25% Median 75% 90% 95% 99% Max
Number of genes 7268 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of genes (coding) 5145 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of monoexonic genes 4715 NA NA NA NA NA NA NA NA NA NA NA NA NA
Transcripts per gene 7269 1.00 1 1 1 1 1 1 1 1 1 1 1 2
Coding transcripts per gene 5146 0.71 1 0 0 0 0 0 1 1 1 1 1 2
CDNA lengths 12610347 1,734.81 72 47 72 83 306 846 1,504 2,322 3,370 4,103 5,807 15,022
CDNA lengths (mRNAs) 10592917 2,058.48 1315 75 283 546 753 1,201 1,792 2,613 3,675 4,425 6,404 15,022
CDS lengths 7178717 987.58 0 0 0 0 0 0 750 1,461 2,308 3,008 4,944 14,775
CDS lengths (mRNAs) NA 1,395.01 354;750 75 201 307 387 669 1,137 1,752 2,661 3,420 5,502 14,775
CDS/cDNA ratio NA 67.48 100.0 5 15 28 37 53 70 84 93 100 100 100
Monoexonic transcripts 4715 1,656.09 72 47 71 74 119 695 1,413 2,276 3,386 4,124 5,950 14,362
MonoCDS transcripts 2753 1,485.99 375;432;4002 93 218 332 418 732 1,191 1,806 2,849 3,784 5,875 14,154
Exons per transcript 12633 1.74 1 1 1 1 1 1 1 2 3 4 7 16
Exons per transcript (mRNAs) 3081 2.03 1 1 1 1 1 1 1 3 4 5 7 16
Exon lengths NA 998.21 72 2 25 62 79 171 538 1,474 2,522 3,298 5,002 14,362
Exon lengths (mRNAs) NA 1,013.39 106 3 24 59 86 175 499 1,516 2,605 3,414 5,117 14,362
Intron lengths NA 83.69 49 1 17 38 41 46 56 85 162 226 411 2,526
Intron lengths (mRNAs) NA 84.16 46 1 36 39 41 46 56 86 162 227 412 2,526
CDS exons per transcript 2091 1.41 1 0 0 0 0 0 1 2 3 4 7 16
CDS exons per transcript (mRNAs) 2091 1.99 1 1 1 1 1 1 1 3 4 5 7 16
CDS exon lengths 7178717 702.76 69;99 1 8 29 49 106 307 990 1,797 2,474 4,370 14,154
CDS Intron lengths 414501 81.77 44;45 0 35 38 40 45 55 84 157 220 395 2,525
5’UTR exon number 5146 0.95 1 0 0 0 1 1 1 1 1 1 2 3
3’UTR exon number 5146 0.93 1 0 0 0 1 1 1 1 1 1 2 3
5’UTR length 1372304 266.67 0 0 0 0 17 71 154 309 586 932 1,935 4,397
3’UTR length 2041896 396.79 0 0 0 0 46 126 243 441 865 1,386 2,644 5,911
Stop distance from junction NA 7.58 0 0 0 0 0 0 0 0 0 0 23 3,385
Intergenic distances NA -60.33 -66 -9,461 -3,753 -2,180 -1,382 -161 64 365 842 1,279 2,698 31,961
Intergenic distances (coding) NA 297.72 -66 -7,815 -3,477 -1,598 -440 -32 176 600 1,302 1,913 3,924 78,421

From this table we can already see the following:

  • Most genes (4715 out of 7268, or 64.9%) are monoexonic
  • The average and modal intergenic distance between genes are very small, with almost half of the recorded distances being negative - indicating that most genes are actually overlapping.
  • Only a very small handful of genes (less than 1%) are annotated as having more than one splicing isoform
  • On average, UTRs occupy 33% of the length of coding transcripts (the CDS/cDNA ratio is 67%, on average), but most transcripts actually lack a UTR altogether (modal ratio of 100%).
  • 98% of the introns have a length between 36 and 412 bps.

On the basis of this information, we can now start to customize the behaviour of Mikado for the species.
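As an aside, the "acceptable" intron range used below was read directly off the 1% and 99% columns of the table; with a list of intron lengths at hand, one could compute such a range directly. The sketch below uses made-up toy data rather than the real annotation, and a simple sorted-index percentile rather than mikado util stats' own calculation.

```python
def percentile_range(lengths, low_pct=0.01, high_pct=0.99):
    """Return the (low, high) percentile bounds of a list of intron
    lengths, using simple sorted-index percentiles (toy version)."""
    ordered = sorted(lengths)
    n = len(ordered)
    low = ordered[max(0, int(n * low_pct))]
    high = ordered[min(n - 1, int(n * high_pct))]
    return low, high

# Toy intron lengths; a real analysis would collect them from the GFF3.
introns = [40, 42, 45, 46, 50, 55, 60, 80, 120, 200, 400, 2500]
print(percentile_range(introns))  # (40, 2500)
```

Trimming to the 1%-99% range, rather than using the absolute minimum and maximum, keeps a handful of annotation outliers from inflating the intron-size bounds passed to Mikado.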

Creating the scoring file

The first step in the process is for us to create a scoring file, following the tutorial on the subject. We will call it “spombe.yaml”; as detailed in the tutorial linked above, we will write it in the textual YAML format.

Following the indications above and those in the tutorial, we should make the following changes in terms of priority for transcripts:

  • we should favour monoexonic transcripts
  • we should favour transcripts with a UTR ratio under 33%
  • we should allow at most one UTR exon on each side, targeting 0 (most transcripts are monoexonic and have their UTR contained in the same exon as the ORF)
  • the distance of the stop codon from the nearest junction should be 0 (again, this follows from having mostly monoexonic transcripts).

The scoring section would therefore end up looking like this:

scoring:
      snowy_blast_score: {rescaling: max}
      cdna_length: {rescaling: max}
      cds_not_maximal: {rescaling: min}
      cds_not_maximal_fraction: {rescaling: min}
      exon_num: {
        rescaling: target,
        value: 1
      }
      five_utr_length:
        filter: {operator: le, value: 2500}
        rescaling: target
        value: 100
      five_utr_num:
        filter: {operator: lt, value: 2}
        rescaling: target
        value: 0
      end_distance_from_junction:
        filter: {operator: lt, value: 23}
        rescaling: min
      highest_cds_exon_number: {rescaling: target, value: 1}
      intron_fraction: {rescaling: min}
      is_complete: {rescaling: target, value: true}
      number_internal_orfs: {rescaling: target, value: 1}
      non_verified_introns_num: {rescaling: min}
      # proportion_verified_introns_inlocus: {rescaling: max}
      retained_fraction: {rescaling: min}
      retained_intron_num: {rescaling: min}
      selected_cds_fraction: {rescaling: target, value: 1, filter: {operator: gt, value: 0.7 }}
      # selected_cds_intron_fraction: {rescaling: max}
      selected_cds_length: {rescaling: max}
      selected_cds_num: {rescaling: target, value: 1}
      three_utr_length:
        filter: {operator: le, value: 2500}
        rescaling: target
        value: 200
      three_utr_num:
        filter: {operator: lt, value: 2}
        rescaling: target
        value: 0
      combined_cds_locus_fraction: {rescaling: max}

Now that we have codified the scoring part, the next step is to determine the requirements regarding the transcripts that should be accepted into our annotation. Given the simplicity of the organism, we can satisfy ourselves with the following two requirements:

  • No transcript should be shorter than 75 bps (minimum length for coding transcripts)
  • No transcript should have an intron longer than ~2600 bps (in the annotation the maximum is 2,526); we can be slightly more permissive here and set the limit at 3,000 bps.

This will yield the following, very simple requirements section:

requirements:
    expression:
        - cdna_length and max_intron_length
    parameters:
        cdna_length: {operator: ge, value: 75}
        max_intron_length: {operator: lt, value: 3000}
Modifying the general configuration file

The second step in the customization process is to personalize the general configuration. On the basis of what we know of S. pombe, we have to intervene here in the following way:

  • set the intron range: per above, a reasonable setting should be 36-412.
  • set the clustering flank: given the very compact size of the genome, we should aim for something very small - 50 bps is probably plenty.
  • given the very compact size of the genome and the general lack of splicing, it is also advised to set Mikado to split any chimeric transcripts - the chances are very, very high that any such occurrence is artifactual.
  • make alternative splicing calling a very rare occurrence

First of all, we will download our genome in a single FASTA file (genome.fasta). We can then point Daijin to our newly created scoring file and generate the project configuration:

daijin configure --scoring spombe.yaml \
    --flank 50 \
    -i 36 412 \
    -m split \
    -o configuration.yaml \
    --genome genome.fasta

Once the configuration file has been created, we have to perform another couple of modifications, to make Mikado more stringent in terms of alternative splicing events. Look for the section mikado/pick. Here we can do the following:

  1. If you are completely uninterested in alternative splicing events, you can just set the “report” flag to false. This will disable AS calling completely.

  2. If you want to still report AS events but at a far lower rate, you can:

    • reduce the maximum number of isoforms reported: from 5 to 2, for example. Note: reducing this number to 1 will have the same effect as disabling AS calling completely.
    • restrict the types of AS events we call (see the class code section for more details). We can for example restrict the calling to “j” and “G”, and potentially add “g” (i.e. consider as a valid alternative splicing event for a multiexonic transcript a monoexonic one).
    • increase the minimum score percentage that an AS event must reach to be reported, to extremely high values (such as 0.9 to 0.99). This will ensure that only a small number of isoforms will be called.
    • increase the minimum cDNA/CDS overlap between the AS events and the primary transcript. This cannot go up to 100% for both, otherwise no AS event will ever be reported. However, you could for example set the CDS overlap to 100%, if you are only interested in alternative UTR splicing.
    • leave the “keep_retained_introns” field as false, and “only_confirmed_introns” field as “true”.
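Put together, the relevant section of the configuration could end up looking like the fragment below. The field names follow those mentioned above, but you should double-check the configuration file generated by daijin configure for the exact spelling in your version of Mikado; the values shown are merely the illustrative choices discussed here.

```yaml
pick:
  alternative_splicing:
    report: true                 # set to false to disable AS calling outright
    max_isoforms: 2              # down from the default of 5
    valid_ccodes: [j, G, g]      # restrict the acceptable AS class codes
    min_score_percentage: 0.95   # only near-top-scoring isoforms survive
    keep_retained_introns: false
    only_confirmed_introns: true
```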

Once these modifications have been made, Mikado is ready to be run.

Case study 2: noisy RNA-Seq data

With RNA-Seq, a relatively common happenstance is the presence of noise in the data - either experimentally, through the presence of pre-mRNA, genomic contamination, or otherwise erroneous transcripts; or from computational artifacts, e.g. an explicit choice on the part of the experimenter to retrieve from the data even isoforms and loci with little coverage support, in an attempt to boost the sensitivity of the analysis at the cost of decreased precision.

In such instances, it might make sense to make Mikado more stringent than usual. In this tutorial we will focus on the following:

  • Making Mikado more aggressive in filtering out putative fragments
  • Making Mikado more aggressive in splitting chimeric transcripts
  • Making Mikado more aggressive in filtering out incorrect alternative splicing events such as retained introns

For ease of discussion, we will suppose that we are working in a species similar in features to D. melanogaster. We will, therefore, be using a copy of the dmelanogaster_scoring.yaml file included in the distribution of Mikado.

Modifying the general configuration file and obtaining a copy of the original template

Before touching the scoring file, this time we will call the Daijin configurator in order to obtain a copy of the original D. melanogaster scoring file. We will suppose that we have relevant proteins in “proteins.fasta” (e.g. a dataset assembled from SwissProt), and that - as for D. melanogaster - the acceptable intron size range is between 50 and 26000 bps. As the data is quite noisy, we have to expect fragments derived from mis-alignments or genomic contamination; we will, therefore, enlarge the normal flanking area to 2000 bps. This will allow Mikado to catch more of these events when checking for potential fragments in the neighbourhood of good loci. Regarding probable chimeric events, we will be quite aggressive: we will split any chimeric event which is not supported by a good BLAST hit against the database (“-m permissive”).

daijin configure \
    --scoring dmelanogaster_scoring.yaml --copy-scoring noisy.yaml  \
    --flank 2000 \
    -i 50 26000 \
    -m permissive \
    -o configuration.yaml \
    --genome genome.fasta \
    --prot-db proteins.fasta

Once created, the configuration file should be modified as follows:

  • in the pick/alternative_splicing section:

    • increase the stringency for calling an alternative splicing event:

      • min_score_percentage: from 0.5 to 0.75
      • max_isoforms: from 5 to 3
  • in the pick/fragments section:

    • add “I” (multi-exonic and within an intron of the reference locus) to the list of valid_class_codes

Please note that, by default, Mikado requires any intron of an alternative splicing event that is not shared with the primary transcript to be confirmed externally. Also, it will exclude any transcript with retained introns. We should keep these options at their default values, as they already contribute significantly to reducing the number of spurious splicing events.

Customising the scoring file

Looking at the scoring section of the file, we do not need to change anything here - the predefined definitions will already reward coding, homology-supported transcripts.

Default scoring for D. melanogaster

scoring:
  snowy_blast_score: {rescaling: max}
  cdna_length: {rescaling: max}
  cds_not_maximal: {rescaling: min}
  cds_not_maximal_fraction: {rescaling: min}
  # exon_fraction: {rescaling: max}
  exon_num: {
    rescaling: max,
    filter: {
    operator: ge,
    value: 3}
  }
  five_utr_length:
    filter: {operator: le, value: 2500}
    rescaling: target
    value: 100
  five_utr_num:
    filter: {operator: lt, value: 4}
    rescaling: target
    value: 2
  end_distance_from_junction:
    filter: {operator: lt, value: 55}
    rescaling: min
  highest_cds_exon_number: {rescaling: max}
  intron_fraction: {rescaling: max}
  is_complete: {rescaling: target, value: true}
  number_internal_orfs: {rescaling: target, value: 1}
  # proportion_verified_introns: {rescaling: max}
  non_verified_introns_num: {rescaling: min}
  proportion_verified_introns_inlocus: {rescaling: max}
  retained_fraction: {rescaling: min}
  retained_intron_num: {rescaling: min}
  selected_cds_fraction: {rescaling: target, value: 0.8}
  selected_cds_intron_fraction: {rescaling: max}
  selected_cds_length: {rescaling: max}
  selected_cds_num: {rescaling: max}
  three_utr_length:
    filter: {operator: le, value: 2500}
    rescaling: target
    value: 200
  three_utr_num:
    filter: {operator: lt, value: 3}
    rescaling: target
    value: 1
  combined_cds_locus_fraction: {rescaling: max}

We can and should, however, modify the minimum requirements for transcripts in general, for alternative splicing events, and for not considering a given locus as a putative fragment.

First off, for the minimum requirements, we will tweak the requirements in this way:

  • discard any multiexonic transcript without verified introns. Normally we would discard such transcripts only if there are verified introns in the region. In this case, we would like to get rid of these transcripts altogether:

    • verified_introns_num: {operator: gt, value: 0}

    • If we would like to be really stringent, we could instead exclude any transcript with any amount of non-verified introns:

      • non_verified_introns_num: {operator: eq, value: 0}
  • discard any transcript with suspicious splicing events (ie splicing events that would be canonical if transferred to the opposite strand):

    • suspicious_splicing: {operator: eq, value: false}
  • let us also be more stringent on the maximum intron length, and decrease it from the permissive 150,000 to a much more stringent 30,000 (slightly higher than the 26,000 used for the “acceptable” intron range, above).

    • max_intron_length: {operator: le, value: 30000}
  • discard any monoexonic transcript without a CDS. This is more stringent than the default setting (where we keep non-coding monoexonic transcripts that have some homology to a protein in the supplied database).

    • selected_cds_length.mono: {operator: gt, value: 0}

Altogether, this becomes:

requirements:
    expression:
        - ((exon_num.multi and (cdna_length.multi or selected_cds_length.multi)
        - and
        - max_intron_length and min_intron_length and verified_introns_num and suspicious_splicing)
        - or
        - (exon_num.mono and selected_cds_length.mono)))
    parameters:
        selected_cds_length.mono: {operator: gt, value: 300}
        cdna_length.multi: {operator: ge, value: 400}
        selected_cds_length.multi: {operator: gt, value: 200}
        exon_num.mono: {operator: eq, value: 1}
        exon_num.multi: {operator: gt, value: 1}
        max_intron_length: {operator: le, value: 30000}
        min_intron_length: {operator: ge, value: 20}
        verified_introns_num: {operator: gt, value: 0}
        suspicious_splicing: {operator: eq, value: false}

We should also adapt the requirements for alternative splicing events. Compared with the default settings, we can now remove the “suspicious_splicing” requirement - it is already present in the general requirements for a transcript, so it will never be invoked. However, we will make certain that no transcript with more than one ORF will ever be selected as an alternative splicing event: these transcripts are often generated by retained intron events or trans-splicing. It should be a rare event, but by putting a requirement here, we will ensure that no transcript of this kind will be brought back as ASE.
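To make this concrete, the modified section could look like the following sketch. The expression mirrors the syntax used elsewhere in the scoring file; the UTR and cDNA thresholds shown are illustrative, not prescriptive:

```yaml
as_requirements:
  expression: [cdna_length and three_utr_length and five_utr_length and number_internal_orfs]
  parameters:
    cdna_length: {operator: ge, value: 200}
    three_utr_length: {operator: le, value: 2500}
    five_utr_length: {operator: le, value: 2500}
    # no transcript with multiple ORFs will ever be admitted as an AS event
    number_internal_orfs: {operator: le, value: 1}
```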

Finally, we will consider as fragmentary any non-coding transcript in the neighbourhood of a coding locus. We will consider as potentially fragmentary also any coding transcript with a short ORF (<100 aa, or 300 bps). The expression will be, in this case, very simple:

expression: [selected_cds_length]
parameters:
    selected_cds_length: {operator: gt, value: 300}
Case study 3: comprehensive splicing catalogue

There are cases in which we would like our annotation to be as comprehensive as possible, ie. to include transcripts that we would normally exclude from consideration. For example, we might want to study the prevalence of retained intron events in a sample, or keep events that do not have a good read coverage and whose introns might, therefore, be recognised as invalid by Portcullis. It is possible to tweak Mikado’s behaviour to this end.

Modifying the general configuration file and obtaining a copy of the original template

Like in the second case, we will presume to be working in a similar species to D. melanogaster. Again, we will create the configuration file thus:

daijin configure \
    --scoring dmelanogaster_scoring.yaml --copy-scoring comprehensive.yaml  \
    --flank 200 \
    -i 50 26000 \
    -m permissive \
    -o configuration.yaml \
    --genome genome.fasta \
    --prot-db proteins.fasta

Notice that, compared to the previous example, we reduced the flanking distance to the standard value (200 bps instead of 2000 bps), as we are less worried about fragmentary loci.

In the configuration file, we will change the following:

  • under pick/alternative_splicing:

    • switch “keep_retained_introns” to true

    • switch “only_confirmed_introns” to false

    • potentially, increase the number of isoforms from 5 to 10 or higher

    • consult the documentation on class codes to verify which additional AS events you would like to keep; by default, Mikado will include cases where the transcript has at least one different splice site (j), no splice site in common with the original transcript but introns roughly coincident (h), novel introns in the terminal exons (J) or within the primary mono-exonic transcript (G).

      • For a comprehensive catalogue, we would recommend including at least “C” (transcript roughly contained, but with “spilling” within the intron(s) of the primary transcript).
    • To include transcripts quite dissimilar from the primary, potentially lower the percentages for:

      • min_cds_overlap
      • min_cdna_overlap
      • min_score_perc
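Applied to the configuration file, the changes above could look like this YAML sketch. Note that the key names for the isoform cap (max_isoforms) and the class-code list (valid_ccodes) are assumptions here; check your generated configuration.yaml for the exact names, and treat the lowered percentages as illustrative:

```yaml
pick:
  alternative_splicing:
    keep_retained_introns: true
    only_confirmed_introns: false
    max_isoforms: 10                          # assumed key name for the isoform cap
    valid_ccodes: ["j", "J", "G", "h", "C"]   # assumed key name; default codes plus "C"
    min_cds_overlap: 0.5                      # lowered, illustrative value
    min_cdna_overlap: 0.5                     # lowered, illustrative value
    min_score_perc: 0.5                       # lowered, illustrative value
```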

Warning

The heuristics we are touching in this section are core to the precision of Mikado. For example, allowing Mikado to bring back retained intron events will, by definition, bring into the annotation transcripts that are normally ignored. Please consider this when configuring the run and later, when reviewing the results.

Customising the scoring file

In this case, as we are interested in retaining a greater variety of splicing events, we will concentrate on only one section of the file, ie the “as_requirements” section. Compared to the default settings, we are going to remove the UTR requirements, and bring back transcripts with long UTRs. This is because many transcripts with retained intron events will, by default, have longer UTRs than usual. We will instead rely on the general prioritisation to penalise these transcripts (and thus avoid bringing them back if they are of really poor quality). We will still exclude cases with “suspicious_splicing”, ie cases most probably generated by trans-splicing.

as_requirements:
  expression: [cdna_length and suspicious_splicing]
  parameters:
    cdna_length: {operator: ge, value: 200}
    suspicious_splicing: {operator: ne, value: true}
Usage

Mikado is composed of four different programs (configure, prepare, serialise, pick) which have to be executed serially to go from an ensemble of different assemblies to the final dataset. In addition to these core programs, Mikado provides a utility to compare annotations, similarly to CuffCompare and ParsEval (compare), and various other minor utilities to perform operations such as extracting regions from a GFF, convert between different gene annotation formats, etc.

Mikado pipeline stages
Mikado configure

This utility prepares the configuration file that will be used throughout the pipeline stages. While the most important options can be set at runtime through the command line, many algorithmic details can be accessed and modified only through the file produced by this command.

Important

Please note that any value absent from the configuration at runtime will be set to Mikado’s internal default value.

Input annotation files

The preferred method for providing Mikado configure with the input annotation files is a <TAB> delimited file, where each line specifies a different file name. The fields in this file are as follows, for each row:

  1. File name of the input file. Mandatory
    • The file name must specify a valid path (either absolute or relative) from the folder where Mikado configure is launched to the file itself.
  2. Label for the input file. Mandatory
    • Each label must be unique within the run.
  3. Strandedness of the annotation file. Boolean (either True or False, capitalization is ignored). Mandatory.
    • Reference files will be considered as stranded even if this value is set to False.
  4. Score associated to the input file. Default 0. Optional. Floating numbers only.
    • The score will be used to determine tie winners during Mikado prepare in case of redundant transcripts (see the section on redundant transcripts in prepare). It will also be applied to transcripts during the pick stage. See this section later on.
  5. is_reference: boolean flag (either True or False, capitalization is ignored), default False. Optional.
    • “Reference” transcripts are treated slightly differently from the others in Mikado. Specifically:
      • During the prepare stage, reference transcripts are never discarded on account of redundancy, short size, etc.; moreover, their strand will never be reversed on account of the internal Mikado checks.
      • During the pick stage, reference transcripts can be given an additional boost in their scoring. Moreover, it is possible to instruct Mikado to only look for alternative splicing events of the reference transcripts.
  6. exclude_redundant: boolean flag (either True or False, capitalization is ignored), default False. Optional.
    • If set to True, Mikado prepare will remove any transcript in this annotation that is considered redundant (ie identical or contained within another transcript).
      • By default, Mikado will remove only *identical* copies of transcripts across annotations. Datasets with this flag set will be treated as if they were going through GffRead.
      • See this section for more details.
  7. strip_cds: boolean flag (either True or False, capitalization is ignored), default False. Optional.
    • If set to True, transcripts from this set will have their CDS removed. This is useful e.g. when analysing transcripts coming from GMAP.
  8. skip_split: boolean flag (either True or False, capitalization is ignored), default False. Optional.
    • If set to True, Mikado will not break suspected chimeras for these transcripts, ie transcripts with more than one internal ORF, during the pick stage.
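As an illustration, a list file matching the prepare example shown later in this documentation could look like the following (columns are <TAB>-separated; trailing optional columns may be omitted):

```text
class.gtf	cl	True
stringtie.gtf	st	True	1.0
trinity.gff3	tr	False	-0.5
reference.gff3	at	True	5.0	True	True
```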
Usage

This command will generate a configuration file (in either JSON or YAML format), with the correct configuration for the parameters set on the command line. See the in-depth section on the structure of the configuration file for details.

Selected command line parameters:

  • --full: By default, Mikado configure will output a stripped-down configuration file, with only some of the fields explicitly present. Use this flag to show all the available fields.
  • --seed: random seed for full reproducibility of Mikado runs.
  • --list: the argument to this option specifies the input annotation files for Mikado. See above.

Command line help:

$ mikado configure --help
usage: Mikado configure [-h] [--full] [--seed SEED] [--minimum-cdna-length MINIMUM_CDNA_LENGTH] [--max-intron-length MAX_INTRON_LENGTH] [--scoring SCORING] [--copy-scoring COPY_SCORING]
                        [-i INTRON_RANGE INTRON_RANGE] [--subloci-out SUBLOCI_OUT] [--monoloci-out MONOLOCI_OUT] [--no-pad] [--only-reference-update] [-eri] [-kdc] [--check-references]
                        [-mco MIN_CLUSTERING_CDNA_OVERLAP] [-mcso MIN_CLUSTERING_CDS_OVERLAP] [--strand-specific] [--no-files | --gff GFF | --list LIST] [--reference REFERENCE] [--junctions JUNCTIONS]
                        [-bt BLAST_TARGETS] [--strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES] [--labels LABELS] [--codon-table CODON_TABLE] [--external EXTERNAL] [--daijin] [-bc BLAST_CHUNKS]
                        [--use-blast] [--use-transdecoder] [--mode {nosplit,stringent,lenient,permissive,split} [{nosplit,stringent,lenient,permissive,split} ...]] [--scheduler {local,SLURM,LSF,PBS}]
                        [--exe EXE] [-c CLUSTER_CONFIG] [-t THREADS] [--skip-split SKIP_SPLIT [SKIP_SPLIT ...]] [-j | -y | --toml] [-od OUT_DIR]
                        [out]

Configuration utility for Mikado

positional arguments:
  out

optional arguments:
  -h, --help            show this help message and exit
  --full
  --seed SEED           Random seed number.
  --strand-specific     Boolean flag indicating whether all the assemblies are strand-specific.
  --no-files            Remove all files-specific options from the printed configuration file. Invoking the "--gff" option will disable this flag.
  --gff GFF             Input GFF/GTF file(s), separated by comma
  --list LIST           Tab-delimited file containing rows with the following format: <file> <label> <strandedness(def. False)> <score(optional, def. 0)> <is_reference(optional, def. False)>
                        <exclude_redundant(optional, def. True)>  <skip_split(optional, def. False)> strandedness, is_reference, exclude_redundant and skip_split must be boolean values (True, False) score must be a valid floating number.
  --reference REFERENCE, --genome REFERENCE
                        Fasta genomic reference.
  --strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
                        List of strand-specific assemblies among the inputs.
  --labels LABELS       Labels to attach to the IDs of the transcripts of the input files, separated by comma.
  --codon-table CODON_TABLE
                        Codon table to use. Default: 0 (ie Standard, NCBI #1, but only ATG is considered a valid start codon).
  --external EXTERNAL   External configuration file to overwrite/add values from. Parameters specified on the command line will take precedence over those present in the configuration file.
  -t THREADS, --threads THREADS
  --skip-split SKIP_SPLIT [SKIP_SPLIT ...]
                        List of labels for which splitting will be disabled (eg long reads such as PacBio)
  -j, --json            Output will be in JSON (default: inferred by filename, with TOML as fallback).
  -y, --yaml            Output will be in YAML (default: inferred by filename, with TOML as fallback).
  --toml                Output will be in TOML (default: inferred by filename, with TOML as fallback).
  -od OUT_DIR, --out-dir OUT_DIR
                        Destination directory for the output.

Options related to the prepare stage.:
  --minimum-cdna-length MINIMUM_CDNA_LENGTH
                        Minimum cDNA length for transcripts.
  --max-intron-length MAX_INTRON_LENGTH
                        Maximum intron length for transcripts.

Options related to the scoring system:
  --scoring SCORING     Scoring file to use. Mikado provides the following: mammalian.yaml, plant.yaml, HISTORIC/athaliana_scoring.yaml, HISTORIC/celegans_scoring.yaml, HISTORIC/dmelanogaster_scoring.yaml,
                        HISTORIC/hsapiens_scoring.yaml, HISTORIC/human.yaml, HISTORIC/insects.yaml, HISTORIC/plants.yaml, HISTORIC/scerevisiae.yaml, HISTORIC/worm.yaml
  --copy-scoring COPY_SCORING
                        File into which to copy the selected scoring file, for modification.

Options related to the picking:
  -i INTRON_RANGE INTRON_RANGE, --intron-range INTRON_RANGE INTRON_RANGE
                        Range into which intron lengths should fall, as a couple of integers. Transcripts with intron lengths outside of this range will be penalised. Default: (60, 900)
  --subloci-out SUBLOCI_OUT
                        Name of the optional subloci output. By default, this will not be produced.
  --monoloci-out MONOLOCI_OUT
                        Name of the optional monoloci output. By default, this will not be produced.
  --no-pad              Disable transcript padding. On by default.
  --only-reference-update
                        Flag. If switched on, Mikado will only keep loci where at least one of the transcripts is marked as "reference". CAUTION: new and experimental. If no transcript has been marked as
                        reference, the output will be completely empty!
  -eri, --exclude-retained-introns
                        Exclude all retained intron alternative splicing events from the final output. Default: False. Retained intron events that do not disrupt the CDS are kept by Mikado in the final
                        output.
  -kdc, --keep-disrupted-cds
                        Keep in the final output transcripts whose CDS is most probably disrupted by a retained intron event. Default: False. Mikado will try to detect these instances and exclude them from
                        the final output.
  --check-references    Flag. If switched on, Mikado will also check reference models against the general transcript requirements, and will also consider them as potential fragments. This is useful in the
                        context of e.g. updating an *ab-initio* results with data from RNASeq, protein alignments, etc.
  -mco MIN_CLUSTERING_CDNA_OVERLAP, --min-clustering-cdna-overlap MIN_CLUSTERING_CDNA_OVERLAP
                        Minimum cDNA overlap between two transcripts for them to be considered part of the same locus during the late picking stages. NOTE: if --min-cds-overlap is not specified, it will be
                        set to this value! Default: 20%.
  -mcso MIN_CLUSTERING_CDS_OVERLAP, --min-clustering-cds-overlap MIN_CLUSTERING_CDS_OVERLAP
                        Minimum CDS overlap between two transcripts for them to be considered part of the same locus during the late picking stages. NOTE: if not specified, and --min-cdna-overlap is
                        specified on the command line, min-cds-overlap will be set to this value! Default: 20%.

Options related to the serialisation step:
  --junctions JUNCTIONS
  -bt BLAST_TARGETS, --blast_targets BLAST_TARGETS

Options related to configuring a Daijin run.:
  --daijin              Flag. If set, the configuration file will be also valid for Daijin.
  -bc BLAST_CHUNKS, --blast-chunks BLAST_CHUNKS
                        Number of parallel DIAMOND/BLAST jobs to run. Default: 10.
  --use-blast           Flag. If switched on, Mikado will use BLAST instead of DIAMOND.
  --use-transdecoder    Flag. If switched on, Mikado will use TransDecoder instead of Prodigal.
  --mode {nosplit,stringent,lenient,permissive,split} [{nosplit,stringent,lenient,permissive,split} ...]
                        Mode(s) in which Mikado will treat transcripts with multiple ORFs. - nosplit: keep the transcripts whole. - stringent: split multi-orf transcripts if two consecutive ORFs have both
                        BLAST hits and none of those hits is against the same target. - lenient: split multi-orf transcripts as in stringent, and additionally, also when either of the ORFs lacks a BLAST hit
                        (but not both). - permissive: like lenient, but also split when both ORFs lack BLAST hits - split: split multi-orf transcripts regardless of what BLAST data is available. If multiple
                        modes are specified, Mikado will create a Daijin-compatible configuration file.
  --scheduler {local,SLURM,LSF,PBS}
                        Scheduler to use. Default: None - ie, either execute everything on the local machine or use DRMAA to submit and control jobs (recommended).
  --exe EXE             Configuration file for the executables.
  -c CLUSTER_CONFIG, --cluster_config CLUSTER_CONFIG
                        Cluster configuration file to write to.
Anatomy of the configuration file
Format of the configuration file

The configuration files accepted by Mikado can be in any of three dialects:

  • TOML, the default choice. TOML is an intuitive configuration file format, similar to the INI files used by many programs.
  • YAML, a human-readable configuration file format based on indentation. Less preferred because of the unreadability of deeply-nested values.
  • JSON, a less human-readable file format that is commonly used to pass data across processes / programs.

We leave the user free to select their preferred file format. In this section, we will use TOML to illustrate the different sections of the file.

Global options

The following options apply to all programs in the Mikado pipeline, and they refer to general parameters such as logging verbosity, number of threads, etc.

Parameters:

  • threads: this is the number of processes/threads that will be requested by the Mikado programs. This parameter can be overridden on the command line.
  • seed: random seed specification, to ensure maximum reproducibility of the run.
  • multiprocessing_method: this specifies the way that Python will start children processes. The possible choices are “spawn” (default), “fork” and “forkserver”. See the sidebar for a more complete explanation.
threads = 4
seed = 0
multiprocessing_method = "spawn"
Log settings

It is possible to set high-level settings for the logs in the log_settings section:

  • log_level: level of the logging for Mikado. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL. By default, Mikado will be quiet and output log messages of severity WARNING or greater.
  • sql_level: level of the logging for messages regarding the database connection (through SQLAlchemy). By default, SQLAlchemy will be set in quiet mode and asked to output only messages of severity WARNING or greater.

Warning

Mikado and SQLAlchemy can be greatly verbose if asked to output DEBUG or INFO messages, to the point of slowing down the program significantly due to the amount of writing to disk. Please consider setting the level to DEBUG only when there is a real problem to debug, not otherwise!

[log_settings]
# Settings related to the logs. Keys:
# - sql_level: verbosity for SQL calls. Default: WARNING. In decreasing order: 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'
# - log_level: verbosity. Default: INFO. In decreasing order: 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'
log_level = "INFO"
sql_level = "WARNING"
log = ""
Database settings

This section deals with the database settings that will be necessary for the serialisation and picking phases of the pipeline. By default, Mikado will use a SQLite database, but it currently also supports MySQL and PostgreSQL through SQLAlchemy. Fields:

  • db: name of the database to use. In case the database is SQLite, this will be the database file, otherwise it will be the database name.
  • dbtype: one of sqlite, mysql or postgresql.
  • dbhost: host where the database is located. Required with MySQL and PostgreSQL.
  • dbuser: User of the database. Required with MySQL and PostgreSQL.
  • dbpasswd: Database password. Required with MySQL and PostgreSQL.
  • dbport: Port to access to the database. It defaults to the normal ports for the selected database.
[db_settings]
# Settings related to DB connection. Parameters:
# db: the DB to connect to. Required. Default: mikado.db
# dbtype: Type of DB to use. Choices: sqlite, postgresql, mysql. Default: sqlite.
db = "/c/Users/lucve/PycharmProjects/EICore/mikado/sample_data/mikado.db"
dbtype = "sqlite"
dbhost = "localhost"
dbuser = ""
dbpasswd = ""
dbport = 0
Reference settings

This section of the configuration file deals with the reference genome. It specifies the following fields:

  • genome: the genome FASTA file. Required.
  • genome_fai: FAI index of the genome. Used by Mikado serialise, it can be inferred if left null.
  • transcriptome: optional annotation file for the genome. Mikado currently ignores this field, but it is used by Daijin to guide some of the RNA-Seq assemblies.
[reference]
genome = "chr5.fas.gz"
genome_fai = ""
transcriptome = ""
Settings for the prepare stage

This section of the configuration file deals with the prepare stage of Mikado. It specifies the input files, their labels, and which of them are strand specific. The available fields are the following:

  • exclude_redundant: if set to true, Mikado will only keep one copy of transcripts that are identical to, or contained within, a different transcript. Please note that this global value, if set to true, overrides the label-specific settings.
  • canonical: this field specifies the splice site donors and acceptors that are considered canonical for the species. By default, Mikado uses the canonical splice site (GT/AG) and the two semi-canonical pairs (GC/AG and AT/AC). Type: Array of two-element arrays, composed of two-letter strings.
  • lenient: boolean value. If set to false, transcripts that either have only non-canonical splice sites or have a mixture of canonical junctions on both strands will be removed from the output. Otherwise, they will be left in and properly tagged.
  • minimum_cdna_length: minimum length of the transcripts to be kept.
  • max_intron_length: Transcripts with introns greater than this will be discarded. The default is one million base pairs (effectively disabling the option).
  • strand_specific: boolean. If set to true, all input assemblies will be treated as strand-specific, therefore keeping the strand of monoexonic fragments as it is. Multiexonic transcripts will not have their strand reversed even if doing so would make some or all non-canonical junctions canonical.
  • strip_cds: boolean. If set to true, the CDS features will be stripped off the input transcripts. This might be necessary for eg transcripts obtained through alignment with GMAP [GMAP].
  • single: boolean. For debug purposes only. If set to true, Mikado will disable multiprocessing.
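In TOML, these options sit under the [prepare] table. The sketch below mirrors the behaviour described above; the only value taken from the text is the one-million default for max_intron_length, while the others are illustrative and should be checked against your own configuration file:

```toml
[prepare]
exclude_redundant = false
lenient = false
minimum_cdna_length = 200
max_intron_length = 1000000
strand_specific = false
strip_cds = false
single = false
canonical = [["GT", "AG"], ["GC", "AG"], ["AT", "AC"]]
```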
Settings for the prepare stage: files settings

Important

As this section contains multiple linked lists, it is recommended to not edit this part of the configuration file directly, but rather, to rely on the mikado configure utility / mikado prepare interface to set it up. Specifically, setting up this section through the use of a file of file names is highly recommended.

This sub-section is the most important for prepare, as it contains among other things the locations and labels for the input files.

  • output_dir: destination folder for the output files and the log. It will be created automatically, if it does not already exist on disk.
  • out: name of the output GTF file. Default: mikado_prepared.gtf.
  • out_fasta: name of the output FASTA file. Default: mikado_prepared.fasta.
  • log: name of the log file. Default: prepare.log.
  • gff: list of filenames of the input files.
  • labels: list of labels associated with the input files.
  • reference: list of boolean values, indicating whether each input assembly is to be considered of “reference” quality.
  • strand_specific_assemblies: list of boolean values, indicating whether each input assembly is to be considered as having trustworthy strand information.
  • strip_cds: list of boolean values, indicating whether the CDS of a given assembly should be ignored.
  • exclude_redundant: list of boolean values, indicating whether redundant models present in this assembly should be discarded or not.
  • source_score: dictionary linking each assembly, via its label, to a specific score. This score is applied at two different points: during the prepare stage itself, to establish an order of priority for transcripts coming from different assemblies; and during the picking stage, when the score is added to each model from this assembly, therefore influencing the picked models.
[prepare.files]
# Options related to the input and output files.
# - out: output GTF file
# - out_fasta: output transcript FASTA file
# - gff: array of input predictions for this step.
# - labels: labels to be associated with the input GFFs. Default: None.
# - reference: these files are treated as reference-like, ie, these transcripts will never get discarded
#   during the preparation step.
output_dir = "."
out = "mikado_prepared.gtf"
out_fasta = "mikado_prepared.fasta"
log = "prepare.log"
gff = ["class.gtf", "cufflinks.gtf", "stringtie.gtf", "trinity.gff3", "reference.gff3"]
labels = ["cl", "cuff", "st", "tr", "at"]
strand_specific_assemblies = ["class.gtf", "cufflinks.gtf", "stringtie.gtf", "reference.gff3"]
reference = [false, false, false, false, true]
exclude_redundant = [false, false, true, false, true]
strip_cds = [false, false, false, false, false]

[prepare.files.source_score]
cl = 0
cuff = 0
st = 1.0
tr = -0.5
at = 5.0
Settings for the serialisation stage

This section of the configuration file deals with the serialisation stage of Mikado. It specifies the location of the ORF BED12 files from TransDecoder, the location of the XML files from BLAST, the location of Portcullis junctions, and other details important at run time. It has the following fields:

  • substitution_matrix: the matrix used by BLAST or DIAMOND. Default is the standard BLOSUM62.
  • force: whether the database should be truncated and rebuilt, or just updated.
  • max_objects: this parameter is quite important when running with a SQLite database. SQLite does not support caching on the disk before committing the changes, so that every change has to be kept in memory. This can become a problem for RAM quite quickly. On the other hand, committing is an expensive operation, and it makes sense to minimise calls as much as possible. This parameter specifies the maximum number of objects Mikado will keep in memory before committing them to the database. The default number of 10 million privileges speed over RAM parsimony.
  • max_regression: this parameter is a float comprised between 0 and 1. Prodigal and TransDecoder will sometimes output open ORFs even in the presence of an in-frame start codon. Mikado can try to “regress” along the ORF until it finds one such start codon. This parameter sets how far Mikado will regress, as a fraction of the cDNA length.

Note

Recent versions of TransDecoder perform an analogous process by default. As such, we advise keeping this switch off if TransDecoder is used.

  • codon_table: this parameter indicates the codon table to use. We use the NCBI nomenclature, with a variation:
    • the code “0” is added to indicate a variation on the standard code (identifier “1”), which differs only in that only “ATG” is considered as a valid start codon. This is because in silico ORF predictions tend to over-predict the presence of non-standard “ATG” codons, which are rare in nature.
  • max_target_seqs: equivalent to the BLAST+ parameter of the same name - it indicates the maximum number of discrete hits that can be assigned to one sequence in the database.
  • single_thread: boolean, if set to true it will forcibly disable multi-threading. Useful mostly for debugging purposes.
[serialise]
# Options related to serialisation
# - force: whether to drop and reload everything into the DB
# - files: options related to input files
# - max_objects: Maximum number of objects to keep in memory while loading data into the database
# - max_regression: if the ORF lacks a valid start site, this percentage indicates how far
#   along the sequence Mikado should look for a good start site. Eg. with a value of 0.1,
#   on a 300bp sequence with an open ORF Mikado would look for an alternative in-frame start codon
#   in the first 30 bps (10% of the cDNA).
# - max_target_seqs: equivalently to BLAST, it indicates the maximum number of targets to keep
#   per blasted sequence.
# - discard_definition: Boolean. **Deprecated**, it was used for specifying how to load BLAST files.
# - single_thread: if true, Mikado prepare will force the usage of a single thread in this step.
# - codon_table: codon table to use for verifying/modifying the ORFs. Default: 0, ie
#  the universal codon table but enforcing as only valid start codon ATG.
substitution_matrix = "blosum62"
max_objects = 10000000
max_regression = 0.2
start_adjustment = true
max_target_seqs = 100000
force = false
single_thread = false
codon_table = 0
Settings for the serialisation stage: files settings

This sub-section of the configuration file codifies the location of the input and output files for serialise. It contains the following fields:

  • junctions: array of locations of reliable junction files. These must be in BED12 format. The preferred source for this is Portcullis [Portcullis].
  • log: log file.
  • orfs: array of locations of the ORF files (ORF positions on the cDNAs, in BED12 format), as created by eg TransDecoder [Trinity].
  • output_dir: output directory where the log file and the SQLite database will be written to (if SQLite has been chosen as the database type)
  • transcripts: input transcripts. This should be set to be equal to the output of Mikado prepare, ie the “out_fasta” field of the prepare section of the configuration file.
  • external_scores: this field indicates the location of a tabular file containing additional numeric values to be added to Mikado.
  • xml: this array indicates the location of the BLAST output file(s). Please see the section on serialisation for details on the accepted formats.
[serialise.files]
junctions = ["junctions.bed"]
xml = []
blast_loading_debug = false
external_scores = ""
orfs = []
transcripts = "mikado_prepared.fasta"
log = "serialise.log"
blast_targets = ["uniprot_sprot_plants.fasta"]
output_dir = "."
Settings for the pick stage

This section of the configuration file deals with the picking stage of Mikado. It specifies details on how to handle BLAST and ORF data, which alternative splicing events are considered as valid during the final stages of the picking, and other important algorithmic details. The section comprises the following subsections:

  • alternative_splicing: Options related to which AS events are considered as valid for the primary transcript in a locus.
  • chimera_split: Options related to how to handle transcripts with multiple valid ORFs.
  • files: Input and output files.
  • orf_loading: Options related to how to decide which ORFs to load onto each transcript.
  • output_format: options related to how to format the names of the transcripts, the source field of the GFFs, etc.
  • run_options: Generic options related either to the general algorithm or to the number of resources requested.
  • scoring_file: This value specifies the scoring file to be used for Mikado. These can be found in Mikado.configuration.scoring_files.

Hint

It is possible to ask for the configuration file to be copied in-place for customisation when calling mikado configure.

Each subsection of the pick configuration will be explained in its own right.

Giving different priorities to transcripts from different assemblies

It is possible to specify boni/mali to be assigned to specific labels. Eg, it might be possible to assign a bonus of 1 to any transcript coming from PacBio reads, or a malus to any transcript coming from a given assembler. Example of such a configuration:

[prepare.files.source_score]
cl = 0
cuff = 0
st = 1.0
tr = -0.5
at = 5.0

In this example, we are prioritising the reference annotation (“at”) by five points, the StringTie assembly by 1, and slightly penalising the Trinity assembly with a malus of half a point.
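As a minimal sketch (not Mikado's implementation), the effect of these boni/mali on otherwise identical transcripts can be illustrated like this, reusing the labels and values from the example above:

```python
# Hypothetical sketch: per-label source scores are simply added to a
# transcript's score, biasing the choice among identical transcripts.
source_score = {"cl": 0, "cuff": 0, "st": 1.0, "tr": -0.5, "at": 5.0}

def adjusted_score(base_score: float, label: str) -> float:
    # Unknown labels get no bonus or malus.
    return base_score + source_score.get(label, 0)

# Two identical transcripts with the same base score: the "at" one wins.
print(adjusted_score(10.0, "at") > adjusted_score(10.0, "st"))  # True
```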

Parameters regarding the alternative splicing

After selecting the best model for each locus, Mikado will backtrack and try to select valid alternative splicing events. This section deals with how Mikado will operate the selection. In order to be considered as valid potential AS events, transcripts have to satisfy the minimum requirements specified in the scoring file. These are the available parameters:

  • report: boolean. Whether to calculate and report possible alternative splicing events at all. By default this is set to true; *setting this parameter to false will inactivate all the options in this section*.
  • keep_retained_introns: boolean. It specifies whether transcripts with retained introns will be accepted as potentially valid AS events. By default, they are.
  • keep_cds_disrupted_by_ri: boolean. It specifies whether transcripts with a CDS disrupted by their retained intron will be accepted as potentially valid AS events. By default, Mikado will exclude them.
  • min_cdna_overlap: minimum cDNA overlap between the primary transcript and the AS candidate. By default, this is set to 0.5 (50%). It must be a number between 0 and 1.
  • min_cds_overlap: minimum CDS overlap between the primary transcript and the AS candidate. By default this is set to 0.6, ie 60%. It must be a number between 0 and 1.
  • min_score_perc: Minimum percentage of the score of the primary transcript that any candidate AS must have to be considered. By default, this is set to 0.5 (50%). It must be a number between 0 and 1.
  • only_confirmed_introns: boolean. If set to true (default), Mikado will consider as potential AS events only transcripts whose introns not shared with the primary transcript are confirmed in the dataset of reliable junctions.
  • redundant_ccodes: any candidate AS will be compared against all the transcripts already retained in the locus. If any of these comparisons returns one of the class codes specified in this array, the transcript will be ignored. The rationale is to avoid bringing back multiple minor variations of the same transcript. Default class codes: c, m, _, =, n.
  • valid_ccodes: any candidate AS will be compared against the primary transcript to determine the type of AS event. If the class code is one of those specified in this array, the transcript will be considered further. Valid class codes are within the categories “Alternative splicing”, “Extension” with junction F1 lower than 100%, and Overlap (with the exclusion of “m”). Default class codes: j, J, G, h.
  • pad: boolean option. If set to True, Mikado will try to pad transcripts so that they share the same 5’. Please see this section for further information.
  • ts_max_splices: numerical. When padding is activated, at most how many splice junctions can be introduced?
  • ts_distance: numerical. When padding is activated, by at most how many base pairs can a transcript be extended?
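The overlap and score thresholds above can be sketched as a simple acceptance check (illustrative only, not Mikado's implementation; the defaults mirror the bullets above):

```python
# Sketch of the acceptance checks for a candidate AS event against the
# primary transcript. All thresholds are fractions between 0 and 1.
def valid_as_candidate(cdna_overlap, cds_overlap, candidate_score, primary_score,
                       min_cdna_overlap=0.5, min_cds_overlap=0.6,
                       min_score_perc=0.5):
    return (cdna_overlap >= min_cdna_overlap
            and cds_overlap >= min_cds_overlap
            and candidate_score >= primary_score * min_score_perc)

# A candidate with ample overlap but only 40% of the primary's score fails.
print(valid_as_candidate(0.8, 0.7, candidate_score=6, primary_score=10))  # True
print(valid_as_candidate(0.8, 0.7, candidate_score=4, primary_score=10))  # False
```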

Warning

the candidate AS transcript does not need to be a valid AS event against all transcripts in the locus, only against the primary transcript.

Note

when padding transcripts, Mikado will also consider transcripts with the same intron structure but differing end points (ie “=” or “_”). These will be used to expand the UTRs of other transcripts; however, only one of these transcripts with identical structures will be reported in the end.

[pick.alternative_splicing]
# Parameters related to alternative splicing reporting.
# - report: whether to report at all or not the AS events.
# - min_cds_overlap: minimum overlap between the CDS of the primary transcript and any AS event. Default: 60%.
# - min_cdna_overlap: minimum overlap between the cDNA of the primary transcript and any AS event.
# Default: 0% i.e. disabled, we check for the CDS overlap.
# - keep_retained_introns: Whether to consider as valid AS events where one intron
# is retained compared to the primary or any other valid AS. Default: false.
# - max_isoforms: Maximum number of isoforms per locus. 1 implies no AS reported. Default: 5
# - valid_ccodes: Valid class codes for AS events. Valid codes are in categories
# 'Alternative splicing', 'Extension' (with junction F1 lower than 100%), and Overlap (excluding m). Default: j, J, G, h
# - max_utr_length: Maximum length of the UTR for AS events. Default: 10e6 (i.e. no limit)
# - max_fiveutr_length: Maximum length of the 5'UTR for AS events. Default: 10e6 (i.e. no limit)
# - max_threeutr_length: Maximum length of the 3'UTR for AS events. Default: 10e6 (i.e. no limit)
# - min_score_perc: Minimum score threshold for subsequent AS events. Only transcripts with a score of at least (best) * value are retained.
# - only_confirmed_introns: bring back AS events only when their introns are either present in the primary transcript or in the set of confirmed introns.
# - pad: boolean switch. If true, Mikado will pad all the transcripts in a gene so that their ends are the same
# - ts_distance: if padding, this is the maximum distance in base pairs between the starts of transcripts to be considered to be padded together.
# - ts_max_splices: if padding, this is the maximum number of splice junctions that the transcript to pad is allowed to cross. If padding would lead to crossing more than this number, the transcript will not be padded.
report = true
min_cds_overlap = 0.5
min_cdna_overlap = 0.6
keep_retained_introns = true
keep_cds_disrupted_by_ri = false
max_isoforms = 10
valid_ccodes = ["j", "J", "G", "h"]
redundant_ccodes = ["c", "m", "_", "=", "n"]
min_score_perc = 0.5
only_confirmed_introns = true
ts_distance = 2000
pad = true
ts_max_splices = 2
Parameters regarding the clustering of transcripts in loci

This section influences how Mikado clusters transcripts in its multi-stage selection. The available parameters are:

  • flank: numerical. When constructing Superloci, Mikado will use this value as the maximum distance between transcripts for them to be integrated within the same superlocus.
  • cds_only: boolean. If set to true, during the picking stage Mikado will consider only the primary ORF to evaluate whether two transcripts intersect. Transcripts which eg share introns in their UTRs but have completely unrelated CDSs will be clustered separately. Disabled by default.
  • purge: boolean. If true, any transcript failing the specified requirements will be purged out. Otherwise, they will be assigned a score of 0 and might potentially appear in the final output, if no other transcript is present in the locus.
  • simple_overlap_for_monoexonic: boolean. During the second clustering, by default monoexonic transcripts are clustered together even if they have a very slight overlap with another transcript. Manually setting this flag to false will cause Mikado to cluster monoexonic transcripts only if they have a minimum amount of cDNA and CDS overlap with the other transcripts in the holder.
  • min_cdna_overlap: numerical, between 0 and 1. Minimum cDNA overlap between two multiexonic transcripts for them to be considered as intersecting, if all other conditions fail.
  • min_cds_overlap: numerical, between 0 and 1. Minimum CDS overlap between two multiexonic transcripts for them to be considered as intersecting, if all other conditions fail.
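The flank parameter can be sketched as a simple gap-based grouping of sorted intervals (an illustrative helper, not Mikado's clustering code):

```python
# Illustrative sketch: grouping transcripts into superloci using the "flank"
# parameter, i.e. the maximum allowed gap between consecutive intervals.
def superloci(intervals, flank=200):
    """intervals: list of (start, end) tuples; they are sorted internally."""
    groups = []
    for start, end in sorted(intervals):
        # Join the previous group if the gap to its last interval is small enough.
        if groups and start - groups[-1][-1][1] <= flank:
            groups[-1].append((start, end))
        else:
            groups.append([(start, end)])
    return groups

# A 150 bp gap is bridged with flank=200; a 1,000 bp gap is not.
print(len(superloci([(1, 100), (250, 400), (1400, 1600)], flank=200)))  # 2
```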
[pick.clustering]
# Parameters related to the clustering of transcripts into loci.
# - cds_only: boolean, it specifies whether to cluster transcripts only according to their CDS (if present).
# - min_cds_overlap: minimal CDS overlap for the second clustering.
# - min_cdna_overlap: minimal cDNA overlap for the second clustering.
# - flank: maximum distance for transcripts to be clustered within the same superlocus.
# - remove_overlapping_fragments: boolean, it specifies whether to remove putative fragments.
# - purge: boolean, it specifies whether to remove transcripts which fail the minimum requirements check - or whether to ignore those requirements altogether.
# - simple_overlap_for_monoexonic: boolean. If set to true (default), any overlap is enough to include a monoexonic transcript in a locus. If set to false, the normal checks on the percentage of overlap will apply.
# - max_distance_for_fragments: maximum distance from a valid locus for another to be considered a fragment.
cds_only = false
min_cds_overlap = 0.2
min_cdna_overlap = 0.2
purge = true
flank = 200
simple_overlap_for_monoexonic = true
Parameters regarding the detection of putative fragments

This section determines how Mikado treats potential fragments in the output. Available options:

  • remove: boolean, default true. If set to true, fragments will be excluded from the final output; otherwise, they will be printed out, but properly tagged.
  • max_distance: numerical. For non-overlapping fragments, this value determines the maximum distance from the valid gene. Eg. with the default setting of 2000, a putative fragment at a distance of 1000 bps will be tagged and dealt with as a fragment; an identical model at a distance of 3000 bps will be considered as a valid gene and left untouched.
  • valid_class_codes: valid class codes for potential fragments. Only Class Codes in the categories Overlap, Intronic, Fragment, with the addition of “_”, are considered as valid choices.
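The distance rule above amounts to a one-line check (hypothetical helper, not Mikado API):

```python
# Sketch of the max_distance rule for non-overlapping putative fragments.
def treat_as_fragment(distance_from_valid_gene: int, max_distance: int = 2000) -> bool:
    # Within max_distance: tagged and dealt with as a fragment.
    # Farther away: considered a valid gene and left untouched.
    return distance_from_valid_gene <= max_distance

print(treat_as_fragment(1000))  # True: tagged as a fragment
print(treat_as_fragment(3000))  # False: considered a valid gene
```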
[pick.fragments]
# Parameters related to the handling of fragments.
# - remove: boolean. Whether to remove fragments or leave them, properly tagged.
# - max_distance: maximum distance of a putative fragment from a valid gene.
# - valid_class_codes: which class codes will be considered as fragments. Default: (p, P, x, X, i, m, _). Choices: '_' plus any class code with category 'Intronic', 'Fragment', or 'Overlap'.
remove = true
max_distance = 2000
valid_class_codes = ["p", "P", "x", "X", "i", "m", "_", "e", "o"]
Parameters regarding assignment of ORFs to transcripts

This section of the configuration file deals with how to determine valid ORFs for a transcript from those present in the database. The parameters to control the behaviour of Mikado are the following:

  • minimal_orf_length: minimal length of the primary ORF to be loaded onto the transcript. By default, this is set at 50 bps (not amino acids)
  • minimal_secondary_orf_length: minimal length of any ORF that can be assigned to the transcript after the first. This value should be set higher than minimal_orf_length, in order to avoid loading uORFs [uORFs] into the transcript, leading to spurious breakdowns of the UTRs. Default: 200 bps.
  • strand_specific: boolean. If set to true, only ORFs on the plus strand (ie the same as the cDNA) will be considered. If set to false, monoexonic transcripts might have their strand flipped.
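A minimal sketch of these thresholds, assuming ORFs are presented best-first (illustrative only, not Mikado's code):

```python
# Sketch of the ORF-loading thresholds. ORFs are (length_in_bps, on_plus_strand)
# tuples, best first; the first kept ORF is the primary one.
def loadable_orfs(orfs, minimal_orf_length=50, minimal_secondary_orf_length=200,
                  strand_specific=True):
    kept = []
    for length, on_plus in orfs:
        if strand_specific and not on_plus:
            continue  # skip ORFs on the opposite strand of the cDNA
        # Secondary ORFs must clear the higher threshold, to keep out uORFs.
        threshold = minimal_orf_length if not kept else minimal_secondary_orf_length
        if length >= threshold:
            kept.append((length, on_plus))
    return kept

# A 60 bp primary ORF is kept, but a 60 bp secondary ORF (eg a uORF) is not.
print(loadable_orfs([(60, True), (60, True), (300, True)]))
```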
[pick.orf_loading]
# Parameters related to ORF loading.
# - minimal_secondary_orf_length: Minimum length of a *secondary* ORF to be loaded after the first, in bp. Default: 200 bps
# - minimal_orf_length: Minimum length in bps of an ORF to be loaded, as the primary ORF, onto a transcript. Default: 50 bps
# - strand_specific: Boolean flag. If set to true, monoexonic transcripts will not have their ORF reversed even if they would have an ORF on the opposite strand.
minimal_secondary_orf_length = 200
minimal_orf_length = 50
strand_specific = true
Parameters regarding splitting of chimeras

This section determines how Mikado will deal with chimeras. These are the relevant parameters:

  • execute: boolean. If set to false, Mikado will operate in the nosplit mode. If set to true, the choice of the mode will be determined by the other parameters.
  • skip: this is a list of input assemblies (identified by the label in prepare, above) that will never have their transcripts split.

Hint

cDNAs, reference transcripts, and the like should end up in the “skip” category. These are, after all, transcripts that are presumed to originate from a single RNA molecule and therefore to be without fusions.

  • blast_check: boolean. Whether to execute the check on the BLAST hits. If set to false, Mikado will operate in the split mode, unless execute is set to false (execute takes precedence over the other parameters).

  • blast_params: this section contains the settings relative to the permissive, lenient and stringent mode.

    • evalue: maximum evalue of a hit to be assigned to the transcript and therefore be considered.
    • hsp_evalue: maximum evalue of an HSP inside a hit for it to be considered in the analysis.
    • leniency: one of LENIENT, PERMISSIVE, STRINGENT. See above for definitions.
    • max_target_seqs: integer. when loading BLAST hits from the database, only the first N will be considered for analysis.
    • minimal_hsp_overlap: number between 0 and 1. This indicates the overlap that must exist between the HSP and the ORF for the former to be considered for the split.
    • min_overlap_duplication: in the case of tandem duplicated genes, a chimera will have two ORFs that share the same hits, but possibly in a peculiar way - the HSPs will map onto the same region of the target sequence. This parameter controls how much overlap counts as a duplication. The default value is 0.9 (90%).
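The minimal_hsp_overlap check can be sketched as the fraction of the ORF covered by the HSP (not reciprocal); coordinates here are invented half-open intervals:

```python
# Sketch of the minimal_hsp_overlap computation: the fraction of the ORF
# interval covered by the HSP interval. Not Mikado's implementation.
def hsp_orf_overlap(orf, hsp):
    """orf, hsp: (start, end) half-open intervals on the transcript."""
    overlap = max(0, min(orf[1], hsp[1]) - max(orf[0], hsp[0]))
    return overlap / (orf[1] - orf[0])

# An HSP covering 60 bps of a 100 bp ORF passes a 0.5 threshold.
print(hsp_orf_overlap((0, 100), (40, 120)) >= 0.5)  # True
```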
[pick.chimera_split]
# Parameters related to the splitting of transcripts in the presence of
# two or more ORFs. Parameters:
# - execute: whether to split multi-ORF transcripts at all. Boolean.
# - blast_check: whether to use BLAST information to take a decision. See blast_params for details.
# - blast_params: Parameters related to which BLAST data we want to analyse.
blast_check = true
execute = true
skip = [false, false, false, false, false]

[pick.chimera_split.blast_params]
# Parameters for the BLAST check prior to splitting.
# - evalue: Maximum evalue for the whole hit. Default: 1e-6
# - hsp_evalue: Maximum evalue for any HSP in the hit (some might be discarded even if the whole hit is valid). Default: 1e-6
# - leniency: One of 'STRINGENT', 'LENIENT', 'PERMISSIVE'. Default: STRINGENT
# - max_target_seqs: maximum number of hits to consider. Default: 3
# - minimal_hsp_overlap: minimum overlap of the ORF with the HSP (*not* reciprocal). Default: 0.8, i.e. 80%
# - min_overlap_duplication: minimum overlap (in %) for two ORFs to consider their hits as target duplications. This means that if two ORFs have no HSPs in common, but the union of their disjoint HSPs covers more than this percentage of the length of the *target*, they most probably represent a duplicated gene.
evalue = 1e-06
hsp_evalue = 1e-06
leniency = "PERMISSIVE"
max_target_seqs = 3
minimal_hsp_overlap = 0.5
min_overlap_duplication = 0.8
Parameters regarding input and output files

The “files” and “output_format” sections deal respectively with input files for the pick stage and with some basic settings for the GFF output. Options:

  • input: input GTF file for the run. It should be the one generated by the prepare stage, ie the out file of the prepare stage.
  • loci_out: main output file. It contains the winning transcripts, separated in their own gene loci, in GFF3 format. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
  • log: name of the log file. Default: mikado_pick.log
  • monoloci_out: this optional output file will contain the transcripts that have been passed to the monoloci phase. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
  • subloci_out: this optional output file will contain the transcripts that have been passed to the subloci phase. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
[pick.files]
# Input and output files for Mikado pick.
# - gff: input GTF/GFF3 file. Default: mikado_prepared.gtf
# - loci_out: output GFF3 file from Mikado pick. Default: mikado.loci.gff3
# - subloci_out: optional GFF file with the intermediate subloci. Default: no output
# - monoloci_out: optional GFF file with the intermediate monoloci. Default: no output
# - log: log file for this step.
output_dir = "."
input = "mikado_prepared.gtf"
loci_out = "mikado.loci.gff3"
subloci_out = ""
monoloci_out = ""
log = "pick.log"
Parameters regarding the output format

Available parameters:

  • id_prefix: prefix for all the final Mikado models. The ID will be <prefix>.<chromosome>G<progressive ID>.
  • report_all_orfs: some Mikado models will have more than one ORF (unless pick is operating in the split mode). If this option is set to true, Mikado will report the transcript multiple times, one for each ORF, using different progressive IDs (<model name>.orf<progressive ID>). By default, this option is set to False, and only the primary ORF is reported.
  • source: prefix for the source field in the output files. Loci GFF3 will have “<prefix>_loci”, subloci GFF3s will have “<prefix>_subloci”, and monoloci will have “<prefix>_monoloci”.
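As a minimal sketch, the default naming scheme above can be reproduced with an f-string (all values here are invented):

```python
# Illustrative construction of a Mikado gene ID: <prefix>.<chromosome>G<progressive ID>.
id_prefix, chromosome, progressive = "mikado", "Chr1", 1
gene_id = f"{id_prefix}.{chromosome}G{progressive}"
print(gene_id)  # mikado.Chr1G1
```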
Generic parameters on the pick run

This section deals with other parameters necessary for the run, such as the number of processors to use, but also more important algorithmic parameters such as how to recognise fragments.

Parameters:

  • exclude_cds: whether to remove CDS/UTR information from the Mikado output. Default: false.
  • intron_range: tuple that indicates the range of lengths in which most introns should fall. Transcripts with introns either shorter or longer than this interval will be potentially penalised, depending on the scoring scheme. For the paper, this parameter was set to a tuple of integers within which 98% of the introns of the reference annotation fell (ie cutting out the 1st and 99th percentiles).
  • shm: boolean. In certain cases, especially when disk access is a severely limiting factor, it might make sense to copy a SQLite database into RAM before querying. If this parameter is set to true, Mikado will copy the SQLite database into a temporary file in RAM, and query it from there.
  • only_reference_update:
  • check_references:
  • single_thread: boolean. If set to true, Mikado will completely disable multiprocessing. Useful mostly for debugging reasons.

Warning

the shared-memory options are available only on Linux platforms.

[pick.run_options]
# Generic run options.
# - shm: boolean flag. If set and the DB is sqlite, it will be copied onto the /dev/shm faux partition
# - exclude_cds: boolean flag. If set, the CDS information will not be printed in Mikado output. Default: false
# - single_thread: boolean flag. If set, multithreading will be disabled - useful for profiling and debugging.
shm = false
exclude_cds = false
intron_range = [60, 10000]
only_reference_update = false
check_references = false
single_thread = false
Technical details

The configuration file obeys a specific schema defined by the Mikado.configuration.daijin_configuration or the Mikado.configuration.configuration. Every time a Mikado utility is launched, it checks the configuration file against the schema defined by those classes to validate it.

Mikado prepare

This is the first executive step of the Mikado pipeline. It will accomplish the following goals:

  1. Collect annotations from disparate annotation files.
  2. Remove redundant assemblies, ie assemblies that are identical across the various input files.
  3. Determine the strand of the transcript junctions.
  4. Ensure uniqueness of the transcript names.
  5. Order the transcripts by locus.
  6. Extract the transcript sequences.
Usage

Mikado prepare makes it possible to override some of the parameters present in the configuration file through command line options, eg. the input files. Notwithstanding, in the interest of reproducibility we advise configuring everything through the configuration file and supplying it to Mikado prepare without further modifications.

Available parameters:

  • configuration: the most important parameter. This is the configuration file created through Mikado configure.
  • fasta: reference genome. Required, either through the command line or through the configuration file.
  • out: Output GTF file, with the collapsed transcripts.
  • out_fasta: Output FASTA file of the collapsed transcripts.
  • start-method: multiprocessing start method.
  • verbose, quiet: flags to set the verbosity of Mikado prepare. It is generally not advised to turn the verbose mode on, unless there is a problem to debug, given the verbosity of the output.
  • strand-specific: If set, all assemblies will be treated as strand-specific.
  • strand-specific-assemblies: comma-separated list of strand specific assemblies.
  • strip-cds: some aligners (eg GMAP) will try to calculate a CDS on the fly for the alignments. Use this flag to discard all CDS information from all input transcripts.
  • exclude-redundant: if set, this flag instructs Mikado to look for and simplify redundant intron chains. By default, this option is disabled, or enabled on a per-sample basis. See this section for an explanation of redundancy removal in Mikado, and the section on the list input files below for an explanation on how to set this value on a per-sample basis (recommended).
  • codon-table: Mikado prepare will check the validity of the ORFs of input models. This value indicates which codon table Mikado should use for this purpose. See the section on the checks on CDSs.
  • strip-faulty-cds: when encountering a transcript with an invalid ORF due to e.g. in-frame stop codons, Mikado will usually discard the whole transcript. If this flag is set, Mikado will instead remove the CDS information and leave the transcript in place.
  • list: in alternative to specifying all the information on the command line, it is possible to give to Mikado a tab-separated file with details of the files to use. See this section for details.
  • log: log file. Optional, by default Mikado will print to standard error.
  • lenient: flag. If set, multiexonic transcripts without any canonical splice site will be output as well. By default, they would be discarded.
  • minimum-cdna-length: minimum length of the transcripts to be kept, default 200 bps.
  • max-intron-size: maximum length of introns for non-reference transcripts, default 1,000,000 bps. Transcripts with introns longer than this will be split into multiple pieces. Note: transcripts marked as “reference” will never be split in this way; Mikado will simply emit a warning in the log.
  • seed: integer seed to use for reproducibility. By default, Mikado will use the seed set in the configuration file.
  • random-seed: boolean switch. If selected, Mikado will use a random seed selected at runtime (and reported in the log)
  • single: flag that disables multiprocessing. Mostly useful for debug purposes.
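A sketch of the tab-separated list file described above, following the column order given in the --list help text (file names and labels are invented):

```
class.gtf	cl	False	0	False	True
stringtie.gtf	st	True	1	False	True
reference.gff3	at	True	5	True	True
```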

Command line usage:

$ mikado prepare --help
usage: Mikado prepare [-h] [--fasta REFERENCE] [--verbose | --quiet | -lv {DEBUG,INFO,WARN,ERROR}] [--start-method {fork,spawn,forkserver}] [-s | -sa STRAND_SPECIFIC_ASSEMBLIES] [--list LIST] [-l LOG]
                  [--lenient] [-m MINIMUM_CDNA_LENGTH] [-MI MAX_INTRON_LENGTH] [-p PROCS] [-scds] [--labels LABELS] [--codon-table CODON_TABLE] [--single] [-od OUTPUT_DIR] [-o OUT] [-of OUT_FASTA]
                  [--configuration CONFIGURATION] [-er] [--strip-faulty-cds] [--seed SEED | --random-seed]
                  [gff [gff ...]]

positional arguments:
  gff                   Input GFF/GTF file(s).

optional arguments:
  -h, --help            show this help message and exit
  --fasta REFERENCE, --reference REFERENCE
                        Genome FASTA file. Required.
  --verbose
  --quiet
  -lv {DEBUG,INFO,WARN,ERROR}, --log-level {DEBUG,INFO,WARN,ERROR}
                        Log level. Default: derived from the configuration; if absent, INFO
  --start-method {fork,spawn,forkserver}
                        Multiprocessing start method.
  -s, --strand-specific
                        Flag. If set, monoexonic transcripts will be left on their strand rather than being moved to the unknown strand.
  -sa STRAND_SPECIFIC_ASSEMBLIES, --strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
                        Comma-delimited list of strand specific assemblies.
  --list LIST           Tab-delimited file containing rows with the following format: <file> <label> <strandedness(def. False)> <score(optional, def. 0)> <is_reference(optional, def. False)>
                        <exclude_redundant(optional, def. True)> <strip_cds(optional, def. False)> <skip_split(optional, def. False)> "strandedness", "is_reference", "exclude_redundant", "strip_cds" and
                        "skip_split" must be boolean values (True, False) "score" must be a valid floating number.
  -l LOG, --log LOG     Log file. Optional.
  --lenient             Flag. If set, transcripts with only non-canonical splices will be output as well.
  -m MINIMUM_CDNA_LENGTH, --minimum-cdna-length MINIMUM_CDNA_LENGTH
                        Minimum length for transcripts. Default: 200 bps.
  -MI MAX_INTRON_LENGTH, --max-intron-size MAX_INTRON_LENGTH
                        Maximum intron length for transcripts. Default: 1,000,000 bps.
  -p PROCS, --procs PROCS
                        Number of processors to use (default None)
  -scds, --strip_cds    Boolean flag. If set, ignores any CDS/UTR segment.
  --labels LABELS       Labels to attach to the IDs of the transcripts of the input files, separated by comma.
  --codon-table CODON_TABLE
                        Codon table to use. Default: 0 (ie Standard, NCBI #1, but only ATG is considered a valid start codon.
  --single, --single-thread
                        Disable multi-threading. Useful for debugging.
  -od OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
  -o OUT, --out OUT     Output file. Default: mikado_prepared.gtf.
  -of OUT_FASTA, --out_fasta OUT_FASTA
                        Output file. Default: mikado_prepared.fasta.
  --configuration CONFIGURATION, --json-conf CONFIGURATION
                        Configuration file.
  -er, --exclude-redundant
                        Boolean flag. If invoked, Mikado prepare will exclude redundant models,ignoring the per-sample instructions.
  --strip-faulty-cds    Flag. If set, transcripts with an incorrect CDS will be retained but with their CDS stripped. Default behaviour: the whole transcript will be considered invalid and discarded.
  --seed SEED           Random seed number. Default: 0.
  --random-seed         Generate a new random seed number (instead of the default of 0)
Collection of transcripts from the annotation files

Different assemblers will produce data in different formats, typically in GFF or GTF format, and not necessarily in the same order (if any is present). Mikado will serialise the transcripts from these files and port them all into a standard GTF format. Moreover, it will ensure that each transcript ID appears only once across the input files. The optional labels provided for each file will be attached to the transcript names as prefixes, and used as the source field in the output GTF, to ensure the uniqueness of each transcript name. If two or more transcripts are found to be identical, only one will be retained, chosen at random among all the possibilities. In addition to this, Mikado prepare will also sort the transcripts by coordinate, irrespective of strand, so that they are suitably ordered for the divide-et-impera algorithm of Mikado pick.

When two or more identical transcripts are present in a locus, Mikado will use the (optionally provided) source score to select the a priori best assembly amongst the choices. For example, if a Mikado prepare run comprises both PacBio reads and Illumina assemblies and the experimenter has given a score of 1 or more to the former dataset but not the latter, if a PacBio read is present together with a StringTie assembly, the PacBio read will always be selected over the StringTie one. Please note that this “score-based” selection *only operates for transcripts that are identical*. No other selection is performed at this stage.

Warning

To be considered identical, two transcripts must match down to the last base pair. A simple match or containment of the intron chain will not suffice. This is because using the cDNA data alone it is difficult to understand whether the longer form(s) is the correct assembly rather than a chimera or a trans-splice event.

Note

From version 1.3 onwards, Mikado considers the CDS as well when performing the redundancy check. So, two transcripts having the same coordinates but different CDS (because of non-overlapping ORFs or disagreement on the frame and/or start codon position) will be kept as non-redundant.

Note

Transcripts that are considered to come from a “reference” assembly are never going to be excluded, and will always be prioritised over other assemblies.

Removal of redundant transcripts

Many third-party tools, e.g. gffread [GffRead], try to simplify transcript assemblies by grouping transcripts according to their intron chains and then keeping only one transcript per group, usually the longest. This removes transcripts with identical intron chains as well as transcripts whose intron chain is completely contained within another one in the group. In most cases, Mikado explicitly does not take this approach because, especially with RNASeq assemblies, longer transcripts might not necessarily be the most correct; rather, in a non-negligible portion of cases, longer transcripts might have originated by an artefactual fusion of two different, neighbouring transcripts. The implicit assumption made by e.g. gffread (that shorter transcripts are the result of fragmentation of the longer transcripts) would therefore lead to incorrect assemblies. The default approach taken by Mikado, therefore, is to identify cases where transcripts are completely identical (both in terms of cDNA and CDS, if kept), and only remove redundancies in those rare, specific cases.

In certain situations, however, a strategy based on intron chain redundancy like in gffread might be warranted. Specifically:

  • in long read datasets (e.g. PacBio or ONT alignments) the implicit assumption made by gffread is valid: in these cases it is safe to assume that fragmentation during RNA extraction and library preparation would constitute the main origin of redundancy.
  • when dealing with massive transcript datasets (>=5-10 million transcripts), removing excess transcripts might be necessary to keep the analysis manageable, at the cost of slightly reduced accuracy.

Mikado makes it possible to perform this more extensive redundancy removal either on a per-analysis or (recommended) per-sample basis. When scanning the transcript assemblies, Mikado will look for intron chains that are completely contained within another. When such an occurrence arises, if and only if Mikado has been instructed to remove redundant cases, it will do the following:

  • if one of the two transcripts comes from a sample for which the redundancy removal is disabled (including, automatically, all “reference” samples), it will always be kept.
  • if the transcript marked for redundancy check and removal has a lower baseline score or is contained within the other transcript, it will be marked for removal.
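The containment test at the heart of this strategy can be sketched as follows. This is a simplified illustration, not Mikado's actual implementation: an intron chain is represented as a tuple of (start, end) coordinates, and strand, baseline scores and the per-sample settings discussed above are ignored.

```python
def is_contained(chain_a, chain_b):
    """Return True if intron chain `chain_a` appears as a contiguous
    run inside intron chain `chain_b` (identity is a special case).
    Each chain is a tuple of (start, end) intron coordinates."""
    if not chain_a or len(chain_a) > len(chain_b):
        return False
    return any(chain_b[i:i + len(chain_a)] == tuple(chain_a)
               for i in range(len(chain_b) - len(chain_a) + 1))

long_read = ((100, 200), (300, 400), (500, 600))
fragment = ((300, 400), (500, 600))
print(is_contained(fragment, long_read))   # contained: candidate for removal
print(is_contained(((300, 400), (700, 800)), long_read))  # not contained
```

In the real pipeline the contained transcript is only removed when its sample has redundancy removal enabled and it does not come from a “reference” assembly.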
Check on strand correctness

During its run, Mikado prepare will also check the correctness of the transcripts. In particular:

  • Unless the assembly is marked as strand-specific, any monoexonic transcript will have its strand removed.
  • If a transcript contains canonical splice junctions on both strands, it will be completely removed.
  • If a transcript contains only non-canonical splice junctions, it will be removed unless the “lenient” option is specified either at the command line or in the configuration file.

The pairs of splice donors and acceptors which are considered canonical can be specified in the configuration file. By default, Mikado will accept both the properly canonical splicing event (GT-AG) and the semi-canonical events (GC-AG, AT-AC). Any other pair will be considered non-canonical.
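A sketch of the corresponding configuration fragment (the key name is an assumption based on the prepare section of the configuration; the pairs shown are the defaults listed above):

```yaml
prepare:
  canonical:
  - [GT, AG]   # canonical
  - [GC, AG]   # semi-canonical
  - [AT, AC]   # semi-canonical
```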

Warning

Mikado will check the strand of each junction inside a transcript independently. Therefore, if a transcript with 9 junctions on the plus strand is found to have a non-canonical splicing junction which happens to be the reverse complement of a canonical one (e.g. CT-AC), Mikado will deem that junction to be misassigned to the wrong strand and flip it to the minus strand. In this example, the transcript will therefore be considered erroneous, as it contains both + and - junctions. If Mikado prepare is running in lenient mode, such transcripts will be tagged with a “suspicious_splicing” attribute and kept in the output. Otherwise, they will be discarded outright.
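The per-junction strand check can be illustrated with a small sketch (a hypothetical helper, not Mikado's internal code): a junction whose donor-acceptor dinucleotide pair is the reverse complement of a canonical pair is flipped to the minus strand.

```python
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
COMP = str.maketrans("ACGT", "TGCA")

def junction_strand(donor, acceptor):
    """Classify a splice junction by its terminal dinucleotides:
    '+' if the pair is canonical as given, '-' if it is the reverse
    complement of a canonical pair (e.g. CT-AC is GT-AG read on the
    other strand), None if neither."""
    if (donor, acceptor) in CANONICAL:
        return "+"
    flipped = (acceptor.translate(COMP)[::-1], donor.translate(COMP)[::-1])
    if flipped in CANONICAL:
        return "-"
    return None

print(junction_strand("GT", "AG"))  # '+'
print(junction_strand("CT", "AC"))  # '-': flipped to the minus strand
```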

Warning

The same considerations as above apply to transcripts that have no recognisable strand for their junctions. Such transcripts will be removed outright if lenient is False.

Note

Starting from Mikado version 1.3, transcripts can be tagged as being from an assembly of “reference” quality.

This implies that:

  • A transcript which is marked as “reference” will never have its CDS stripped
  • A transcript which is marked as “reference” will never be marked for removal due to redundancy, even if there are multiple copies of it, or if other assemblies with a higher score have identical transcripts (normally only one transcript would be retained, and that would be chosen amongst the highest scoring assemblies)
  • A transcript which is marked as reference will never have its strand removed or flipped.

Please see the configuration help page for details.

Check on ORF correctness

Mikado will check that the input transcripts have a formally valid ORF, both in terms of its structure and of its sequence.

By “formally correct ORF structure”, Mikado means that:

  • CDS segments must be contained within declared exons
  • CDS segments should not overlap each other or any declared UTR segment
  • there should be no gap between CDS segments on a transcript’s cDNA
  • including the initial phase, the length of the CDS should be a multiple of three.

By “formally correct ORF sequence”, Mikado means that:

  • there are no internal in-frame stop codons

To perform the latter check, Mikado will use the specified codon table (by default “0”, ie the NCBI Standard codon table but only considering “ATG” as a valid start codon).

Finally, Mikado will infer whether the transcript has a start and/or stop codon (again, using the selected codon table) and tag it appropriately.
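The checks above can be sketched as follows. This is a simplified illustration assuming the default codon table “0” (only ATG as a valid start) and an initial phase of zero; check_orf is a hypothetical helper, not part of Mikado's API.

```python
STOPS = {"TAA", "TAG", "TGA"}

def check_orf(cds: str):
    """Return a list of formal problems with a CDS sequence: the length
    must be a multiple of three (the real check also accounts for the
    initial phase), the first codon must be ATG, and no codon before
    the last may be a stop codon."""
    issues = []
    if len(cds) % 3 != 0:
        issues.append("length not a multiple of three")
    if not cds.startswith("ATG"):
        issues.append("missing ATG start codon")
    internal = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    if any(codon in STOPS for codon in internal):
        issues.append("internal in-frame stop codon")
    return issues

print(check_orf("ATGGCCTAA"))     # []: a formally valid, complete ORF
print(check_orf("ATGTAACCCTAA"))  # internal stop at the second codon
```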

Output files

Mikado prepare will produce two files:

  • a sorted GTF file, containing all the transcripts surviving the checks
  • a FASTA file of the transcripts, in the proper cDNA orientation.

Warning

Contrary to other tools such as eg gffread [GffRead], Mikado prepare will not try to calculate the loci for the transcripts. This task will be performed later in the pipeline. As such, the GTF file is formally incorrect, as multiple transcripts in the same locus but coming from different assemblies will not share the same gene_id, but will rather keep their original one. Moreover, if two gene_ids were identical but discrete in the input files (ie located on different sections of the genome), this error will not be corrected. If you desire to use this GTF file for any purpose, please use a tool such as gffread to calculate the loci appropriately.

Mikado serialise

Mikado integrates data from multiple sources to select the best transcripts. During this step, these sources are brought together inside a common database, simplifying the process of retrieving them at runtime. Currently, Mikado integrates four different types of data:

  1. reliable junctions, in BED12 format
  2. ORF data, currently identified using TransDecoder or Prodigal. Any program capable of generating it in BED12 or GFF3 format is suitable.
  3. BLASTX data (from [Blastplus] or [Diamond])
  4. Miscellaneous numeric scores can also be given as input to Mikado in a tabular file

After serialisation in the database, these sources will be available for any subsequent Mikado pick run. Having the data present in the database makes it possible to run Mikado with multiple configurations and little overhead in terms of pre-loading data; the Daijin pipeline for driving Mikado takes direct advantage of this feature, running Mikado in multiple modes.

Mikado serialise can use three different SQL databases as backends - SQLite, MySQL and PostgreSQL - thanks to SQLAlchemy. This step, together with the creation of the TransDecoder and BLAST data, is the most time-consuming part of the pipeline. In particular, although Mikado serialise will try to analyse the XML data in a parallelised fashion if so instructed, the insertion of the data into the database will still happen in a single thread and will therefore be of limited speed. If using SQLite as the database (the default option), it is possible to decrease the runtime by modifying the “max_objects” parameter, at the cost however of increased RAM usage.

Important

The schema of Mikado databases changed with version 1.3. Any database created prior to this version should be regenerated, otherwise Mikado pick will fail.

BLAST files

Mikado serialise is capable of using homology results derived from either BLAST+ [Blastplus] or DIAMOND. Further, it is capable of analysing results provided in multiple formats:

  • Tabular BLAST file. This is now the preferred format. The file needs to have been created with the following field specification: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop
  • XML BLAST.
  • ASN files, i.e. the BLAST+ archive format. These will be converted on-the-fly to tabular files with the appropriate keys.
  • DAA files, i.e. the DIAMOND archive format. These will be converted on-the-fly to tabular files with the appropriate keys.
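As a sketch, tabular input with the required custom fields could be generated like this (database names, file paths, thread counts and e-value thresholds are placeholders to adapt to your data):

```shell
# The custom field specification Mikado expects for tabular input:
OUTFMT="6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop"

# BLAST+ (assumes a protein database 'proteins' built with makeblastdb):
blastx -query mikado_prepared.fasta -db proteins -num_threads 10 \
       -max_target_seqs 5 -evalue 1e-6 -outfmt "$OUTFMT" \
       -out mikado_prepared.blast.tsv

# DIAMOND equivalent (note: DIAMOND takes the fields as separate tokens,
# hence the unquoted variable):
diamond blastx --query mikado_prepared.fasta --db proteins.dmnd \
        --threads 10 --max-target-seqs 5 --evalue 1e-6 \
        --outfmt $OUTFMT --out mikado_prepared.blast.tsv
```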

Warning

ASN files will need BLAST+ to be installed in the same environment/instance of Mikado, otherwise the on-the-fly conversion will fail. Likewise, DAA files require DIAMOND to be installed in the same environment/instance of Mikado.

All possible input files can be optionally compressed using GZIP or BZIP2; Mikado will handle the conversion. The only exceptions are DAA files, as these are already compressed by default.

Hint

The most expensive operation in a “Mikado serialise” run is by far the serialisation of BLAST XML files. Using tabular files significantly speeds up the process, as they are much smaller and faster to read and parse. In case XML files are used, splitting the input into multiple chunks, and analysing them separately, allows for better parallelisation. If a single monolithic XML/ASN file is produced, by contrast, Mikado will be quite slow.

Transdecoder ORFs

When Mikado analyses ORFs produced by TransDecoder, Prodigal or an equivalent program, it additionally performs the following checks:

  1. Check the congruence between the length of the transcript in the BED12 file and that found in the FASTA file
  2. Check that the ORF does not contain internal stop codons
  3. Check that the CDS length is valid, ie a multiple of 3, if the ORF is complete
  4. Optionally, if the ORF is open on the 5’ side, Mikado can try to find an internal start codon. See this section for details.

Mikado can accept the data in GFF3 format (the standard for Prodigal) or BED12 (preferred for TransDecoder).

Reliable junctions

Some aligners (e.g. STAR) are capable of returning a list of splice junctions, inferred from the read alignments, that they consider extremely reliable. Likewise, our tool Portcullis is capable of analysing one or multiple BAM alignment files and returning a list of junctions that are well-supported by the data (e.g. high coverage of diverse reads around the junction, with few or no mismatches) and are therefore to be considered more trustworthy than the rest. Mikado actively makes use of this splice junction data to score and select transcripts.

We require the data to be provided in BED12 format (see sidebar for Mikado-specific details). Portcullis provides utilities for converting junction data and/or merging multiple datasets into a single file, through its junctools command line utility.

Additionally, Mikado is capable of interpreting BED12 files of manually curated genes as a source of reliable junctions. Please see the sidebar for details.

Additional scores

Aside from the above datasets, Mikado can further integrate scores from different sources, such as structural coherence with known annotations from other species, coding-potential calculations, or expression data. To do so, Mikado needs a tab-delimited file whose first column, named tid, contains all transcript IDs. As an example of a valid file, with two additional columns (tpm, ie expression data, and CPC, ie coding potential):

tid tpm CPC
at_AT5G66600.2 24260.8 1
at_AT5G66600.3 121.857 1
at_AT5G66600.4 0 1
at_AT5G66600.1 4775.2 1
cuff_cufflinks_star_at.23553.1 6358.42 .7
cl_Chr5.6272 0 .3
cl_Chr5.6271 0 .2

Our pipeline Minos, in development, creates and uses such tables to prioritise transcripts more effectively.
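A minimal sketch of how such a table could be written programmatically, using only the Python standard library (the file name, transcript IDs and score values are illustrative):

```python
import csv

# Hypothetical per-transcript scores gathered elsewhere
scores = {
    "at_AT5G66600.2": {"tpm": 24260.8, "CPC": 1},
    "cl_Chr5.6272": {"tpm": 0, "CPC": 0.3},
}

# The first column must be named "tid" and contain the transcript IDs
with open("external_scores.tsv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["tid", "tpm", "CPC"],
                            delimiter="\t")
    writer.writeheader()
    for tid, row in scores.items():
        writer.writerow({"tid": tid, **row})
```

The resulting file can then be passed to mikado serialise via the external-scores parameter described below.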

Usage

mikado serialise allows some of the parameters present in the configuration file, eg the input files, to be overridden through command line options. Notwithstanding, in the interest of reproducibility we advise configuring everything through the configuration file and supplying it to Mikado serialise without further modifications.

Available parameters:

  • Parameters related to performance:

    • start-method: one of fork, spawn, forkserver. It determines the multiprocessing start method. By default, Mikado will use the default for the system (fork on UNIX, spawn on Windows).
    • procs: Number of processors to use.
    • single-thread: flag. If set, Mikado will disable all multithreading.
    • max_objects: Maximum number of objects to keep in memory before committing to the database. See this section of the configuration for details.
  • Basic input data and settings:

    • output-dir: directory where the SQLite database and the log will be written to.
    • transcripts: these are the input transcripts that are present on the GTF file considered by Mikado. Normally this should be the output of Mikado prepare.
    • genome_fai: FAIDX file of the genome FASTA. If not given, serialise will derive it from the “reference: genome” field of the configuration.
    • force: flag. If set, and the database is already present, it will be truncated rather than updated.
    • json-conf: this is the configuration file created with Mikado configure.
    • db: if the database is specified on the command line, mikado serialise will interpret it as a SQLite database. This will overwrite any setting present in the configuration file.
  • Parameters related to logging:

    • log: log file. It defaults to serialise.log.
    • log_level: verbosity of the logging. Please be advised that excessive verbosity can negatively impact the performance of the program - the debug mode is extremely verbose.
  • Parameters related to reliable junctions:

    • junctions: reliable junction files in BED12 format, separated by comma.

  • Parameters related to the treatment of ORF data:

    • orfs: ORF BED12 / GFF3 files, separated by comma.
    • max-regression: A percentage, expressed as a number between 0 and 1, which indicates how far can Mikado regress along the ORF to find a valid start codon. See the relative section in the configuration for details.
    • no-start-adjustment: if selected, Mikado will not try to correct the start position in ORFs and will consider them as provided.
    • codon-table: this parameter specifies the codon table to use for the project. Mikado by default uses the NCBI codon table 1 (standard with eukaryotes) with the modification that only ATG is considered as a valid start codon, as ORF predictions usually inflate the number of non-standard starts.
  • Parameters related to BLAST data:

    • blast_targets: BLAST FASTA database.
    • xml: BLAST files to parse. This can be either a comma-separated list of valid files, or a comma-separated list of folders containing valid files. Please note that, notwithstanding the name of the flag, Mikado will also accept valid TSV files here (see above).
    • max-target-seqs: maximum number of BLAST targets that can be loaded per sequence, for each BLAST alignment. Please note that if you align against multiple databases, this threshold will be applied once per file.
  • Parameters related to additional scores:
    • external-scores: a tab-delimited file of additional scores for the transcripts; one row per transcript.

Warning

It is advised to set the start-method parameter to spawn even on UNIX. See the dedicated sidebar for details.

Usage:

$ mikado serialise --help
usage: Mikado serialise [-h] [--start-method {fork,spawn,forkserver}] [--orfs ORFS] [--transcripts TRANSCRIPTS] [-mr MAX_REGRESSION] [--codon-table CODON_TABLE] [-nsa] [--max-target-seqs MAX_TARGET_SEQS]
                    [-bt BLAST_TARGETS] [--xml XML] [-p PROCS] [--single-thread] [--genome_fai GENOME_FAI] [--genome GENOME] [--junctions JUNCTIONS] [--external-scores EXTERNAL_SCORES] [-mo MAX_OBJECTS]
                    [--no-force | --force] [--configuration CONFIGURATION] [-l [LOG]] [-od OUTPUT_DIR] [-lv {DEBUG,INFO,WARN,ERROR} | --verbose | --quiet | --blast-loading-debug]
                    [--seed SEED | --random-seed]
                    [db]

optional arguments:
  -h, --help            show this help message and exit
  --start-method {fork,spawn,forkserver}
                        Multiprocessing start method.
  --no-force            Flag. If set, do not drop the contents of an existing Mikado DB before beginning the serialisation.
  --force               Flag. If set, an existing database will be deleted (sqlite) or dropped (MySQL/PostGreSQL) before beginning the serialisation.
  -od OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
  -lv {DEBUG,INFO,WARN,ERROR}, --log-level {DEBUG,INFO,WARN,ERROR}
                        Log level. Default: derived from the configuration; if absent, INFO
  --verbose
  --quiet
  --blast-loading-debug
                        Flag. If set, Mikado will switch on the debug mode for the XML/TSV loading.
  --seed SEED           Random seed number. Default: 0.
  --random-seed         Generate a new random seed number (instead of the default of 0)

  --orfs ORFS           ORF BED file(s), separated by commas
  --transcripts TRANSCRIPTS
                        Transcript FASTA file(s) used for ORF calling and BLAST queries, separated by commas. If multiple files are given, they must be in the same order of the ORF files. E.g. valid command
                        lines are: --transcript_fasta all_seqs1.fasta --orfs all_orfs.bed --transcript_fasta seq1.fasta,seq2.fasta --orfs orfs1.bed,orf2.bed --transcript_fasta all_seqs.fasta --orfs
                        orfs1.bed,orf2.bed These are invalid instead: # Inverted order --transcript_fasta seq1.fasta,seq2.fasta --orfs orfs2.bed,orf1.bed #Two transcript files, one ORF file
                        --transcript_fasta seq1.fasta,seq2.fasta --orfs all_orfs.bed
  -mr MAX_REGRESSION, --max-regression MAX_REGRESSION
                        Amount of sequence in the ORF (in %) to backtrack in order to find a valid START codon, if one is absent. Default: None
  --codon-table CODON_TABLE
                        Codon table to use. Default: 0 (ie Standard, NCBI #1, but only ATG is considered a valid start codon).
  -nsa, --no-start-adjustment
                        Disable the start adjustment algorithm. Useful when using e.g. TransDecoder vs 5+.

  --max-target-seqs MAX_TARGET_SEQS
                        Maximum number of target sequences.
  -bt BLAST_TARGETS, --blast-targets BLAST_TARGETS, --blast_targets BLAST_TARGETS
                        Target sequences
  --xml XML, --tsv XML  BLAST file(s) to parse. They can be provided in three ways: - a comma-separated list - as a base folder - using bash-like name expansion (*,?, etc.). In this case, you have to enclose
                        the filename pattern in double quotes. Multiple folders/file patterns can be given, separated by a comma. BLAST files must be either of two formats: - BLAST XML - BLAST tabular
                        format, with the following **custom** fields: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop
  -p PROCS, --procs PROCS
                        Number of threads to use for analysing the BLAST files. This number should not be higher than the total number of XML files.
  --single-thread       Force serialise to run with a single thread, irrespective of other configuration options.

  --genome_fai GENOME_FAI
  --genome GENOME
  --junctions JUNCTIONS

  --external-scores EXTERNAL_SCORES
                        Tabular file containing external scores for the transcripts. Each column should have a distinct name, and transcripts have to be listed on the first column.

  -mo MAX_OBJECTS, --max-objects MAX_OBJECTS
                        Maximum number of objects to cache in memory before committing to the database. Default: 100,000 i.e. approximately 450MB RAM usage for Drosophila.
  --configuration CONFIGURATION, --json-conf CONFIGURATION
  -l [LOG], --log [LOG]
                        Optional log file. Default: stderr
  db                    Optional output database. Default: derived from configuration
Technical details

The schema of the database is quite simple, as it is composed of only 9 discrete tables in two groups. The first group, chrom and junctions, serialises the information pertaining to the reliable junctions - ie information relative not to the transcripts themselves but to their genomic locations. The second group serialises the data regarding ORFs, BLAST files and external arbitrary data. The use of a database is mainly driven by the need to avoid recalculating all necessary information at runtime every time mikado pick is launched.

Database schema used by Mikado.

_images/database_schema.png
Mikado pick

This is the final stage of the pipeline, in which Mikado identifies gene loci and selects the best transcripts.

Input files

mikado pick requires as input files the following:

  1. A sorted GTF file with unique transcript names, derived through the prepare stage.
  2. A database containing all the external data to be integrated with the transcript structure, derived through the serialisation stage.
  3. A scoring file specifying the minimum requirements for transcripts and the relevant metrics for scoring. See the section on scoring files for details.
Output files

mikado pick will produce three kinds of output files: a GFF3 file, a metrics file, and a scores file. This triad will be produced for the loci level, and optionally also for the subloci and monoloci level.

GFF3 files

This output file is a standard-compliant GFF3, with the addition of superlocus features to indicate the original spans. An example with two superloci, one on the negative and one on the positive strand, follows:

##gff-version 3
##sequence-region Chr5 1 26975502
Chr5    Mikado_loci     superlocus      26584796        26601707        .       +       .       ID=Mikado_superlocus:Chr5+:26584796-26601707;Name=superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     gene    26584796        26587912        23      +       .       ID=mikado.Chr5G1;Name=mikado.Chr5G1;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26584796        26587912        24      +       .       ID=mikado.Chr5G1.2;Parent=mikado.Chr5G1;Name=mikado.Chr5G1.2;alias=st_Stringtie_STAR.21710.1;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;ccode=j;cov=25.165945;primary=False
Chr5    Mikado_loci     exon    26584796        26584879        .       +       .       ID=mikado.Chr5G1.2.exon1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     five_prime_UTR  26584796        26584879        .       +       .       ID=mikado.Chr5G1.2.five_prime_UTR1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26585220        26585273        .       +       .       ID=mikado.Chr5G1.2.exon2;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     five_prime_UTR  26585220        26585222        .       +       .       ID=mikado.Chr5G1.2.five_prime_UTR2;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26585223        26585273        .       +       0       ID=mikado.Chr5G1.2.CDS1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26585345        26585889        .       +       0       ID=mikado.Chr5G1.2.CDS2;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26585345        26585889        .       +       .       ID=mikado.Chr5G1.2.exon3;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26585982        26586102        .       +       1       ID=mikado.Chr5G1.2.CDS3;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26585982        26586102        .       +       .       ID=mikado.Chr5G1.2.exon4;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586217        26586294        .       +       0       ID=mikado.Chr5G1.2.CDS4;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586217        26586294        .       +       .       ID=mikado.Chr5G1.2.exon5;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586420        26586524        .       +       0       ID=mikado.Chr5G1.2.CDS5;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586420        26586524        .       +       .       ID=mikado.Chr5G1.2.exon6;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586638        26586850        .       +       0       ID=mikado.Chr5G1.2.CDS6;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586638        26586850        .       +       .       ID=mikado.Chr5G1.2.exon7;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586934        26586996        .       +       0       ID=mikado.Chr5G1.2.CDS7;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586934        26586996        .       +       .       ID=mikado.Chr5G1.2.exon8;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26587084        26587202        .       +       0       ID=mikado.Chr5G1.2.CDS8;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26587084        26587202        .       +       .       ID=mikado.Chr5G1.2.exon9;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26587287        26587345        .       +       1       ID=mikado.Chr5G1.2.CDS9;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26587287        26587345        .       +       .       ID=mikado.Chr5G1.2.exon10;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26587427        26587755        .       +       2       ID=mikado.Chr5G1.2.CDS10;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26587427        26587912        .       +       .       ID=mikado.Chr5G1.2.exon11;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     three_prime_UTR 26587756        26587912        .       +       .       ID=mikado.Chr5G1.2.three_prime_UTR1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     mRNA    26584930        26587912        23      +       .       ID=mikado.Chr5G1.1;Parent=mikado.Chr5G1;Name=mikado.Chr5G1.1;alias=st_Stringtie_STAR.21710.3;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;cov=2.207630;primary=True
Chr5    Mikado_loci     exon    26584930        26585023        .       +       .       ID=mikado.Chr5G1.1.exon1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     five_prime_UTR  26584930        26585023        .       +       .       ID=mikado.Chr5G1.1.five_prime_UTR1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26585220        26585273        .       +       .       ID=mikado.Chr5G1.1.exon2;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     five_prime_UTR  26585220        26585222        .       +       .       ID=mikado.Chr5G1.1.five_prime_UTR2;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26585223        26585273        .       +       0       ID=mikado.Chr5G1.1.CDS1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26585345        26585889        .       +       0       ID=mikado.Chr5G1.1.CDS2;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26585345        26585889        .       +       .       ID=mikado.Chr5G1.1.exon3;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26585982        26586102        .       +       1       ID=mikado.Chr5G1.1.CDS3;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26585982        26586102        .       +       .       ID=mikado.Chr5G1.1.exon4;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586217        26586294        .       +       0       ID=mikado.Chr5G1.1.CDS4;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586217        26586294        .       +       .       ID=mikado.Chr5G1.1.exon5;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586420        26586524        .       +       0       ID=mikado.Chr5G1.1.CDS5;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586420        26586524        .       +       .       ID=mikado.Chr5G1.1.exon6;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586638        26586850        .       +       0       ID=mikado.Chr5G1.1.CDS6;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586638        26586850        .       +       .       ID=mikado.Chr5G1.1.exon7;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586934        26586996        .       +       0       ID=mikado.Chr5G1.1.CDS7;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586934        26586996        .       +       .       ID=mikado.Chr5G1.1.exon8;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26587084        26587202        .       +       0       ID=mikado.Chr5G1.1.CDS8;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26587084        26587202        .       +       .       ID=mikado.Chr5G1.1.exon9;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26587287        26587345        .       +       1       ID=mikado.Chr5G1.1.CDS9;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26587287        26587345        .       +       .       ID=mikado.Chr5G1.1.exon10;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26587427        26587755        .       +       2       ID=mikado.Chr5G1.1.CDS10;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26587427        26587912        .       +       .       ID=mikado.Chr5G1.1.exon11;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     three_prime_UTR 26587756        26587912        .       +       .       ID=mikado.Chr5G1.1.three_prime_UTR1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     gene    26588402        26592561        20      +       .       ID=mikado.Chr5G2;Name=mikado.Chr5G2;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26588402        26592561        24      +       .       ID=mikado.Chr5G2.2;Parent=mikado.Chr5G2;Name=mikado.Chr5G2.2;alias=st_Stringtie_STAR.21710.9.split1;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;ccode=j;cov=0.000000;primary=False
Chr5    Mikado_loci     exon    26588402        26588625        .       +       .       ID=mikado.Chr5G2.2.exon1;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     five_prime_UTR  26588402        26588625        .       +       .       ID=mikado.Chr5G2.2.five_prime_UTR1;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26589203        26589279        .       +       .       ID=mikado.Chr5G2.2.exon2;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     five_prime_UTR  26589203        26589237        .       +       .       ID=mikado.Chr5G2.2.five_prime_UTR2;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26589238        26589279        .       +       0       ID=mikado.Chr5G2.2.CDS1;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26589386        26590167        .       +       0       ID=mikado.Chr5G2.2.CDS2;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26589386        26590167        .       +       .       ID=mikado.Chr5G2.2.exon3;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590261        26590393        .       +       1       ID=mikado.Chr5G2.2.CDS3;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590261        26590393        .       +       .       ID=mikado.Chr5G2.2.exon4;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590495        26590566        .       +       0       ID=mikado.Chr5G2.2.CDS4;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590495        26590566        .       +       .       ID=mikado.Chr5G2.2.exon5;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590641        26590739        .       +       0       ID=mikado.Chr5G2.2.CDS5;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590641        26590739        .       +       .       ID=mikado.Chr5G2.2.exon6;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590880        26591092        .       +       0       ID=mikado.Chr5G2.2.CDS6;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590880        26591092        .       +       .       ID=mikado.Chr5G2.2.exon7;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591174        26591236        .       +       0       ID=mikado.Chr5G2.2.CDS7;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591174        26591236        .       +       .       ID=mikado.Chr5G2.2.exon8;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591324        26591442        .       +       0       ID=mikado.Chr5G2.2.CDS8;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591324        26591442        .       +       .       ID=mikado.Chr5G2.2.exon9;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591520        26591578        .       +       1       ID=mikado.Chr5G2.2.CDS9;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591520        26591578        .       +       .       ID=mikado.Chr5G2.2.exon10;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591681        26592002        .       +       2       ID=mikado.Chr5G2.2.CDS10;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591681        26592002        .       +       .       ID=mikado.Chr5G2.2.exon11;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26592528        26592561        .       +       1       ID=mikado.Chr5G2.2.CDS11;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26592528        26592561        .       +       .       ID=mikado.Chr5G2.2.exon12;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     mRNA    26588402        26592561        20      +       .       ID=mikado.Chr5G2.1;Parent=mikado.Chr5G2;Name=mikado.Chr5G2.1;alias=st_Stringtie_STAR.21710.6.split3;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5    Mikado_loci     exon    26588402        26588625        .       +       .       ID=mikado.Chr5G2.1.exon1;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     five_prime_UTR  26588402        26588625        .       +       .       ID=mikado.Chr5G2.1.five_prime_UTR1;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26589196        26589279        .       +       .       ID=mikado.Chr5G2.1.exon2;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     five_prime_UTR  26589196        26589237        .       +       .       ID=mikado.Chr5G2.1.five_prime_UTR2;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26589238        26589279        .       +       0       ID=mikado.Chr5G2.1.CDS1;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26589386        26590167        .       +       0       ID=mikado.Chr5G2.1.CDS2;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26589386        26590167        .       +       .       ID=mikado.Chr5G2.1.exon3;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590261        26590393        .       +       1       ID=mikado.Chr5G2.1.CDS3;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590261        26590393        .       +       .       ID=mikado.Chr5G2.1.exon4;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590495        26590566        .       +       0       ID=mikado.Chr5G2.1.CDS4;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590495        26590566        .       +       .       ID=mikado.Chr5G2.1.exon5;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590641        26590739        .       +       0       ID=mikado.Chr5G2.1.CDS5;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590641        26590739        .       +       .       ID=mikado.Chr5G2.1.exon6;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590880        26591092        .       +       0       ID=mikado.Chr5G2.1.CDS6;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590880        26591092        .       +       .       ID=mikado.Chr5G2.1.exon7;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591174        26591236        .       +       0       ID=mikado.Chr5G2.1.CDS7;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591174        26591236        .       +       .       ID=mikado.Chr5G2.1.exon8;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591324        26591442        .       +       0       ID=mikado.Chr5G2.1.CDS8;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591324        26591442        .       +       .       ID=mikado.Chr5G2.1.exon9;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591520        26591578        .       +       1       ID=mikado.Chr5G2.1.CDS9;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591520        26591578        .       +       .       ID=mikado.Chr5G2.1.exon10;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591681        26592002        .       +       2       ID=mikado.Chr5G2.1.CDS10;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591681        26592002        .       +       .       ID=mikado.Chr5G2.1.exon11;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26592528        26592561        .       +       1       ID=mikado.Chr5G2.1.CDS11;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26592528        26592561        .       +       .       ID=mikado.Chr5G2.1.exon12;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     gene    26592649        26595691        19      +       .       ID=mikado.Chr5G3;Name=mikado.Chr5G3;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26592720        26595691        19      +       .       ID=mikado.Chr5G3.1;Parent=mikado.Chr5G3;Name=mikado.Chr5G3.1;alias=st_Stringtie_STAR.21710.7.split2;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20;canonical_number=20;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5    Mikado_loci     CDS     26592720        26593365        .       +       0       ID=mikado.Chr5G3.1.CDS1;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26592720        26593365        .       +       .       ID=mikado.Chr5G3.1.exon1;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26593449        26593836        .       +       2       ID=mikado.Chr5G3.1.CDS2;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26593449        26593836        .       +       .       ID=mikado.Chr5G3.1.exon2;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26593930        26594062        .       +       1       ID=mikado.Chr5G3.1.CDS3;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26593930        26594062        .       +       .       ID=mikado.Chr5G3.1.exon3;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594172        26594243        .       +       0       ID=mikado.Chr5G3.1.CDS4;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594172        26594243        .       +       .       ID=mikado.Chr5G3.1.exon4;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594318        26594416        .       +       0       ID=mikado.Chr5G3.1.CDS5;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594318        26594416        .       +       .       ID=mikado.Chr5G3.1.exon5;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594569        26594772        .       +       0       ID=mikado.Chr5G3.1.CDS6;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594569        26594772        .       +       .       ID=mikado.Chr5G3.1.exon6;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594860        26594922        .       +       0       ID=mikado.Chr5G3.1.CDS7;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594860        26594922        .       +       .       ID=mikado.Chr5G3.1.exon7;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26595003        26595121        .       +       0       ID=mikado.Chr5G3.1.CDS8;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26595003        26595121        .       +       .       ID=mikado.Chr5G3.1.exon8;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26595210        26595268        .       +       1       ID=mikado.Chr5G3.1.CDS9;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26595210        26595268        .       +       .       ID=mikado.Chr5G3.1.exon9;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26595366        26595691        .       +       2       ID=mikado.Chr5G3.1.CDS10;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26595366        26595691        .       +       .       ID=mikado.Chr5G3.1.exon10;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     mRNA    26592649        26595268        21      +       .       ID=mikado.Chr5G3.2;Parent=mikado.Chr5G3;Name=mikado.Chr5G3.2;abundance=2.390309;alias=cl_Chr5.6283;canonical_junctions=1,2,3,4,5,6,7,8;canonical_number=8;canonical_proportion=1.0;ccode=j;primary=False
Chr5    Mikado_loci     exon    26592649        26593365        .       +       .       ID=mikado.Chr5G3.2.exon1;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     five_prime_UTR  26592649        26592719        .       +       .       ID=mikado.Chr5G3.2.five_prime_UTR1;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26592720        26593365        .       +       0       ID=mikado.Chr5G3.2.CDS1;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26593449        26593836        .       +       2       ID=mikado.Chr5G3.2.CDS2;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26593449        26593836        .       +       .       ID=mikado.Chr5G3.2.exon2;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26593930        26594095        .       +       1       ID=mikado.Chr5G3.2.CDS3;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26593930        26594095        .       +       .       ID=mikado.Chr5G3.2.exon3;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594172        26594243        .       +       0       ID=mikado.Chr5G3.2.CDS4;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594172        26594243        .       +       .       ID=mikado.Chr5G3.2.exon4;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594318        26594416        .       +       0       ID=mikado.Chr5G3.2.CDS5;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594318        26594416        .       +       .       ID=mikado.Chr5G3.2.exon5;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594569        26594772        .       +       0       ID=mikado.Chr5G3.2.CDS6;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594569        26594772        .       +       .       ID=mikado.Chr5G3.2.exon6;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594860        26594922        .       +       0       ID=mikado.Chr5G3.2.CDS7;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594860        26594922        .       +       .       ID=mikado.Chr5G3.2.exon7;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26595003        26595121        .       +       0       ID=mikado.Chr5G3.2.CDS8;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26595003        26595121        .       +       .       ID=mikado.Chr5G3.2.exon8;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26595210        26595268        .       +       1       ID=mikado.Chr5G3.2.CDS9;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26595210        26595268        .       +       .       ID=mikado.Chr5G3.2.exon9;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     gene    26596207        26598231        20      +       .       ID=mikado.Chr5G4;Name=mikado.Chr5G4;multiexonic=False;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26596207        26598231        20      +       .       ID=mikado.Chr5G4.1;Parent=mikado.Chr5G4;Name=mikado.Chr5G4.1;alias=st_Stringtie_STAR.21710.6.split3;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5    Mikado_loci     CDS     26596207        26598192        .       +       0       ID=mikado.Chr5G4.1.CDS1;Parent=mikado.Chr5G4.1
Chr5    Mikado_loci     exon    26596207        26598231        .       +       .       ID=mikado.Chr5G4.1.exon1;Parent=mikado.Chr5G4.1
Chr5    Mikado_loci     three_prime_UTR 26598193        26598231        .       +       .       ID=mikado.Chr5G4.1.three_prime_UTR1;Parent=mikado.Chr5G4.1
Chr5    Mikado_loci     gene    26599417        26601137        20      +       .       ID=mikado.Chr5G5;Name=mikado.Chr5G5;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26599417        26601137        20      +       .       ID=mikado.Chr5G5.1;Parent=mikado.Chr5G5;Name=mikado.Chr5G5.1;abundance=0.371780;alias=cl_Chr5.6286;canonical_junctions=1,2,3,4,5,6;canonical_number=6;canonical_proportion=1.0;primary=True
Chr5    Mikado_loci     exon    26599417        26599654        .       +       .       ID=mikado.Chr5G5.1.exon1;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     five_prime_UTR  26599417        26599612        .       +       .       ID=mikado.Chr5G5.1.five_prime_UTR1;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26599613        26599654        .       +       0       ID=mikado.Chr5G5.1.CDS1;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26599767        26600053        .       +       0       ID=mikado.Chr5G5.1.CDS2;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26599767        26600053        .       +       .       ID=mikado.Chr5G5.1.exon2;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600151        26600244        .       +       1       ID=mikado.Chr5G5.1.CDS3;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600151        26600244        .       +       .       ID=mikado.Chr5G5.1.exon3;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600314        26600394        .       +       0       ID=mikado.Chr5G5.1.CDS4;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600314        26600394        .       +       .       ID=mikado.Chr5G5.1.exon4;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600497        26600616        .       +       0       ID=mikado.Chr5G5.1.CDS5;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600497        26600616        .       +       .       ID=mikado.Chr5G5.1.exon5;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600696        26600908        .       +       0       ID=mikado.Chr5G5.1.CDS6;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600696        26600908        .       +       .       ID=mikado.Chr5G5.1.exon6;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600987        26601085        .       +       0       ID=mikado.Chr5G5.1.CDS7;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600987        26601137        .       +       .       ID=mikado.Chr5G5.1.exon7;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     three_prime_UTR 26601086        26601137        .       +       .       ID=mikado.Chr5G5.1.three_prime_UTR1;Parent=mikado.Chr5G5.1
###
Chr5    Mikado_loci     superlocus      26575364        26579730        .       -       .       ID=Mikado_superlocus:Chr5-:26575364-26579730;Name=superlocus:Chr5-:26575364-26579730
Chr5    Mikado_loci     ncRNA_gene      26575364        26579730        18      -       .       ID=mikado.Chr5G6;Name=mikado.Chr5G6;multiexonic=True;superlocus=Mikado_superlocus:Chr5-:26575364-26579730
Chr5    Mikado_loci     ncRNA   26575711        26579730        18      -       .       ID=mikado.Chr5G6.1;Parent=mikado.Chr5G6;Name=cl_Chr5.6271;abundance=1.141582;alias=cl_Chr5.6271;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;primary=True
Chr5    Mikado_loci     exon    26575711        26575797        .       -       .       ID=mikado.Chr5G6.1.exon1;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26575885        26575944        .       -       .       ID=mikado.Chr5G6.1.exon2;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26576035        26576134        .       -       .       ID=mikado.Chr5G6.1.exon3;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26576261        26577069        .       -       .       ID=mikado.Chr5G6.1.exon4;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26577163        26577288        .       -       .       ID=mikado.Chr5G6.1.exon5;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26577378        26577449        .       -       .       ID=mikado.Chr5G6.1.exon6;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26577856        26577937        .       -       .       ID=mikado.Chr5G6.1.exon7;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26578239        26578792        .       -       .       ID=mikado.Chr5G6.1.exon8;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26579079        26579161        .       -       .       ID=mikado.Chr5G6.1.exon9;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26579301        26579395        .       -       .       ID=mikado.Chr5G6.1.exon10;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26579602        26579730        .       -       .       ID=mikado.Chr5G6.1.exon11;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     ncRNA   26578496        26579563        13      -       .       ID=mikado.Chr5G6.3;Parent=mikado.Chr5G6;Name=tr_c73_g1_i1.mrna1.160;alias=tr_c73_g1_i1.mrna1.160;canonical_junctions=1;canonical_number=1;canonical_proportion=1.0;ccode=j;gene_name=c73_g1_i1;primary=False
Chr5    Mikado_loci     exon    26578496        26578518        .       -       .       ID=mikado.Chr5G6.3.exon1;Parent=mikado.Chr5G6.3
Chr5    Mikado_loci     exon    26579301        26579563        .       -       .       ID=mikado.Chr5G6.3.exon2;Parent=mikado.Chr5G6.3
Chr5    Mikado_loci     ncRNA   26575364        26578163        16      -       .       ID=mikado.Chr5G6.2;Parent=mikado.Chr5G6;Name=cuff_cufflinks_star_at.23553.1;alias=cuff_cufflinks_star_at.23553.1;fpkm=2.9700103727;canonical_junctions=1,2,3,4,5,6,7,8;canonical_number=8;canonical_proportion=1.0;ccode=j;conf_hi=3.260618;conf_lo=2.679403;cov=81.895309;frac=0.732092;primary=False
Chr5    Mikado_loci     exon    26575364        26575410        .       -       .       ID=mikado.Chr5G6.2.exon1;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26575495        26575620        .       -       .       ID=mikado.Chr5G6.2.exon2;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26575711        26575797        .       -       .       ID=mikado.Chr5G6.2.exon3;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26575885        26575944        .       -       .       ID=mikado.Chr5G6.2.exon4;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26576035        26576134        .       -       .       ID=mikado.Chr5G6.2.exon5;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26576261        26577069        .       -       .       ID=mikado.Chr5G6.2.exon6;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26577163        26577288        .       -       .       ID=mikado.Chr5G6.2.exon7;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26577378        26577449        .       -       .       ID=mikado.Chr5G6.2.exon8;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26577856        26578163        .       -       .       ID=mikado.Chr5G6.2.exon9;Parent=mikado.Chr5G6.2
###
Things to note:
  • Multiple RNAs of the same gene are distinguished by progressive enumeration after a “.” (e.g. mikado.Chr5G5.1, mikado.Chr5G5.2, etc.).
  • All RNAs retain their original name under the “alias” attribute. If a transcript was split due to the presence of multiple ORFs, its alias will end with “.split<progressive ID>”.
  • RNAs carry the boolean attribute “primary”, which identifies them as either the primary transcript of the gene or an alternative splicing isoform.
  • Non-primary RNAs have the additional “ccode” attribute, which reports the class code assigned to them when they were compared to the primary transcript.
  • Multiexonic RNAs have the attributes “canonical_junctions”, “canonical_number” and “canonical_proportion” assigned to them. These properties are calculated by Mikado during the prepare stage.
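These attributes live in the ninth column of the GFF3 records shown above; a minimal sketch of how they can be extracted programmatically (a generic GFF3 attribute parser written for illustration, not part of Mikado's API):

```python
def parse_gff3_attributes(field):
    """Split the ninth GFF3 column into a key/value dictionary."""
    attributes = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = value
    return attributes

# Attribute string taken from the mikado.Chr5G3.2 mRNA line above (abridged).
line_attrs = parse_gff3_attributes(
    "ID=mikado.Chr5G3.2;Parent=mikado.Chr5G3;alias=cl_Chr5.6283;ccode=j;primary=False"
)
is_primary = line_attrs["primary"] == "True"
```

Note that GFF3 boolean attributes are plain strings, so a comparison such as the one above is needed to recover the flag.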
Metrics files

These are tabular files that report the raw values of each metric for every transcript. This is the section of the metrics file corresponding to the GFF3 file above:

tid parent  score   best_bits       blast_score     canonical_intron_proportion     cdna_length     cds_not_maximal cds_not_maximal_fraction        combined_cds_fraction   combined_cds_intron_fraction    combined_cds_length     combined_cds_num        combined_cds_num_fraction       combined_utr_fraction   combined_utr_length     end_distance_from_junction      end_distance_from_tes   exon_fraction   exon_num        five_utr_length five_utr_num    five_utr_num_complete   has_start_codon has_stop_codon  highest_cds_exon_number highest_cds_exons_num   intron_fraction is_complete     max_intron_length       min_intron_length       non_verified_introns_num        num_introns_greater_than_max    num_introns_smaller_than_min    number_internal_orfs    proportion_verified_introns     proportion_verified_introns_inlocus     retained_fraction       retained_intron_num     selected_cds_exons_fraction     selected_cds_fraction   selected_cds_intron_fraction    selected_cds_length     selected_cds_num        selected_cds_number_fraction    selected_end_distance_from_junction     selected_end_distance_from_tes  selected_start_distance_from_tss        snowy_blast_score       source_score    start_distance_from_tss three_utr_length        three_utr_num   three_utr_num_complete  utr_fraction    utr_length      utr_num utr_num_complete        verified_introns_num
mikado.Chr5G1.2     mikado.Chr5G1   19.0    1086.25 1086.25 1.0     1927    0       0.0     0.87    1.0     1683    10      0.91    0.13    244     0       157     0.92    11      87      2       1       True    True    10      10      0.91    True    340     71      10      0       0       1       0.0     0       0.0     0       0.91    0.87    1.0     1683    10      0.91    0       157     87      13.78   0       87      157     1       0       0.13    244     3       1       0
mikado.Chr5G1.1     mikado.Chr5G1   21.89   1086.63 1086.63 1.0     1937    0       0.0     0.87    1.0     1683    10      0.91    0.13    254     0       157     0.92    11      97      2       1       True    True    10      10      0.91    True    196     71      10      0       0       1       0.0     0       0.0     0       0.91    0.87    1.0     1683    10      0.91    0       157     97      13.78   0       97      157     1       0       0.13    254     3       1       0
mikado.Chr5G2.2     mikado.Chr5G2   19.04   1140.95 1140.95 1.0     2197    0       0.0     0.88    1.0     1938    11      0.92    0.12    259     0       0       0.92    12      259     2       1       True    True    11      11      0.92    True    577     74      11      0       0       1       0.0     0       0.0     0       0.92    0.88    1.0     1938    11      0.92    0       0       259     16.66   0       259     0       0       0       0.12    259     2       1       0
mikado.Chr5G2.1     mikado.Chr5G2   20.06   1140.95 1140.95 1.0     2204    0       0.0     0.88    1.0     1938    11      0.92    0.12    266     0       0       0.92    12      266     2       1       True    True    11      11      0.92    True    570     74      11      0       0       1       0.0     0       0.0     0       0.92    0.88    1.0     1938    11      0.92    0       0       266     16.66   0       266     0       0       0       0.12    266     2       1       0
mikado.Chr5G3.2     mikado.Chr5G3   8.59    1193.72 1193.72 1.0     1887    0       0.0     0.96    0.8     1816    9       1.0     0.04    71      0       0       0.75    9       71      1       0       True    False   9       9       0.8     False   152     74      8       0       0       1       0.0     0       0.0     0       1.0     0.96    0.8     1816    9       1.0     0       0       71      14.16   0       71      0       0       0       0.04    71      1       0       0
mikado.Chr5G3.1     mikado.Chr5G3   19.0    1353.19 1353.19 1.0     2109    0       0.0     1.0     0.9     2109    10      1.0     0.0     0       0       0       0.83    10      0       0       0       True    True    10      10      0.9     True    152     74      9       0       0       1       0.0     0       0.0     0       1.0     1.0     0.9     2109    10      1.0     0       0       0       16.66   0       0       0       0       0       0.0     0       0       0       0
mikado.Chr5G4.1     mikado.Chr5G4   20.0    1258.43 1258.43 1.0     2025    0       0.0     0.98    0       1986    1       1.0     0.02    39      0       39      1.0     1       0       0       0       True    True    1       1       0       True    0       0       0       0       0       1       0       0       0.0     0       1.0     0.98    0       1986    1       1.0     0       39      0       16.66   0       0       39      1       0       0.02    39      1       0       0
mikado.Chr5G5.1     mikado.Chr5G5   20.0    565.46  565.46  1.0     1184    0       0.0     0.79    1.0     936     7       1.0     0.21    248     0       52      1.0     7       196     1       0       True    True    7       7       1.0     True    112     69      6       0       0       1       0.0     0       0.0     0       1.0     0.79    1.0     936     7       1.0     0       52      196     13.67   0       196     52      1       0       0.21    248     2       0       0
mikado.Chr5G6.2     mikado.Chr5G6   17.1    0       0       1.0     1735    0       0       0.0     0       0       0       0.0     1.0     0       0       0       0.56    9       0       0       0       False   False   0       0       0.62    False   406     84      0       0       0       0       1.0     0.89    0.0     0       0.0     0.0     0       0       0       0.0     0       0       0       0       0       0       0       0       0       1.0     0       0       0       8
mikado.Chr5G6.1     mikado.Chr5G6   17.9    0       0       1.0     2197    0       0       0.0     0       0       0       0.0     1.0     0       0       0       0.69    11      0       0       0       False   False   0       0       0.77    False   406     87      3       0       0       0       0.9     1.0     0.0     0       0.0     0.0     0       0       0       0.0     0       0       0       0       0       0       0       0       0       1.0     0       0       0       9
mikado.Chr5G6.3     mikado.Chr5G6   13.0    0       0       1.0     286     0       0       0.0     0       0       0       0.0     1.0     0       0       0       0.12    2       0       0       0       False   False   0       0       0.08    False   782     782     1       0       0       0       0.0     0.0     0.0     0       0.0     0.0     0       0       0       0.0     0       0       0       0       0       0       0       0       0       1.0     0       0       0       0

As can be seen, metrics can assume values across very different ranges. We direct you to the metrics section of the documentation for further details.
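Since the metrics file is plain tab-separated text, it can be inspected with standard tools. A minimal sketch using only the Python standard library (the columns and values below are a small excerpt of the table above, not the full file):

```python
import csv
from io import StringIO

# A two-row excerpt of a Mikado metrics file, restricted to a few columns.
metrics_text = (
    "tid\tparent\tscore\tcdna_length\texon_num\n"
    "mikado.Chr5G1.2\tmikado.Chr5G1\t19.0\t1927\t11\n"
    "mikado.Chr5G1.1\tmikado.Chr5G1\t21.89\t1937\t11\n"
)

# DictReader maps each row onto the header names, so metrics can be
# addressed by name rather than by column index.
reader = csv.DictReader(StringIO(metrics_text), delimiter="\t")
rows = list(reader)

# Example query: the transcript with the longest cDNA.
longest = max(rows, key=lambda row: int(row["cdna_length"]))
```

For a real run, replace the `StringIO` wrapper with an `open()` call on the metrics file produced by mikado pick.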

Scoring files

This file contains the scores assigned to each metric for each transcript. Only metrics that have been used for the scoring will be present. This is the section of the scoring file corresponding to the GFF3 file above:

tid parent  score   blast_score     cdna_length     cds_not_maximal cds_not_maximal_fraction        combined_cds_fraction   combined_cds_intron_fraction    combined_cds_length     combined_cds_num        end_distance_from_junction      exon_fraction   exon_num        five_utr_length five_utr_num    highest_cds_exon_number intron_fraction number_internal_orfs    proportion_verified_introns     retained_fraction       retained_intron_num     selected_cds_fraction   selected_cds_intron_fraction    selected_cds_length     selected_cds_num        source_score    three_utr_length        three_utr_num
mikado.Chr5G1.2     mikado.Chr5G1   19.0    0.0     0.0     1       1       0.0     1       1       1       1       1       1       0.0     1.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     1.0
mikado.Chr5G1.1     mikado.Chr5G1   21.89   1.0     1.0     1       1       0.06    1       1       1       1       1       1       0.77    1.0     1       1       1.0     1       1       1       0.06    1       1       1       0       0.0     1.0
mikado.Chr5G2.2     mikado.Chr5G2   19.04   1       0.0     1       1       0.0     1       1       1       1       1       1       0.04    1.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     0.0
mikado.Chr5G2.1     mikado.Chr5G2   20.06   1       1.0     1       1       0.03    1       1       1       1       1       1       0.0     1.0     1       1       1.0     1       1       1       0.03    1       1       1       0       0.0     0.0
mikado.Chr5G3.1     mikado.Chr5G3   19.0    1.0     1.0     1       1       0.0     1.0     1.0     1.0     1       1.0     1.0     0.0     0.0     1.0     1.0     1.0     1       1       1       0.0     1.0     1.0     1.0     0       0.0     0.0
mikado.Chr5G3.2     mikado.Chr5G3   8.59    0.0     0.0     1       1       0.19    0.0     0.0     0.0     1       0.0     0.0     0.71    0.5     0.0     0.0     1.0     1       1       1       0.19    0.0     0.0     0.0     0       0.0     0.0
mikado.Chr5G4.1     mikado.Chr5G4   20.0    1       1       1       1       0.0     1       1       1       1       1       1       0.0     0.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     1.0
mikado.Chr5G5.1     mikado.Chr5G5   20.0    1       1       1       1       0.0     1       1       1       1       1       1       0.0     0.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     1.0
mikado.Chr5G6.3     mikado.Chr5G6   13.0    1       0.0     1       1       0.0     1       1       1       1       0.0     0.0     0.0     0.0     1       0.0     0.0     0.0     1       1       0.0     1       1       1       0       0.0     0.0
mikado.Chr5G6.2     mikado.Chr5G6   17.1    1       0.76    1       1       0.0     1       1       1       1       0.78    0.78    0.0     0.0     1       0.78    0.0     1.0     1       1       0.0     1       1       1       0       0.0     0.0
mikado.Chr5G6.1     mikado.Chr5G6   17.9    1       1.0     1       1       0.0     1       1       1       1       1.0     1.0     0.0     0.0     1       1.0     0.0     0.9     1       1       0.0     1       1       1       0       0.0     0.0

The final score value is obtained by summing the scores of all the individual metrics.
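As a sketch of this summation (using a small, hypothetical subset of per-metric scores, not a complete row from the table above):

```python
# Hypothetical subset of the per-metric scores for one transcript.
# In the real scoring file there is one such value per scored metric.
metric_scores = {
    "blast_score": 1.0,
    "cdna_length": 1.0,
    "intron_fraction": 1.0,
    "proportion_verified_introns": 0.9,
}

# The transcript's final score is simply the sum of the metric scores.
final_score = sum(metric_scores.values())
```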

Important

If you compare the scores assigned to transcripts at the loci level with those assigned at the subloci level, you will notice that the scores differ, and that even some of the raw metric values do. The former phenomenon is due to the fact that the Mikado scoring system is relative rather than absolute; the latter, to the fact that some metrics are locus-dependent, i.e. their values change with the presence or absence of other transcripts. A typical example is given by the “retained_intron” metrics: retained introns are identified by looking for non-coding regions of a transcript that fall inside the intron of another transcript. Changing the transcripts in the locus will change the value of this metric, as non-coding sections will or will not be classified as “retained introns”, and this in turn changes the score associated with both the metric and the transcript.

Transcript padding

After calculating the final loci, Mikado can try to uniform the ends of the transcripts in each locus, extending the shorter ones so that their ends coincide with those of the longer transcripts in the locus. The procedure is explained in more detail in the dedicated section of the Algorithms page. The approach was inspired by the consolidation strategy adopted by the Araport annotation for Arabidopsis thaliana [AraPort].
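Conceptually, the end-uniforming step can be pictured as follows (a deliberately simplified toy, not Mikado's actual implementation: the real procedure respects splice sites and configurable distance limits, as described in the Algorithms page):

```python
# Toy model: transcripts in a locus as (start, end) genomic coordinates.
# The coordinates are illustrative, loosely based on the Chr5G2 locus above.
transcripts = {
    "t1": (26588402, 26592561),  # the longest transcript
    "t2": (26588570, 26592561),  # shorter at the 5' end
    "t3": (26588402, 26592400),  # shorter at the 3' end
}

# Pad every transcript out to the outermost ends observed in the locus,
# so that all transcript ends coincide.
locus_start = min(start for start, _ in transcripts.values())
locus_end = max(end for _, end in transcripts.values())
padded = {tid: (locus_start, locus_end) for tid in transcripts}
```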

Usage

mikado pick allows some of the parameters of the run to be modified at runtime. However, some sections - such as most of the settings regarding alternative splicing - are left untouched by the utility, and are best modified by editing the configuration file itself. The available parameters are as follows:

  • json-conf: required. This is the configuration file created in the first step of the pipeline.

  • gff: optionally, it is possible to point Mikado pick to the GTF it should use directly on the command line. This file should be the output of the preparation step. Please note that this file must be in GTF format, sorted by chromosome and position; if that is not the case, Mikado will fail.

  • db: Optionally, it is possible to specify the database to Mikado on the command line, rather than on the configuration file. Currently, this option supports SQLite databases only.

  • Options related to how Mikado will treat the data:

    • intron_range: this option expects a pair of positive integers, in ascending order, indicating the 98% confidence interval into which most intron lengths should fall. Gene models with introns whose lengths fall outside of this range might be penalized, depending on the scoring system used. If uncertain, it is possible to use the included stats utility on the gene annotation of a closely related species.
    • no-purge: flag. If set, Mikado will not exclude putative fragments from the output, but will report them (appropriately flagged).
    • flank: for the purposes of identifying fragments, it is useful to consider together loci which are not necessarily overlapping but which lie relatively near each other on the genome sequence. This parameter (a positive integer) specifies the maximum distance Mikado will allow when gathering loci together for this purpose.
    • mode: how Mikado will treat BLAST and ORF data in the presence of putative chimeras. Please refer to the algorithms section for details.
  • Options regarding the output files:

    • output-dir: Output directory. By default, Mikado will write all files and the log on the current directory.
    • loci_out: required. This is the main output file, in GFF format.
    • prefix: this ID will be prefixed to all gene and transcript models. In general, IDs will be of the form “<prefix>.<chromosome><progressive ID>”. Default: Mikado.
    • source: source field prefix for the output files. Useful for eg loading Mikado runs into WebApollo [Apollo].
    • no_cds: if present, this flag will indicate to Mikado not to print out the CDS of selected models but only their transcript structures.
    • subloci_out: If requested, Mikado can output the data regarding the first intermediate step, ie the subloci. See the introduction for details.
    • monoloci_out: If requested, Mikado can output the data regarding the second intermediate step, ie the monosubloci. See the introduction for details.
  • Options regarding the resources to be used:

    • procs: number of processors to use.
    • start-method: multiprocessing start method. See the explanation on Python multiprocessing
    • single: flag. If present, multiprocessing will be disabled.
  • Options regarding logging:

    • log: name of the log file. By default, “pick.log”
    • verbose: sets the log level to DEBUG. Please be advised that the debug mode is extremely verbose and is best invoked only for real, targeted debugging sessions.
    • noverbose: sets the log level to ERROR. If set, in most cases, the log file will be practically empty.
    • log-level: this flag directly sets the log level. Available values: DEBUG, INFO, WARNING, ERROR.
  • Options related to padding:

    • pad: if set, this option will enforce transcript padding. The default is inferred from the configuration (on by default).
    • no-pad: if set, this option will disable transcript padding. The default is inferred from the configuration (on by default).
    • pad-max-splices: maximum amount of splicing sites that an expanded exon can cross. Default is inferred from the configuration file (currently the default is 1).
    • pad-max-distance: maximum amount of basepairs that transcripts can be padded with (per side). Default is inferred from the configuration file (default 300 bps).
    • fasta: genome FASTA file. Required if the padding is switched on. Default: inferred from the configuration file.
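
To pick sensible bounds for intron_range without the stats utility, one can take the central 98% of intron lengths from the annotation of a related species. A minimal sketch, assuming a pre-computed list of intron lengths (hypothetical helper, not part of Mikado):

```python
def intron_ci(lengths, ci=0.98):
    """Return (low, high) bounds spanning the central `ci` fraction
    of the supplied intron lengths."""
    s = sorted(lengths)
    tail = (1 - ci) / 2
    lo = s[int(tail * (len(s) - 1))]
    hi = s[round((1 - tail) * (len(s) - 1))]
    return lo, hi

# The resulting bounds can then be passed to mikado pick as "-i LOW HIGH".
```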

Usage:

$ mikado pick --help
usage: Mikado pick [-h] [--fasta GENOME]
                   [--start-method {fork,spawn,forkserver}] [--shm | --no-shm]
                   [-p PROCS] --configuration CONFIGURATION
                   [--scoring-file SCORING_FILE]
                   [-i INTRON_RANGE INTRON_RANGE]
                   [--no-pad | --pad | --codon-table CODON_TABLE]
                   [--pad-max-splices PAD_MAX_SPLICES]
                   [--pad-max-distance PAD_MAX_DISTANCE] [-r REGIONS]
                   [-od OUTPUT_DIR] [--subloci-out SUBLOCI_OUT]
                   [--monoloci-out MONOLOCI_OUT] [--loci-out LOCI_OUT]
                   [--prefix PREFIX] [--source SOURCE]
                   [--report-all-external-metrics] [--no_cds] [--flank FLANK]
                   [--max-intron-length MAX_INTRON_LENGTH] [--no-purge]
                   [--cds-only] [--as-cds-only] [--reference-update]
                   [--report-all-orfs] [--only-reference-update] [-eri] [-kdc]
                   [-mco MIN_CLUSTERING_CDNA_OVERLAP]
                   [-mcso MIN_CLUSTERING_CDS_OVERLAP] [--check-references]
                   [-db SQLITE_DB] [--single] [-l LOG]
                   [--verbose | --quiet | -lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                   [--mode {nosplit,stringent,lenient,permissive,split}]
                   [--seed SEED | --random-seed]
                   [gff]

Launcher of the Mikado pipeline.

positional arguments:
  gff

optional arguments:
  -h, --help            show this help message and exit
  --fasta GENOME, --genome GENOME
                        Genome FASTA file. Required for transcript padding.
  --start-method {fork,spawn,forkserver}
                        Multiprocessing start method.
  --shm                 Flag. If switched, Mikado pick will copy the database
                        to RAM (ie SHM) for faster access during the run.
  --no-shm              Flag. If switched, Mikado will force using the
                        database on location instead of copying it to /dev/shm
                        for faster access.
  -p PROCS, --procs PROCS
                        Number of processors to use. Default: look in the
                        configuration file (1 if undefined)
  --configuration CONFIGURATION, --json-conf CONFIGURATION
                        Configuration file for Mikado.
  --scoring-file SCORING_FILE
                        Optional scoring file for the run. It will override
                        the value set in the configuration.
  -i INTRON_RANGE INTRON_RANGE, --intron-range INTRON_RANGE INTRON_RANGE
                        Range into which intron lengths should fall, as a
                        couple of integers. Transcripts with intron lengths
                        outside of this range will be penalised. Default: (60,
                        900)
  --no-pad              Disable transcript padding.
  --pad                 Whether to pad transcripts in loci.
  --codon-table CODON_TABLE
                        Codon table to use. Default: 0 (ie Standard, NCBI #1,
                        but only ATG is considered a valid start codon).
  --pad-max-splices PAD_MAX_SPLICES
                        Maximum splice sites that can be crossed during
                        transcript padding.
  --pad-max-distance PAD_MAX_DISTANCE
                        Maximum amount of bps that transcripts can be padded
                        with (per side).
  -r REGIONS, --regions REGIONS
                        Either a single region on the CLI or a file listing a
                        series of target regions. Mikado pick will only
                        consider regions included in this string/file. Regions
                        should be provided in a WebApollo-like format:
                        <chrom>:<start>..<end>
  --no_cds              Flag. If set, no CDS information will be printed out
                        in the GFF output files.
  --flank FLANK         Flanking distance (in bps) to group non-overlapping
                        transcripts into a single superlocus. Default:
                        determined by the configuration file.
  --max-intron-length MAX_INTRON_LENGTH
                        Maximum intron length for a transcript. Default:
                        inferred from the configuration file (default value
                        there is 1,000,000 bps).
  --no-purge            Flag. If set, the pipeline will NOT suppress any loci
                        whose transcripts do not pass the requirements set in
                        the JSON file.
  --cds-only            Flag. If set, Mikado will only look for overlap in
                        the coding features when clustering transcripts
                        (unless one transcript is non-coding, in which case
                        the whole transcript will be considered). Please note
                        that Mikado will only consider the **best** ORF for
                        this. Default: False, Mikado will consider transcripts
                        in their entirety.
  --as-cds-only         Flag. If set, Mikado will only consider the CDS to
                        determine whether a transcript is a valid alternative
                        splicing event in a locus.
  --reference-update    Flag. If switched on, Mikado will prioritise
                        transcripts marked as reference and will consider any
                        other transcript within loci only in reference to these
                        reference transcripts. Novel loci will still be
                        reported.
  --report-all-orfs     Boolean switch. If set to true, all ORFs will be
                        reported, not just the primary.
  --only-reference-update
                        Flag. If switched on, Mikado will only keep loci where
                        at least one of the transcripts is marked as
                        "reference". CAUTION: if no transcript has been marked
                        as reference, the output will be completely empty!
  -eri, --exclude-retained-introns
                        Exclude all retained intron alternative splicing
                        events from the final output. Default: False. Retained
                        intron events that do not disrupt the CDS are kept by
                        Mikado in the final output.
  -kdc, --keep-disrupted-cds
                        Keep in the final output transcripts whose CDS is most
                        probably disrupted by a retained intron event.
                        Default: False. Mikado will try to detect these
                        instances and exclude them from the final output.
  -mco MIN_CLUSTERING_CDNA_OVERLAP, --min-clustering-cdna-overlap MIN_CLUSTERING_CDNA_OVERLAP
                        Minimum cDNA overlap between two transcripts for them
                        to be considered part of the same locus during the
                        late picking stages. NOTE: if --min-cds-overlap is not
                        specified, it will be set to this value! Default: 20%.
  -mcso MIN_CLUSTERING_CDS_OVERLAP, --min-clustering-cds-overlap MIN_CLUSTERING_CDS_OVERLAP
                        Minimum CDS overlap between two transcripts for them
                        to be considered part of the same locus during the
                        late picking stages. NOTE: if not specified, and
                        --min-cdna-overlap is specified on the command line,
                        min-cds-overlap will be set to this value! Default:
                        20%.
  --check-references    Flag. If switched on, Mikado will also check reference
                        models against the general transcript requirements,
                        and will also consider them as potential fragments.
                        This is useful in the context of e.g. updating an *ab-
                        initio* results with data from RNASeq, protein
                        alignments, etc.
  -db SQLITE_DB, --sqlite-db SQLITE_DB
                        Location of an SQLite database to overwrite what is
                        specified in the configuration file.
  --single              Flag. If set, Creator will be launched with a single
                        process, without involving the multithreading
                        apparatus. Useful for debugging purposes only.
  --mode {nosplit,stringent,lenient,permissive,split}
                        Mode in which Mikado will treat transcripts with
                        multiple ORFs. - nosplit: keep the transcripts whole.
                        - stringent: split multi-orf transcripts if two
                        consecutive ORFs have both BLAST hits and none of
                        those hits is against the same target. - lenient:
                        split multi-orf transcripts as in stringent, and
                        additionally, also when either of the ORFs lacks a
                        BLAST hit (but not both). - permissive: like lenient,
                        but also split when both ORFs lack BLAST hits - split:
                        split multi-orf transcripts regardless of what BLAST
                        data is available.
  --seed SEED           Random seed number. Default: 0.
  --random-seed         Generate a new random seed number (instead of the
                        default of 0)

Options related to the output files.:
  -od OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
  --subloci-out SUBLOCI_OUT
  --monoloci-out MONOLOCI_OUT
  --loci-out LOCI_OUT   This output file is mandatory. If it is not specified
                        in the configuration file, it must be provided here.
  --prefix PREFIX       Prefix for the genes. Default: Mikado
  --source SOURCE       Source field to use for the output files.
  --report-all-external-metrics
                        Boolean switch. If activated, Mikado will report all
                        available external metrics, not just those requested
                        for in the scoring configuration. This might affect
                        speed in Minos analyses.

Log options:
  -l LOG, --log LOG     File to write the log to. Default: decided by the
                        configuration file.
  --verbose
  --quiet
  -lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level. Default: retrieved by the configuration
                        file.
Technical details

mikado pick uses a divide-et-impera algorithm to find and analyse loci separately. As the data to be integrated with the transcripts is stored in the database rather than calculated on the fly, rerunning pick with different options takes little time and few resources. To keep the data sorted, Mikado will write out temporary files during the operation and merge them at the end of the run (see the function merge_loci_gff in the picking module).

The Mikado pipeline is composed of four different stages, that have to be executed serially:

  1. Mikado configure, for creating the configuration file that will be used throughout the run.
  2. Mikado prepare, for collapsing the input assemblies into a single file. After this step, it is possible to perform additional analyses on the data such as TransDecoder (highly recommended), Portcullis, or BLAST.
  3. Mikado serialise, to gather all external data into a single database.
  4. Mikado pick, to perform the actual selection of the best transcripts in each locus.
Compare
Overview

This Mikado utility allows the user to compare the transcripts from any two annotations. Its output allows:

  • To understand which reference transcript each prediction is most similar to
  • To understand which prediction transcript best represent each reference model
  • To obtain summary information about the similarity between the two annotations.

Mikado compare has been directly inspired by the popular Cuffcompare [Cufflinks] utility and by ParsEval [ParsEval]. Please note that while superficially similar to Cuffcompare in the style of the output files, Mikado compare is philosophically closer to ParsEval, as it will not try to aggregate transcripts in loci but will perform a pure comparison between the two annotation files. Mikado compare can accept BAM, BED12, GTF and GFF files as input, both for the reference and for the prediction.

Usage

Mikado compare is invoked by specifying the reference annotation and the desired mode of analysis. There are three possible options:

  1. In its default mode, compare will ask for a prediction annotation to compare the reference against.
  2. In the “self” mode, compare will do a self-comparison of the reference against itself, excluding matches between a transcript and itself from the possible results. It can be useful to glean the relationships between transcripts and genes in an annotation.
  3. In the “internal” mode of operations, compare will again perform a self-comparison, focussed on multi-isoform genes. For those, compare will perform and report all possible comparisons. It is useful to understand the relationships between the transcripts in a single locus.

Mikado stores the information of the reference in a specialised SQLite index, with a “.midx” suffix, which will be created by the program upon its first execution with a new reference. If the index file is already present, Mikado will try to use it rather than reading the annotation again.

Note

Starting from version 1.5, Mikado compare supports multiprocessing. Please note that memory usage scales approximately linearly with the amount of processes requested.

Command line
    usage: Mikado compare [-h] -r REFERENCE
                      (-p PREDICTION | --self | --internal | --index)
                      [--distance DISTANCE] [-pc] [-o OUT] [--lenient] [-eu]
                      [-n] [-erm] [-upa] [-l LOG] [-v] [-z]
                      [--processes PROCESSES]

optional arguments:
  -h, --help            show this help message and exit
  --distance DISTANCE   Maximum distance for a transcript to be considered a
                        polymerase run-on. Default: 2000
  -pc, --protein-coding
                        Flag. If set, only transcripts with a CDS (both in
                        reference and prediction) will be considered.
  -o OUT, --out OUT     Prefix for the output files. Default: mikado_compare
  --lenient             If set, exonic statistics will be calculated leniently
                        in the TMAP as well - ie they will consider an exon as
                        match even if only the internal junction has been
                        recovered.
  -eu, --exclude-utr    Flag. If set, reference and prediction transcripts
                        will be stripped of their UTRs (if they are coding).
  -n, --no-index, --no-save-index
                        Unless this flag is set, compare will save an index of
                        the reference to quicken multiple calls.
  -erm, --extended-refmap
                        Flag. If set, the RefMap will also contain recall and
                        precision statistics - not just the F1.
  -upa, --use-prediction-alias
                        Flag. If set, Mikado Compare will use the alias -
                        rather than the transcript ID - to report the results
                        for prediction transcripts in the TMAP and REFMAP
                        files.
  -l LOG, --log LOG
  -v, --verbose
  -z, --gzip            Flag. If set, TMAP and REFMAP files will be GZipped.
  --processes PROCESSES

Prediction and annotation files.:
  -r REFERENCE, --reference REFERENCE
                        Reference annotation file. By default, an index will
                        be created and saved with the suffix ".midx".
  -p PREDICTION, --prediction PREDICTION
                        Prediction annotation file.
  --self                Flag. If set, the reference will be compared with
                        itself. Useful for understanding how the reference
                        transcripts interact with each other.
  --internal            Flag. If set, for each gene with more than one
                        transcript isoform each will be compared to the
                        others. Useful for understanding the structural
                        relationships between the transcripts in each gene.
  --index               Flag. If set, compare will stop after having generated
                        the GFF index for the reference.
Output files

Mikado compare produces two tabular files, TMAP and RefMap, and one statistics file.

TMAP files

TMAP files are tabular files that store the information regarding the best match in the reference for each prediction. The columns are as follows:

  1. ref_id: Transcript ID of the matched reference model(s).
  2. ref_gene: Gene ID of the matched reference model(s).
  3. ccode: class code of the match. See the relevant section on Class codes.
  4. tid: Transcript ID of the prediction model.
  5. gid: Gene ID of the prediction model.
  6. tid_num_exons: Number of exons of the prediction model.
  7. ref_num_exons: Number of exons of the reference model.
  8. n_prec: Nucleotide precision of the prediction ( TP / (length of the prediction))
  9. n_recall: Nucleotide recall of the reference (TP / (length of the reference))
  10. n_f1: F1 of recall and precision at the nucleotide level.
  11. j_prec: Splice junction precision of the prediction model ( TP / (number of splice sites in the prediction))
  12. j_recall: Splice junction recall of the reference model ( TP / (number of splice sites in the reference))
  13. j_f1: F1 of recall and precision at the splice junction level.
  14. e_prec: Exon precision of the prediction model ( TP / (number of exons in the prediction)). NB: this value is calculated “leniently”, ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
  15. e_recall: Exon recall of the reference model ( TP / (number of exons in the reference))
  16. e_f1: F1 of recall and precision at the exon level.
  17. distance: Distance of the model from its putative match.
  18. location: location of the match, with the format <chromosome>:<start>..<end>

An example of TMAP file is as follows:

ref_id      ref_gene        ccode   tid     gid     tid_num_exons   ref_num_exons   n_prec  n_recall        n_f1    j_prec  j_recall        j_f1    e_prec  e_recall        e_f1    distance        location
AT5G66600.2 AT5G66600       =       cuff_cufflinks_star_at.23553.1  cuff_cufflinks_star_at.23553.1.gene     9       9       91.30   81.31   86.02   100.00  100.00  100.00  77.78   77.78   77.78   0       Chr5:26575000..26578163
AT5G66600.2 AT5G66600       C       cl_Chr5.6272    cl_Chr5.6272.gene       7       9       94.95   72.43   82.18   100.00  75.00   85.71   85.71   66.67   75.00   0       Chr5:26575000..26578087
AT5G66620.1,AT5G66630.1,AT5G66631.1 AT5G66620,AT5G66630,AT5G66631   f,j,j,G st_Stringtie_STAR.21710.15      st_Stringtie_STAR.21710.15.gene 8       11,10,1 19.13,19.95,35.98       54.57,45.65,100.00      28.33,27.76,52.92       28.57,64.29,0.00        20.00,50.00,0.00        23.53,56.25,0.00        12.50,37.50,0.00        9.09,30.00,0.00 10.53,33.33,0.00        0       Chr5:26588402..26598231

Notice that the third example is peculiar, as the prediction transcript matches not one but multiple reference transcripts: this is a fusion event.
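
Fusion rows like the one above pack multiple matches into comma-separated fields. A small illustrative parser (hypothetical helper, not part of Mikado) pairing each matched reference with its class code:

```python
def split_fusion_row(ref_id, ref_gene, ccode):
    """Pair the comma-separated TMAP fields of a fusion match.
    Returns (is_fusion, [(ref_id, ref_gene, class_code), ...])."""
    codes = ccode.split(",")
    is_fusion = codes[0] == "f"  # a leading 'f' qualifies the whole match
    per_ref = codes[1:] if is_fusion else codes
    return is_fusion, list(zip(ref_id.split(","), ref_gene.split(","), per_ref))

# For the third example row above, splitting "f,j,j,G" pairs
# AT5G66620.1 with "j", AT5G66630.1 with "j" and AT5G66631.1 with "G".
```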

RefMap files

RefMap files are tabular files which store the information regarding the best match for each reference transcript, among all possible prediction models. The columns of the file are as follows:

  1. ref_id: Transcript ID of the reference model.
  2. ccode: class code of the match. See the relevant section on Class codes.
  3. tid: Transcript ID of the prediction model.
  4. gid: Gene ID of the prediction model.
  5. nF1: F1 of recall and precision at the nucleotide level.
  6. jF1: F1 of recall and precision at the splice junction level.
  7. eF1: F1 of recall and precision at the exon level. NB: this value is calculated “leniently”, ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
  8. ref_gene: Gene ID of the reference model.
  9. best_ccode: Best possible class code found for any of the transcripts of the gene.
  10. best_tid: Transcript ID of the prediction model which fit best one of the transcript models of the reference gene.
  11. best_gid: Gene ID of the prediction model which fit best one of the transcript models of the reference gene.
  12. best_nF1: F1 of recall and precision at the nucleotide level, for the best possible comparison.
  13. best_jF1: F1 of recall and precision at the splice junction level, for the best possible comparison.
  14. best_eF1: F1 of recall and precision at the exon level, for the best possible comparison.
  15. location: location of the match, with the format <chromosome>:<start>..<end>

An example of a RefMap file is as follows:

ref_id      ccode   tid     gid     nF1     jF1     eF1     ref_gene        best_ccode      best_tid        best_gid        best_nF1        best_jF1        best_eF1    location
AT5G66610.1 =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   AT5G66610       =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   Chr5:26584780..26587912
AT5G66610.2 J       mikado.Chr5G4.2 mikado.Chr5G4   93.91   94.74   76.19   AT5G66610       =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   Chr5:26584774..26587912
AT5G66620.1 j       mikado.Chr5G6.1 mikado.Chr5G6   85.51   95.0    72.73   AT5G66620       j       mikado.Chr5G6.1 mikado.Chr5G6   85.51   95.0    72.73   Chr5:26588402..26592423
AT5G66630.1 n       mikado.Chr5G8.2 mikado.Chr5G8   93.27   94.74   76.19   AT5G66630       n       mikado.Chr5G8.2 mikado.Chr5G8   93.27   94.74   76.19   Chr5:26591981..26595922

Please note that the third example (AT5G66630.1) has as best possible match a fusion event.

Extended RefMap files

Mikado can optionally produce more detailed RefMap files, listing the details on recall and precision for each match (rather than just the F1 for each level). If the flag -erm is present on the command line, the following extra fields will be present in the file:

  1. nRecall: recall at the nucleotide level.
  2. nPrecision: precision at the nucleotide level.
  3. jRecall: recall at the junction level.
  4. jPrecision: precision at the junction level.
  5. eRecall: recall at the exon level.
  6. ePrecision: precision at the exon level.
  7. best_nRecall: recall at the nucleotide level, for the best possible comparison.
  8. best_nPrecision: precision at the nucleotide level, for the best possible comparison.
  9. best_jRecall: recall at the junction level, for the best possible comparison.
  10. best_jPrecision: precision at the junction level, for the best possible comparison.
  11. best_eRecall: recall at the exon level, for the best possible comparison.
  12. best_ePrecision: precision at the exon level, for the best possible comparison.
Stats files

These files provide a summary of the comparison between the reference and the prediction. An example is as follows:

Command line:
/home/lucve/miniconda3/envs/mikado2/bin/mikado compare -r reference.gff3 -p Daijin/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o compare -l compare.log
18 reference RNAs in 12 genes
22 predicted RNAs in  15 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 94.90  83.22  88.68
            Exon level (stringent): 80.56  71.60  75.82
              Exon level (lenient): 91.18  76.54  83.22
                 Splice site level: 95.19  81.15  87.61
                      Intron level: 96.84  88.19  92.31
                 Intron level (NR): 94.34  79.37  86.21
                Intron chain level: 69.23  50.00  58.06
           Intron chain level (NR): 69.23  50.00  58.06
      Transcript level (stringent): 55.56  45.45  50.00
  Transcript level (>=95% base F1): 72.22  59.09  65.00
  Transcript level (>=80% base F1): 72.22  59.09  65.00
         Gene level (100% base F1): 75.00  60.00  66.67
        Gene level (>=95% base F1): 83.33  66.67  74.07
        Gene level (>=80% base F1): 83.33  66.67  74.07

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 9
             Matched intron chains: 9
   Matching monoexonic transcripts: 4
    Matched monoexonic transcripts: 4
        Total matching transcripts: 13
         Total matched transcripts: 13

          Missed exons (stringent): 14/72  (19.44%)
           Novel exons (stringent): 23/81  (28.40%)
            Missed exons (lenient): 6/68  (8.82%)
             Novel exons (lenient): 19/81  (23.46%)
                    Missed introns: 3/53  (5.66%)
                     Novel introns: 13/63  (20.63%)

       Missed transcripts (0% nF1): 0/18  (0.00%)
        Novel transcripts (0% nF1): 3/22  (13.64%)
             Missed genes (0% nF1): 0/12  (0.00%)
              Novel genes (0% nF1): 3/15  (20.00%)

The first section of the file describes:

  1. Concordance of the two annotations at the base level (recall, precision, and F1)
  2. Concordance of the two annotation at the exonic level (recall, precision, and F1), in two ways:
    • “stringent”: only perfect exonic matches are considered.
    • “lenient”: in this mode, terminal exons are counted as a match if the internal border is matched. See the RGASP paper [RGASP] for details on the rationale.
  3. Concordance of the two annotations in regards with the splice junctions, analysed independently of one another.
  4. Concordance of the two annotations at the intron level.
  5. Concordance of the two annotations at the intron chain level - how many intron chains of the reference are found identical in the prediction. Only multiexonic models are considered for this level.
  6. Concordance of the two annotations at the transcript level, in three different modes:
    • “stringent”: in this mode, only perfect matches are considered.
    • “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
    • “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
  7. Concordance of the two annotations at the gene level, in three different modes:
    • “stringent”: in this mode, we consider reference genes for which it was possible to find at least one perfect match for one of its transcripts.
    • “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly. The best possible match is considered for this statistic.
    • “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly. The best possible match is considered for this statistic.

In the second section, the file reports how many of the intron chains, monoexonic transcripts and total transcripts in the reference were matched by at least one prediction transcript. Finally, in the third section the file reports the number of missed (present in the reference but not in the prediction) or novel (vice versa, present in the prediction but not in the reference) features.
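
The Sn, Pr and F1 columns of the statistics file are linked by the standard harmonic-mean relation, which can be checked directly against the reported numbers, e.g. the Base level row of the example above:

```python
def f1(recall, precision):
    """F1 as the harmonic mean of recall (Sn) and precision (Pr)."""
    return 2 * recall * precision / (recall + precision)

# Base level in the example stats file: Sn 94.90, Pr 83.22 -> F1 88.68
assert round(f1(94.90, 83.22), 2) == 88.68
```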

Note

Please note that a gene might be considered as “found” even if its best match is intronic, on the opposite strand, or not directly overlapping it (see the next section, in particular the Intronic, Fragment and No overlap categories).

Class codes

In addition to recall, precision and F1 values, Mikado assigns to each comparison between two transcripts a class code, which summarises the relationship between the two transcripts. The idea is lifted from the popular tool Cuffcompare, although Mikado greatly extends the catalogue of possible class codes. All class codes fall within one of the following categories:

  • Match: class codes of this type indicate concordance between the two transcript models.
  • Extension: class codes of this type indicate that one of the two models extends the intron chain of the other, without internal interruptions. The extension can be from either perspective - either the prediction extends the reference, or it is instead contained within the reference (so that switching perspectives, the reference would “extend” the prediction).
  • Alternative splicing: the two exon chains overlap but differ in significant ways.
  • Intronic: either the prediction is completely contained within the introns of the reference, or viceversa.
  • Overlap: the two transcript models generically overlap on their exonic sequence.
  • Fragment: the prediction is a fragment of the reference, in most cases because they are on opposite strands.
  • No overlap: the prediction and the reference are near but do not directly overlap.
  • Fusion: this special class code is a qualifier and it never appears on its own. When a transcript is defined as a fusion between two or more reference transcripts, it will appear on multiple lines, one for each of the matches. In each line, the class code will be an “f,” followed by the class code assigned to that particular comparison. So e.g. a prediction which matches two reference models, one with a “j” and another with an “o”, will be present in two different lines; the first one with a class code of f,j and the second with a class code of f,o. In the refmap file, if the fusion is the best match, the class code will be “f” followed by the class code for the individual reference transcript; e.g., “f,j”.

Available class codes

Class code Definition Reference multiexonic? Prediction multiexonic? Nucleotide: RC, PC, F1 Junction: RC, PC, F1 Reverse Category
               
= Complete intron chain match. True True NA 100%, 100%, 100% = Match
_ Complete match between two monoexonic transcripts. False False NA, NA, >=80% NA _ Match
n Intron chain extension, i.e. both transcripts are multiexonic and the prediction has novel splice sites outside of the reference transcript boundaries. True True 100%, < 100%, <100% 100%, < 100%, <100% c Extension
J Intron chain extension, i.e. both transcripts are multiexonic and the prediction has novel splice sites inside of the reference transcript boundaries. True True 100%, <= 100%, <100% 100%, < 100%, <100% C Extension
c The prediction is either multiexonic and with its intron chain completely contained within that of the reference, or monoexonic and contained within one of the reference exons. NA NA < 100%, 100%, NA < 100%, 100%, NA n Extension
C The prediction intron chain is completely contained within that of the reference transcript, but it partially debords either into its introns or outside of the reference boundaries. True True <= 100%, < 100%, < 100% < 100%, 100%, < 100% J or j Extension
j Alternative splicing event. True True NA <= 100%, 100%, < 100% j or C Alternative splicing
h Structural match between two models where no splice site is conserved but at least one intron of the reference and one intron of the prediction partially overlap. True True > 0%, > 0%, > 0% 0%, 0%, 0% h Alternative splicing
g The monoexonic prediction overlaps one or more exons of the reference transcript; the borders of the prediction cannot fall inside the introns of the reference. The prediction transcript can bridge multiple exons of the reference model. True False > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% G Alternative splicing
G Generic match of a multiexonic prediction transcript versus a monoexonic reference. False True > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% g Alternative splicing
o Generic overlap between two multiexonic transcripts, which do not share any overlap among their introns. True True > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% o Overlap
e Single exon transcript overlapping one reference exon and at least 10 bps of a reference intron, indicating a possible pre-mRNA fragment. True False > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% G Overlap
m Generic match between two monoexonic transcripts. False False NA, NA, < 80% NA m Overlap
i Monoexonic prediction completely contained within one intron of the reference transcript. True False 0%, 0%, 0% 0%, 0%, 0% ri Intronic
I Prediction completely contained within the introns of the reference transcript. True True 0%, 0%, 0% 0%, 0%, 0% rI Intronic
ri Reverse intron transcript - the monoexonic reference is completely contained within one intron of the prediction transcript. False True 0%, 0%, 0% 0%, 0%, 0% i Intronic
rI Multiexonic reference completely contained within the introns of the prediction transcript. True True 0%, 0%, 0% 0%, 0%, 0% I Intronic
f Fusion - this special code is applied when a prediction intersects more than one reference transcript. To be considered for fusions, candidate references must either share at least one splice junction with the prediction, or have at least 10% of its bases recalled. If two or more reference transcripts fit these constraints, then the prediction model is classified as a fusion. NA NA > 10%, 0%, 0% > 0%, 0%, 0% NA Fusion
x Monoexonic match on the opposite strand. NA False >0%, >0%, >0% 0%, 0%, 0% x or X Fragment
X Multiexonic match on the opposite strand. NA True >0%, >0%, >0% NA x or X Fragment
p The prediction is on the same strand of a neighbouring but non-overlapping transcript. Probable polymerase run-on NA NA 0%, 0%, 0% 0%, 0%, 0% p Fragment
P The prediction is on the opposite strand of a neighbouring but non- overlapping transcript. Probable polymerase run-on. NA NA 0%, 0%, 0% 0%, 0%, 0% P Fragment
u Unknown - no suitable model has been found near enough the prediction to perform a comparison. NA NA 0%, 0%, 0% 0%, 0%, 0% NA Unknown
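
The fusion criterion in the “f” row above can be expressed as a small predicate. This is an illustrative reading of the rule, not Mikado's actual code; the function names are hypothetical:

```python
def is_fusion_candidate(shared_junctions, base_recall):
    """A reference transcript is a candidate for a fusion call if the
    prediction shares at least one splice junction with it, or if at
    least 10% of the reference bases are recalled."""
    return shared_junctions >= 1 or base_recall >= 0.10

def is_fusion(candidates):
    """A prediction is classified as a fusion ("f") when two or more
    reference transcripts fit the constraints above."""
    return sum(1 for c in candidates if is_fusion_candidate(*c)) >= 2

# Two references qualify (one via a shared junction, one via base recall):
is_fusion([(1, 0.05), (0, 0.30), (0, 0.01)])  # True
```
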
Technical details

Mikado compare conceptualizes the reference annotation as a collection of interval trees, one per chromosome or scaffold, where each node corresponds to an array of genes at the location. The gene and transcript objects are stored separately. The location of each transcript model in the prediction is queried against the tree, with a padding (default 2kbps) to allow for neighbouring but non-overlapping genes, and the transcript itself is subsequently compared with each reference transcript contained in the hits. Each comparison will yield precision, recall and F1 values at the nucleotide, splice junction and exonic levels, together with an associated class code. The best match for the prediction is selected by choosing the comparison yielding the best splice junction F1 and the best nucleotide F1, in this order. If the prediction transcript overlaps two or more genes on the same strand, and for at least two of them it has a match with either 10% nucleotide recall or junction recall over 0%, it is deemed a fusion event, and its line in the tmap file will report the best match against each of the fused genes, separated by commas.
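
The best-match selection described above can be sketched as follows. The `Match` container and its field names are illustrative, not Mikado's actual classes:

```python
from collections import namedtuple

# Illustrative container for a single prediction/reference comparison.
Match = namedtuple("Match", ["ref_id", "j_f1", "n_f1"])

def best_match(comparisons):
    """Pick the comparison with the best splice junction F1,
    breaking ties on the best nucleotide F1, in this order."""
    return max(comparisons, key=lambda m: (m.j_f1, m.n_f1))

hits = [Match("ref1", 0.8, 0.90), Match("ref2", 0.8, 0.95), Match("ref3", 0.5, 1.0)]
best_match(hits).ref_id  # "ref2": same junction F1 as ref1, better nucleotide F1
```
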

Each calculated match against a reference transcript is stored as a potential best match for the reference transcript. At the end of the run, the hits for each reference transcript will be ordered using the following function:

@staticmethod
def result_sorter(result):

    """
    Method to sort the results for the refmap. Order:
    - CCode does not contain "x", "X", "P", "p" (i.e. fragments on the
    opposite strand or polymerase run-on fragments)
    - Junction F1 (j_f1)
    - Exonic F1 (e_f1)
    - Nucleotide F1 (n_f1)
    - "f" in ccode (i.e. transcript is a fusion)

    :param result: a ResultStorer object
    :type result: ResultStorer
    :return: (bool, float, float, float, bool)
    """

    bad_ccodes = {"x", "X", "P", "p"}

    orderer = (len(set.intersection(bad_ccodes, set(result.ccode))) == 0,
               result.j_f1, result.e_f1,
               result.n_f1,
               "f" in result.ccode)

    return orderer

This function is used to select both for the best match for the transcript, as well as to select among these matches for the best match for the gene.
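
For illustration, the sorter can be applied as a sort key. The `ResultStorer` stand-in below is a simplified mock (Mikado's actual class carries many more fields); the re-definition of `result_sorter` is a plain-function copy of the method above:

```python
from collections import namedtuple

# Minimal stand-in for Mikado's ResultStorer, for illustration only.
# In Mikado, ccode is a tuple of class code strings.
ResultStorer = namedtuple("ResultStorer", ["ccode", "j_f1", "e_f1", "n_f1"])

def result_sorter(result):
    bad_ccodes = {"x", "X", "P", "p"}
    return (len(bad_ccodes.intersection(set(result.ccode))) == 0,
            result.j_f1, result.e_f1, result.n_f1,
            "f" in result.ccode)

matches = [
    ResultStorer(("x",), 0.9, 0.9, 0.9),  # opposite-strand fragment: always ranked last
    ResultStorer(("j",), 0.7, 0.8, 0.8),
    ResultStorer(("=",), 1.0, 1.0, 1.0),
]
ranked = sorted(matches, key=result_sorter, reverse=True)
ranked[0].ccode  # ("=",) - the complete intron chain match wins
```
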

The interval tree data structure is created using Cython code originally part of the bx-python library, kindly provided by Dr. Taylor for modification and inclusion in Mikado. We subsequently integrated the improvements made on the code by Dr. Brent Pedersen in his quicksect fork.

We further modified the original code to allow “fuzzy matches”, i.e. some leeway in how far the edges of the query interval may be from the target intervals present in the tree.
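
The idea of a “fuzzy match” can be illustrated with a hypothetical helper (this is not the actual Cython implementation):

```python
def fuzzy_match(query, target, fuzz=0):
    """Return True if the query interval matches the target interval,
    allowing each edge to be up to `fuzz` bases away from the
    corresponding edge of the target."""
    q_start, q_end = query
    t_start, t_end = target
    return abs(q_start - t_start) <= fuzz and abs(q_end - t_end) <= fuzz

fuzzy_match((100, 200), (103, 198), fuzz=5)  # True: both edges within 5 bases
fuzzy_match((100, 200), (110, 198), fuzz=5)  # False: start edge is 10 bases away
```
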

The .midx files storing the annotation for repeated compare runs are SQLite files. In them, Mikado will store for each gene its coordinates, its transcripts, and the location of exons and CDS features. MIDX files make repeated runs considerably faster, as the program will not have to re-parse the GFF file.

Note

Before version 1.1, Mikado MIDX files were GZip-compressed files. If you try to use an old index, Mikado will complain about it and recreate it from scratch.

The comparison code is written in Cython and is crucial during the picking phase of Mikado, not just for the functioning of the comparison utility.

Mikado also comprises a dedicated utility, Mikado compare, to assess the similarity between two annotations.

Daijin
The Daijin pipeline for driving Mikado

snake_badge

No emperor or empress can lead their nation without a trusty chancellor to help them organise the bureaucracy. Daijin, the Japanese minister, has the same role in Mikado - it smooths the path to go from a collection of read inputs (both RNA-Seq and long reads) to a polished transcriptome assembly. The pipeline is based on Snakemake [Snake]; while Snakemake can support any scheduling system, our pipeline manager currently supports only three (SLURM, PBS and LSF), plus any DRMAA-compliant batch submission system. Other schedulers can be added upon request.

This page contains a detailed explanation on how to use Daijin. We also provide a tutorial that will guide you through using the manager for analysing an RNA-Seq sample for Sc. pombe.

Hint

It is possible to launch the two steps of the pipeline directly with Snakemake, using the snakefiles located in Mikado.daijin: assemble.smk for the first step, and mikado.smk for the second.

Warning

Starting from :TODO:

Configuring Daijin

daijin configure creates the configuration file that will drive Daijin, in YAML format. Most options can be specified by command line. Available parameters for the command line are:

  • out: Output file, i.e. the configuration file for Daijin.
  • od: Directory that Daijin will use as master for its jobs.
  • genome: Genome FASTA file. Required.
  • transcriptome: Optional reference transcriptome, to be used for alignment and assembly. Please note that Mikado will not use the reference annotation, it will only be used during the alignment and assembly steps.
  • name: Name to be applied to the genome, e.g. the species.

Warning

The name must not contain spaces or any non-standard characters. This string will be used, among other things, to determine the name of the index to be used by the aligners; invalid characters will cause them - and in cascade Daijin - to fail.

  • prot-db: this parameter specifies a protein database to be used for BLAST. If none is specified, this step will be omitted.

  • aligners: aligner(s) to be used during the run. Currently, Daijin supports the following aligners:

    • gsnap
    • star
    • tophat2
    • hisat2
  • assemblers: assembler(s) to be used during the run. Currently, Daijin supports the following RNA-Seq assemblers:

    • cufflinks
    • class2
    • stringtie
    • trinity
  • threads: Number of threads to be requested for parallelisable steps.

  • modes: Daijin can run Mikado in multiple modes regarding the handling of putative chimeras. Specify those you desire to execute here.

  • flank: Amount of flanking that Mikado will allow while looking for fragments around good gene loci. Default 1000 bps. It is advisable to reduce it in species with compact genomes.

  • scheduler: if Daijin has to execute the pipeline on a cluster (potentially using DRMAA), it is necessary to specify the scheduler here. At the moment we support the following widely used schedulers: PBS, LSF, SLURM.

  • r1, r2, samples: RNA-Seq reads 1, RNA-Seq reads 2, and sample name. At least one of each is required.

  • strandedness: if not specified, all samples will be assumed to be unstranded. Specify it as you would with HISAT or TopHat2.

  • cluster_config: if specified, a sample cluster configuration file will be copied in the specified location. A cluster configuration file has the following structure:

__default__:
    threads: 4
    memory: 10000
asm_trinitygg:
    memory: 20000
bam_sort:
    memory: 20000

As can be seen, it is a YAML file with two fields: __default__ and asm_trinitygg. The former specifies parameters common to all steps, such as the number of threads to request and the memory per job. The “asm_trinitygg” field specifies parameters to be applied to the asm_trinitygg rule only - in this case, increasing the requested memory (Trinity is a de novo assembler and uses much more memory than most reference-guided assemblers). Please note that the number of threads specified on the command line, or in the configuration file proper, takes precedence over the equivalent parameter in the cluster configuration file.

$ daijin configure --help
usage: daijin configure [-h] [-c CLUSTER_CONFIG] [--threads N] [-od OUT_DIR]
                        [-o OUT] [--scheduler {,SLURM,LSF,PBS}] [--name NAME]
                        --genome GENOME [--transcriptome TRANSCRIPTOME]
                        [-r1 R1 [R1 ...]] [-r2 R2 [R2 ...]]
                        [-s SAMPLES [SAMPLES ...]]
                        [-st {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...]]
                        -al
                        [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]]
                        -as
                        [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]]
                        [--scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml,spombe.yaml}]
                        [--copy-scoring COPY_SCORING]
                        [-m {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...]]
                        [--flank FLANK] [--prot-db PROT_DB [PROT_DB ...]]


optional arguments:
  -h, --help            show this help message and exit
  -al [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]], --aligners [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]]
                        Aligner(s) to use for the analysis. Choices: gsnap,
                        star, hisat, tophat2
  -as [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]], --assemblers [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]]
                        Assembler(s) to use for the analysis. Choices: class,
                        cufflinks, stringtie, trinity

Options related to how to run Daijin - threads, cluster configuration, etc.:
  -c CLUSTER_CONFIG, --cluster_config CLUSTER_CONFIG
                        Cluster configuration file to write to.
  --threads N, -t N     Maximum number of threads per job. Default: 4
  -od OUT_DIR, --out-dir OUT_DIR
                        Output directory. Default if unspecified: chosen name.
  -o OUT, --out OUT     Output file. If the file name ends in "json", the file
                        will be in JSON format; otherwise, Daijin will print
                        out a YAML file. Default: STDOUT.
  --scheduler {,SLURM,LSF,PBS}
                        Scheduler to use. Default: None - ie, either execute
                        everything on the local machine or use DRMAA to submit
                        and control jobs (recommended).

Arguments related to the reference species.:
  --name NAME           Name of the species under analysis.
  --genome GENOME, -g GENOME
                        Reference genome for the analysis, in FASTA format.
                        Required.
  --transcriptome TRANSCRIPTOME
                        Reference annotation, in GFF3 or GTF format.

Arguments related to the input paired reads.:
  -r1 R1 [R1 ...], --left_reads R1 [R1 ...]
                        Left reads for the analysis. Required.
  -r2 R2 [R2 ...], --right_reads R2 [R2 ...]
                        Right reads for the analysis. Required.
  -s SAMPLES [SAMPLES ...], --samples SAMPLES [SAMPLES ...]
                        Sample names for the analysis. Required.
  -st {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...], --strandedness {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...]
                        Strandedness of the reads. Specify it 0, 1, or number
                        of samples times. Choices: fr-unstranded, fr-
                        secondstrand, fr-firststrand.

Options related to the Mikado phase of the pipeline.:
  --scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml}
                        Available scoring files.
  --copy-scoring COPY_SCORING
                        File into which to copy the selected scoring file, for
                        modification.
  -m {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...], --modes {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...]
                        Mikado pick modes to run. Choices: nosplit, split,
                        permissive, stringent, lenient
  --flank FLANK         Amount of flanking for grouping transcripts in
                        superloci during the pick phase of Mikado.
  --prot-db PROT_DB [PROT_DB ...]
                        Protein database to compare against, for Mikado.

Warning

Executing Daijin in DRMAA mode requires a scheduler name: if DRMAA is requested but no scheduler is specified, Daijin will fail. If neither DRMAA nor a scheduler is requested, Daijin will fall back to local execution.

Tweaking the configuration file

Daijin can be configured so to run each assembler and/or aligner multiple times, with a different set of parameters each. This is achieved by specifying the additional command line arguments in the array for each program. For example, in this section:

align_methods:
  hisat:
  - ''
  - "-k 10"
asm_methods:
  class:
  - ''
  - "--set-cover"
  stringtie:
  - ''
  - "-T 0.1"

we are asking Daijin to execute each program twice:

  • HISAT2: once with default parameters, once reporting the 10 best alignments for each pair
  • CLASS2: once with default parameters, once using the “set-cover” algorithm
  • Stringtie: once with default parameters, once lowering the minimum expression to 0.1 TPM.
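
The expansion of such a configuration into individual runs might be sketched like this; the function and labels are illustrative, not Daijin's internal code:

```python
def expand_runs(methods):
    """Turn {tool: [extra_args, ...]} into a list of (run_label, extra_args)
    pairs, one run per parameter set; '' means default parameters."""
    runs = []
    for tool, params in methods.items():
        for index, extra in enumerate(params):
            runs.append((f"{tool}-{index}", extra))
    return runs

align_methods = {"hisat": ["", "-k 10"]}
expand_runs(align_methods)
# [("hisat-0", ""), ("hisat-1", "-k 10")]
```
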

If you are running Daijin on a cluster and the software tools have to be sourced or loaded, you can specify the versions and command to load in the configuration file itself. Edit the file at this section:

load:
  #  Commands to use to load/select the versions of the programs to use. Leave an empty
  #  string if no loading is necessary.
  blast: 'source blast-2.3.0'
  class: 'source class-2.12'
  cufflinks: ''
  gmap: ''
  hisat: 'source HISAT-2.0.4'
  mikado: 'source mikado-devel'
  portcullis: 'source portcullis-0.17.2'
  samtools: 'source samtools-1.2'
  star: ''
  stringtie: 'source stringtie-1.2.4'
  tophat: ''
  transdecoder: 'source transdecoder-3.0.0'
  trinity: ''

In this case, the cluster requires to load software on-demand using the source command, and we specify the versions of the programs we wish to use.

Regarding read alignment, it is also possible to specify the minimum and maximum permissible intron lengths:

short_reads:
 #  Parameters related to the reads to use for the assemblies. Fields:
 #  - r1: array of left read files.
 #  - r2: array of right read files. It must be of the same length as r1; if
 #    one or more of the samples are single-end reads, add an empty string.
 #  - samples: array of the sample names. It must be of the same length as r1.
 #  - strandedness: array of strand-specificity of the samples. It must be of the
 #    same length as r1. Valid values: fr-firststrand, fr-secondstrand, fr-unstranded.
 max_intron: 10000
 min_intron: 20
 r1:
 - SRR1617247_1.fastq.gz
 r2:
 - SRR1617247_2.fastq.gz
 samples:
 - SRR1617247
 strandedness:
 - fr-secondstrand

Regarding the Mikado stage of Daijin, the configuration file contains all the fields that can be found in a normal Mikado configuration file. All mikado-specific parameters are stored under the “mikado” field. It is also possible to modify the following:

  • blastx: these are parameters regarding the running of BLASTX. This field contains the following variables:

    • prot-db: this is an array of FASTA files to use as protein databases for the BLASTX step.
    • chunks: number of chunks to divide the BLASTX input into. When dealing with big input FASTA files and with a cluster at one's disposal, it is more efficient to chunk the input FASTA into multiple smaller files and execute BLASTX on them independently. The default number of chunks is 10. Increase it as you see fit - it often makes sense, especially on clusters, to launch a multitude of small jobs rather than a low number of big jobs.
    • evalue: maximum e-value for the searches.
    • max_target_seqs: maximum number of hits to report.
  • transdecoder: parameters related to transdecoder. At the moment only one is available:

    • min_protein_len: minimum protein length that TransDecoder should report. The default value set by Mikado, 30, is much lower than TransDecoder's own default (100); this is intentional, as the chimera-splitting algorithm relies on TransDecoder's ability to find incomplete short ORFs at different ends of a transcript.
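
The chunking strategy for BLASTX can be sketched as follows; this is an illustrative round-robin split, not necessarily the exact scheme Daijin uses:

```python
def chunk(records, n_chunks):
    """Split a list of FASTA records into n_chunks roughly equal groups,
    so that each group can be BLASTed as an independent job."""
    return [records[i::n_chunks] for i in range(n_chunks)]

chunk(["t1", "t2", "t3", "t4", "t5"], 2)
# [["t1", "t3", "t5"], ["t2", "t4"]]
```
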
Structure of the output directory

Daijin will organise the output directory in 5 major sections, plus the configuration file for the Mikado step:

  1. 1-reads: this folder contains links to the original read files.

  2. 2-alignments: this folder stores the indices built for each aligner, and the results. The structure is as follows:

    • output: this folder contains the final BAM files, sorted and indexed.

    • alignments.stats: this file contains a table reporting the most salient parameters derived from samtools stats calls onto the multiple BAM files.

    • One folder per aligner, inside which are present:

      • index folder, containing the genome indices for the tool
      • One folder per sample, containing the output BAM file.
  3. 3-assemblies: this folder contains the RNA-Seq assemblies which will be used as input by Mikado. The structure is as follows:

    • output: this folder contains the final GTF/GFF3 assemblies. For each combination of aligner/assembler that has been requested, Daijin will here provide:

    • assembly.stats: a tabular file collecting the most salient data from the statistics files generated for each assembly

    • One folder per assembly, containing tool-specific files and the final assembly.

  4. 4-portcullis: this folder contains the results of Portcullis, if its execution has been requested. The folder will contain the following:

    • output: folder which contains a merged BED file of reliable junctions, created by merging all junctions from all alignments.
    • One folder per alignment analysed. We redirect you to the documentation of the tool for more details.
  5. mikado.yaml: final output file of the assemble part of the pipeline. This file will act both as the configuration for Daijin and for Mikado; for a description of the Mikado-specific fields, we refer you to the section on the configuration of the tool.

  6. 5-mikado: this folder contains the results for mikado. It is organised as follows:

    1. a link to the genome FASTA, and corresponding FAI file (generated with samtools)

    2. Files created by the prepare step:

      • mikado_prepared.fasta
      • mikado_prepared.gtf
      • prepare.log
    3. transdecoder: this folder contains the results of the TransDecoder run against the mikado_prepared.fasta file. Mikado will use the file transcripts.fasta.transdecoder.bed as source for the ORF location.

    4. blast: this folder contains the BLAST data. In particular:

      • index: this folder contains the FASTA file of the protein database, and the BLAST database files.
      • fastas: Daijin will split mikado_prepared.fasta into multiple files, for easier runs on a cluster. This folder contains the split FASTAs.
      • xmls: this folder contains the XML files corresponding to the BLASTs of the files present in fastas
      • logs: this folder contains the log files corresponding to the BLASTs of the files present in fastas
    5. pick: this folder contains the results of Mikado pick. It is organised as follows:

      • One folder per requested Mikado chimera-splitting mode. Inside each folder, it is possible to find:

        • mikado-{mode}.loci.gff3: Final GFF3 output file.
        • mikado-{mode}.metrics.gff3: Final metrics output file, containing the metrics of the transcripts that have been selected.
        • mikado-{mode}.scores.gff3: Final metrics output file, containing the scores associated to the evaluated metrics, for each of the selected transcripts.
        • mikado-{mode}.loci.stats: statistics file derived using mikado util stats (see the section on this utility for details)
      • comparison.stats: this tabular file collects the most salient data from the statistics files generated for each pick mode.

Running the pipeline

Daijin executes the pipeline in two distinct phases, assemble and mikado. Both commands have the same command line interface, namely:

$ daijin assemble --help
   usage: daijin assemble [-h] [-c HPC_CONF] [-d] [--jobs N] [--cores [N]]
                       [--threads N] [--no_drmaa] [--rerun-incomplete]
                       [--forcerun TARGET [TARGET ...]] [--detailed-summary]
                       [--list] [--dag]
                       config

   positional arguments:
      config                Configuration file to use for running the transcript
                            assembly pipeline.

   optional arguments:
      -h, --help            show this help message and exit
      -c HPC_CONF, --hpc_conf HPC_CONF
                            Configuration file that allows the user to override
                            resource requests for each rule when running under a
                            scheduler in a HPC environment.
      -d, --dryrun          Do a dry run for testing.
      --jobs N, -J N        Maximum number of cluster jobs to execute
                            concurrently.
      --cores [N], -C [N]   Use at most N cores in parallel (default: 1000).
      --threads N, -t N     Maximum number of threads per job. Default: None (set
                            in the configuration file)
      --no_drmaa, -nd       Use this flag if you wish to run without DRMAA, for
                            example, if running on a HPC and DRMAA is not
                            available, or if running locally on your own machine
                            or server.
      --rerun-incomplete, --ri
                            Re-run all jobs the output of which is recognized as
                            incomplete.
      --forcerun TARGET [TARGET ...], -R TARGET [TARGET ...]
                            Force the re-execution or creation of the given rules
                            or files. Use this option if you changed a rule and
                            want to have all its output in your workflow updated.
      --detailed-summary, -D
                            Print detailed summary of all input and output files
      --list, -l            List resources used in the workflow
      --dag                 Do not execute anything and print the directed
                            acyclic graph of jobs in the dot language.

The available command parameters are:

  • config: the configuration file.
  • hpc_conf: cluster configuration file.
  • jobs: Maximum number of jobs that can be executed (if Daijin is in local mode) or be present in the submission queue (if Daijin is in DRMAA/cluster mode) at any one time.
  • dryrun: do not execute, just list all the commands that will be executed. Useful also for listing the rules that have to be executed.
  • cores: Maximum number of cores that Daijin can claim at any one time.
  • threads: Maximum number of cores/threads that can be assigned to any step of the pipeline.
  • rerun-incomplete: Ask Snakemake to check which steps have produced empty or incomplete output files, and re-execute them and all the downstream commands.
  • forcerun: force the re-execution or creation of the given rules or files.
Assemble

In the first step of the pipeline, Daijin will perform the following operations for each of the read datasets provided:

  1. Create the necessary indices for each of the aligner programs requested.
  2. Align the read dataset using all the different tools requested, in all the possible combinations of parameters requested.
    • For example, it is possible to ask each dataset to be aligned twice with TopHat2 - once with the “micro-exon” mode activated, the second time without. Both alignments will be run independently.
    • It is possible to specify which datasets are strand-specific and which are not, and moreover, it is possible to specify the kind of strand-specificity (fr-secondstrand, fr-firststrand).
  3. Call all the reliable junctions across the alignments using Portcullis.
  4. Create the statistics for the alignments using samtools stats, and merge them together in a single file.
  5. Assemble each alignment with all the tools requested, in all the parameter combinations desired.
  6. Call the statistics on each assembly using mikado util stats, and merge them together in a single file.
  7. Create the configuration file for Mikado.

So during this first step Daijin will go from raw reads files to multiple assemblies, and configure Mikado for the second step.

Assembly pipeline, as driven by Daijin

_images/daijin_assemble.svg

Example of a pipeline to assemble a single paired-end read dataset using one aligner (Hisat [Hisat]) and two different RNA-Seq assemblers (StringTie [StringTie] and CLASS2 [Class2] ). Reliable junctions from the three alignments are called and merged together using Portcullis.

Mikado

In this step, the Daijin manager will execute all the steps necessary to perform Mikado on the desired inputs. The manager will execute the following steps:

  1. Merge all the input assemblies together using Mikado prepare
  2. Execute TransDecoder [Trinity] on the transcript sequences, to retrieve their ORFs.
  3. Split the FASTA file in as many chunks as specified during configuration, and analyse them separately
  4. Execute BLASTX+ [Blastplus] on the split FASTAs, creating BLAST XML outputs.
  5. Run Mikado serialise to load the BLAST results, TransDecoder ORFs, and portcullis junctions into a single database.
  6. Run Mikado pick on the data, in the selected modes.
  7. Collate and collapse the statistics for each of the filtered assemblies.

daijin mikado by default should use as config the configuration file created by daijin assemble, which will be located in <output directory>/mikado.yaml.

Mikado pipeline, as driven by Daijin

_images/daijin_mikado.svg

Example of a typical Mikado pipeline. In this case the number of chunks for BLAST is limited - 10 - but we advise increasing this number for big datasets.

Hint

If you have already created some assemblies and wish to analyse them with Daijin, it is also possible to configure Mikado externally and use the resulting configuration file to guide Daijin. At the time of this writing, this is also the recommended protocol for including e.g. PacBio or EST alignments.

Mikado provides a pipeline manager, Daijin, to align and assemble transcripts with multiple methods and subsequently choose the best assemblies among the options. The pipeline is implemented using Snakemake [Snake].

Miscellaneous utilities
Mikado miscellaneous scripts

All these utilities can be accessed with the mikado util CLI. They perform relatively minor tasks.

awk_gtf

This utility is used to retrieve specific regions from a GTF file, without breaking any transcript feature. Any transcript falling even partly within the specified coordinates will be retained in its entirety.
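The retention criterion is a plain interval-overlap test: a transcript is kept in full whenever its span intersects the requested region at all, even by a single base. A minimal Python sketch of that logic (the function name is illustrative, not part of Mikado’s API):

```python
def overlaps_region(t_start, t_end, r_start, r_end):
    """Return True if the transcript span intersects the region at all.

    Even a 1 bp overlap is enough: the transcript is never truncated,
    it is either kept in its entirety or dropped."""
    return t_start <= r_end and t_end >= r_start

# A transcript hanging off the edge of the region is still kept whole:
print(overlaps_region(900, 1500, 1000, 2000))   # True
print(overlaps_region(2500, 3000, 1000, 2000))  # False
```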

Usage:

$ mikado util awk_gtf --help
usage: mikado.py util awk_gtf [-h] (-r REGION | --chrom CHROM) [-as]
                              [--start START] [--end END]
                              gtf [out]


positional arguments:
  gtf
  out

optional arguments:
  -h, --help            show this help message and exit
  -r REGION, --region REGION
                        Region defined as a string like <chrom>:<start>..<end>
  --chrom CHROM
  -as, --assume-sorted
  --start START
  --end END
class_codes

This utility is used to obtain information about any class code or category thereof.

Usage:

$ mikado util class_codes --help
usage: mikado util class_codes [-h]
                               [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                               [-c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]]
                               [-o OUT]
                               [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} ...]]

Script to print out the class codes.

positional arguments:
  {[],=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u}
                        Codes to query.

optional arguments:
  -h, --help            show this help message and exit
  -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
  -c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...], --category {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]
  -o OUT, --out OUT
convert

This utility is used to convert between GTF and GFF3 files, with the option of producing BED12 files as well. It is limited to converting transcript features, and will therefore ignore any other feature present (transposons, loci, etc.). The output of the conversion to GFF3 is fully GFF3 compliant.

Usage:

$ mikado util convert --help
usage: mikado.py util convert [-h] [-of {bed12,gtf,gff3}] gf [out]

positional arguments:
  gf
  out

optional arguments:
  -h, --help            show this help message and exit
  -of {bed12,gtf,gff3}, --out-format {bed12,gtf,gff3}
grep

This utility extracts specific transcripts and genes from an input GTF/GFF3 file. As input, it requires a text file either in the format “<transcript id><tab><gene id>”, or with simply one gene per line (in which case the “--genes” switch has to be invoked). If only some of the transcripts of a gene are included in the text file, the gene feature will be shrunk accordingly. The name is an obvious homage to the invaluable UNIX command that we all love.

Usage:

$ mikado util grep --help
usage: mikado.py util grep [-h] [-v] [--genes] ids gff [out]

Script to extract specific models from GFF/GTF files.

positional arguments:
  ids         ID file (format: mrna_id, gene_id - tab separated)
  gff         The GFF file to parse.
  out         Optional output file

optional arguments:
  -h, --help  show this help message and exit
  -v          Exclude from the gff all the records in the id file.
  --genes     Flag. If set, the program expects as ids only a list of genes,
              and will exclude/include all the transcripts children of the
              selected genes.
merge_blast

This script merges together various XML BLAST+ files into a single entity. It might be of use when the input data has been chunked into different FASTA files for submission to a cluster queue. It is also capable of converting from ASN files and of dealing with GZipped files.

Usage:

$ mikado util merge_blast --help
usage: mikado.py util merge_blast [-h] [-v] [-l LOG] [--out [OUT]]
                                  xml [xml ...]

positional arguments:
  xml

optional arguments:
  -h, --help         show this help message and exit
  -v, --verbose
  -l LOG, --log LOG
  --out [OUT]
metrics

This command generates the documentation for the available transcript metrics. It is generated dynamically by inspecting the code. The documentation in the introduction was generated using this utility.

Usage:

$ mikado util metrics --help
usage: mikado util metrics [-h]
                           [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                           [-o OUT]
                           [-c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]]
                           [metric [metric ...]]

Simple script to obtain the documentation on the transcript metrics.

positional arguments:
  metric

optional arguments:
  -h, --help            show this help message and exit
  -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
                        Format of the table to be printed out.
  -o OUT, --out OUT     Optional output file
  -c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...], --category {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]
                        Available categories to select from.
stats

This command generates a statistics file for GFF3/GTF files. The output is a table including the average, mode, and various quantiles for the different features present in a typical GFF file (genes, introns, exons, cDNAs, etc.). The operation can be quite time-consuming for large files, in which case it is advisable to request multiple processors.

Usage:

$ mikado util stats --help
usage: mikado.py util stats [-h] [--only-coding] [-p PROCS] gff [out]

GFF/GTF statistics script. It will compute median/average length of RNAs,
exons, CDS features, etc.

positional arguments:
  gff                   GFF file to parse.
  out

optional arguments:
  -h, --help            show this help message and exit
  --only-coding
  -p PROCS, --processors PROCS

A typical example statistics file can be found here, for the TAIR10 annotation.

trim

This utility trims the terminal exons of multiexonic transcripts, until they are either shrunk to the desired maximum length or reach the beginning/end of the CDS. It was used to generate the “trimmed” annotations for the analysis in the original Mikado paper.

Usage:

$ mikado util trim --help
usage: mikado.py util trim [-h] [-ml MAX_LENGTH] [--as-gtf] ann [out]

positional arguments:
  ann                   Reference GTF/GFF output file.
  out

optional arguments:
  -h, --help            show this help message and exit
  -ml MAX_LENGTH, --max_length MAX_LENGTH
                        Maximal length of trimmed terminal exons
  --as-gtf              Flag. If set, the output will be in GTF rather than
                        GFF3 format.
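The trimming rule can be sketched as follows: shrink a terminal exon down to the maximum allowed length, but never past a CDS boundary that falls inside it. This is an illustrative Python sketch under our own conventions (1-based inclusive coordinates, + strand), not the utility’s actual code:

```python
def trim_terminal_exon(start, end, max_len, cds_boundary=None, five_prime=True):
    """Shrink the terminal exon (start, end) to at most max_len bp.

    If the CDS begins (5' exon) or ends (3' exon) inside the exon, stop
    trimming there instead, so the CDS is never touched."""
    if end - start + 1 <= max_len:
        return start, end               # already short enough
    if five_prime:
        new_start = end - max_len + 1
        if cds_boundary is not None and cds_boundary < new_start:
            new_start = cds_boundary    # never cut into the CDS
        return new_start, end
    new_end = start + max_len - 1
    if cds_boundary is not None and cds_boundary > new_end:
        new_end = cds_boundary
    return start, new_end

print(trim_terminal_exon(1, 1000, 200))                    # (801, 1000)
print(trim_terminal_exon(1, 1000, 200, cds_boundary=700))  # (700, 1000)
```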
Included scripts

All the following scripts are included in the “util” folder in the source code, and will be available on the PATH after installation. Some of these scripts are used by the Daijin pipeline to produce statistics or to perform other intermediate steps.

add_transcript_feature_to_gtf.py

This script adds a top-level transcript feature to GTFs that lack it, e.g. those produced by CuffMerge [CuffMerge].

Usage:

$ add_transcript_feature_to_gtf.py --help
usage: Script to add a transcript feature to e.g. Cufflinks GTFs
       [-h] gtf [out]

positional arguments:
  gtf         Input GTF
  out         Output file. Default: stdout.

optional arguments:
  -h, --help  show this help message and exit
align_collect.py

This script is used to collect statistics from samtools stat. Usage:

$ align_collect.py  --help
usage: Script to collect info from multiple samtools stats files
       [-h] input [input ...]

positional arguments:
  input       The list of samtools stats file to process

optional arguments:
  -h, --help  show this help message and exit
asm_collect.py

This script is used to collect statistics obtained from the mikado util stats utility. Output is printed directly to the screen. Usage:

$ asm_collect.py -h
usage: Script to collect info from multiple mikado util stats files
       [-h] input [input ...]

positional arguments:
  input       The list of mikado util stats file to process

optional arguments:
  -h, --help  show this help message and exit
bam2gtf.py

This script uses PySam to convert read alignments into a GTF file. It is mostly useful for converting BAM alignments of long reads (e.g. PacBio) into a format which Mikado can interpret and use.

Usage:

$ bam2gtf.py --help
usage: Script to convert from BAM to GTF, for PB alignments [-h] bam [out]

positional arguments:
  bam         Input BAM file
  out         Optional output file

optional arguments:
  -h, --help  show this help message and exit
class_run.py

Python3 wrapper for the CLASS [Class2] assembler. It will perform the necessary operations for the assembler (depth and call of the splicing junctions), and launch the program itself. Usage:

$ class_run.py --help
usage: Quick utility to rewrite the wrapper for CLASS. [-h] [--clean]
                                                       [--force]
                                                       [-c CLASS_OPTIONS]
                                                       [-p PROCESSORS]
                                                       [--class_help] [-v]
                                                       [bam] [out]

positional arguments:
  bam                   Input BAM file.
  out                   Optional output file.

optional arguments:
  -h, --help            show this help message and exit
  --clean               Flag. If set, remove temporary files.
  --force               Flag. If set, it forces recalculation of all
                        intermediate files.
  -c CLASS_OPTIONS, --class_options CLASS_OPTIONS
                        Additional options to be passed to CLASS. Default: no
                        additional options.
  -p PROCESSORS, --processors PROCESSORS
                        Number of processors to use with class.
  --class_help          If called, the wrapper will ask class to display its
                        help and exit.
  -v, --verbose
getFastaFromIds.py

Script to extract a list of sequences from a FASTA file, using the pyfaidx [PyFaidx] module. Usage:

$ getFastaFromIds.py -h
usage: getFastaFromIds.py [-h] [-v] list fasta [out]

A simple script that retrieves the FASTA sequences from a file given a list of
ids.

positional arguments:
  list           File with the list of the ids to recover, one by line.
                 Alternatively, names separated by commas.
  fasta          FASTA file.
  out            Optional output file.

optional arguments:
  -h, --help     show this help message and exit
  -v, --reverse  Retrieve entries which are not in the list, as in grep -v (a
                 homage).
gffjunc_to_bed12.py

Script to convert a GFF junction file to a BED12 file. Useful to format the input for Mikado serialise.

Usage:

$ gffjunc_to_bed12.py --help
usage: GFF=>BED12 converter [-h] gff [out]

positional arguments:
  gff
  out

optional arguments:
  -h, --help  show this help message and exit
grep.py

A script to extract data from column files, using a list of targets. More efficient than a standard “grep -f” for this niche case.

Usage:

$ util/grep.py -h
usage: grep.py [-h] [-v] [-s SEPARATOR] [-f FIELD] [-q] ids target [out]

This script is basically an efficient version of the GNU "grep -f" utility for
table-like files, and functions with a similar syntax.

positional arguments:
  ids                   The file of patterns to extract
  target                The file to filter
  out                   The output file

optional arguments:
  -h, --help            show this help message and exit
  -v, --reverse         Equivalent to the "-v" grep option
  -s SEPARATOR, --separator SEPARATOR
                        The field separator. Default: consecutive
                        whitespace(s)
  -f FIELD, --field FIELD
                        The field to look in the target file.
  -q, --quiet           No logging.
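The speed-up over a standard “grep -f” comes from loading all patterns into a hash set once and testing a single field per line, instead of scanning each line for every pattern. A minimal Python sketch of the idea (illustrative, not the script’s actual code):

```python
def filter_table(ids, lines, field=0, reverse=False, sep=None):
    """Yield the lines whose chosen field is in ids (or not in ids, if
    reverse is True). Set membership is O(1), so the total cost is linear
    in the size of the table, regardless of the number of patterns."""
    wanted = set(ids)
    for line in lines:
        key = line.rstrip("\n").split(sep)[field]
        if (key in wanted) != reverse:
            yield line

rows = ["tx1\tgeneA", "tx2\tgeneB", "tx3\tgeneA"]
print(list(filter_table(["geneA"], rows, field=1, sep="\t")))
# ['tx1\tgeneA', 'tx3\tgeneA']
```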
merge_junction_bed12.py

This script will merge [Portcullis]-like junctions into a single BED12, using the thick start/ends as unique keys.

Usage:

$ merge_junction_bed12.py --help
usage: Script to merge BED12 files *based on the thickStart/End features*.
    Necessary for merging junction files such as those produced by TopHat
       [-h] [--delim DELIM] [-t THREADS] [--tophat] [-o OUTPUT] bed [bed ...]

positional arguments:
  bed                   Input BED files. Use "-" for stdin.

optional arguments:
  -h, --help            show this help message and exit
  --delim DELIM         Delimiter for merged names. Default: ;
  -t THREADS, --threads THREADS
                        Number of threads to use for multiprocessing. Default:
                        1
  --tophat              Flag. If set, tophat-like junction style is assumed.
                        This means that junctions are defined using the
                        blockSizes rather than thickStart/End. The script will
                        convert the lines to this latter format. By default,
                        the script assumes that the intron start/end are
                        defined using thickStart/End like in portcullis.
                        Mixed-type input files are not supported.
  -o OUTPUT, --output OUTPUT
                        Output file. Default: stdout
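The merging strategy can be sketched as keying each junction on its (chromosome, thickStart, thickEnd) triplet, which identifies the intron: duplicate junctions collapse onto a single record and their names are concatenated with the delimiter. An illustrative Python sketch (not the script’s actual code), shown on the first eight BED columns only:

```python
def merge_junctions(bed_lines, delim=";"):
    """Merge BED junction lines that describe the same intron, using
    (chrom, thickStart, thickEnd) as the unique key."""
    merged = {}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[6], fields[7])  # chrom, thickStart, thickEnd
        if key in merged:
            merged[key][3] += delim + fields[3]  # same intron: merge the names
        else:
            merged[key] = fields
    return ["\t".join(fields) for fields in merged.values()]

junctions = [
    "Chr1\t100\t500\tj1\t10\t+\t199\t400",
    "Chr1\t90\t520\tj2\t5\t+\t199\t400",   # same intron, wider anchors
]
print(merge_junctions(junctions))
```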
remove_from_embl.py

Quick script to remove sequences belonging to a given organism from SwissProt files, and to print them out in FASTA format. Used to produce the BLAST datasets for the Mikado paper. Usage:

$ remove_from_embl.py -h
usage: Script to remove sequences specific of a given organism from a SwissProt file.
       [-h] -o ORGANISM [--format {fasta}] input [out]

positional arguments:
  input
  out

optional arguments:
  -h, --help            show this help message and exit
  -o ORGANISM, --organism ORGANISM
                        Organism to be excluded
  --format {fasta}      Output format. Choices: fasta. Default: fasta.
sanitize_blast_db.py

Simple script to clean the headers of FASTA files, so as to avoid runtime errors and incongruities with BLAST and other tools that might be sensitive to long descriptions or the presence of special characters.

Usage:

$ sanitize_blast_db.py --help
usage: sanitize_blast_db.py [-h] [-o OUT] fasta [fasta ...]

positional arguments:
  fasta

optional arguments:
  -h, --help         show this help message and exit
  -o OUT, --out OUT
split_fasta.py

This script is used to split a FASTA file into a fixed number of files, with an approximately equal number of sequences in each. If the number of sequences in the input file is lower than the number of requested splits, the script will create the necessary number of empty files. It is used in the Daijin pipeline to prepare the input data for the BLAST analysis. Usage:

$ split_fasta.py --help
usage: Script to split FASTA sequences in a fixed number of multiple files.
       [-h] [-m NUM_FILES] fasta [out]

positional arguments:
  fasta                 Input FASTA file.
  out                   Output prefix. Default: filename+split

optional arguments:
  -h, --help            show this help message and exit
  -m NUM_FILES, --num-files NUM_FILES
                        Number of files to create. Default: 1000
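The balancing logic amounts to integer division with the remainder spread over the first files. A Python sketch of how the per-file sequence counts could be derived (illustrative, not the script’s actual code):

```python
def chunk_sizes(num_seqs, num_files):
    """Return how many sequences each output file receives: every file
    gets num_seqs // num_files sequences, and the remainder is spread one
    each over the first files. Files beyond num_seqs simply stay empty."""
    base, extra = divmod(num_seqs, num_files)
    return [base + (1 if i < extra else 0) for i in range(num_files)]

print(chunk_sizes(10, 4))  # [3, 3, 2, 2]
print(chunk_sizes(2, 5))   # [1, 1, 0, 0, 0] - the empty files are still created
```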
trim_long_introns.py

This script parses an annotation file and truncates any transcript which has UTR introns over the provided threshold. In such cases, the UTR section after the long intron is simply removed. Usage:

$ trim_long_introns.py --help
usage: This script truncates transcript with UTR exons separated by long introns.
       [-h] [-mi MAX_INTRON] gff [out]

positional arguments:
  gff
  out

optional arguments:
  -h, --help            show this help message and exit
  -mi MAX_INTRON, --max-intron MAX_INTRON
                        Maximum intron length for UTR introns.

Finally, Mikado provides some dedicated utilities to perform common tasks.

  • Some of the utilities are integral to the Mikado suite and can be accessed as subcommands of Mikado. These utilities comprise programs to calculate annotation statistics, retrieve or exclude specific loci from a file, etc.
  • Other utilities are provided as stand-alone scripts; while some of them directly depend on the Mikado library, this is not necessarily the case for them all.
Mikado core algorithms
Picking transcripts: how to define loci and their members

The Mikado pick algorithm

_images/Mikado_algorithm.jpeg

Schematic representation of how Mikado unbundles a complex locus into two separate genes.

Transcripts are scored and selected according to user-defined rules, based on many different features of the transcripts themselves (cDNA length, CDS length, UTR fraction, number of reliable junctions, etc.; please see the dedicated section on scoring for details on the scoring algorithm).

The detection and analysis of a locus proceeds as follows:

  1. When the first transcript is detected, Mikado will create a superlocus - a container of transcripts sharing the same genomic location - and assign the transcript to it.

  2. While traversing the genome, as long as any new transcript is within the maximum allowed flanking distance, it will be added to the superlocus.

  3. When the last transcript is added, Mikado performs the following preliminary operations:

    1. Integrate all the data from the database (including ORFs, reliable junctions in the region, and BLAST homology).

    2. If a transcript is monoexonic, assign or reverse its strand if the ORF data supports the decision

    3. If requested and the ORF data supports the operation, split chimeric transcripts - ie those that contain two or more non-overlapping ORFs on the same strand.

    4. Split the superlocus into groups of transcripts that:

      • share the same strand
      • have at least 1bp overlap
    5. Analyse each of these novel “stranded” superloci separately.

  4. Create subloci, ie group transcripts so as to minimise the probability of mistakenly merging multiple gene loci due to chimeras. These groups are defined as follows:

    • if the transcripts are multiexonic, they must share at least one intron, inclusive of the borders
    • if the transcripts are monoexonic, they must overlap by at least 1bp.
    • Monoexonic and multiexonic transcripts cannot be part of the same sublocus.
  5. Select the best transcript inside each sublocus:

    1. Score the transcripts (see the section on scoring)
    2. Select as winner the transcript with the highest score and assign it to a monosublocus
    3. Discard any transcript which is overlapping with it, according to the definitions in the point above
    4. Repeat the procedure from point 2 until no transcript remains in the sublocus
  6. Monosubloci are gathered together into monosubloci holders, ie the seeds for the gene loci. Monosubloci holders have more lenient parameters to group transcripts, as the first phase should have already discarded most chimeras. Once a holder is created by a single monosublocus, any subsequent candidate monosublocus will be integrated only if the following conditions are satisfied:

    • if the candidate is monoexonic, its exon must overlap at least one exon of a transcript already present in the holder

    • if the candidate is multiexonic and the holder contains only monoexonic transcripts, apply the same criterion, ie check whether its exons overlap the exons of at least one of the transcripts already present

    • if the candidate is multiexonic and the holder contains multiexonic transcripts, check whether one of the following conditions is satisfied:

      • at least one intron of the candidate overlaps with an intron of a transcript in the holder
      • at least one intron of the candidate is completely contained within an exon of a transcript in the holder
      • at least one intron of a transcript in the holder is completely contained within an exon of the candidate.
      • the cDNA overlap and CDS overlap between the candidate and the transcript in the holder are over a specified threshold.

    Optionally, it is possible to tell Mikado to use a simpler algorithm, and integrate together all transcripts that share exon space. Such a simpler algorithm risks, however, chaining together multiple loci - especially in small, compact genomes.

  7. Once the holders are created, apply the same scoring and selection procedure of the sublocus selection step. The winning transcripts are assigned to the final loci. These are called the primary transcripts of the loci.

  8. Once the loci are created, track back to the original transcripts of the superlocus:

    1. discard any transcript overlapping more than one locus, as these are probably chimeras.

    2. For those transcripts that are overlapping to a single locus, verify that they are valid alternative splicing events using the class code of the comparison against the primary transcript. Transcripts are re-scored dynamically when they are re-added in this fashion, to ensure their quality when compared with the primary transcript.

      • For coding loci, transcripts will be added as alternative splicing events only if they are in the same frame as the primary transcript. New in version 1.5.
    3. If there are transcripts that do not overlap any of the final loci, create a new superlocus with the missed transcripts and perform the scoring and selection again on them, until no transcript is unaccounted for.

  9. After the alternative splicing events have been defined, Mikado can optionally “pad” them. See the padding section for details.

  10. Finally detect and either tag or discard fragments inside the initial superlocus (irrespective of strand):

    1. Check whether the primary transcript of any locus meets the criteria to be defined as a fragment (by default, a maximum ORF of 30 AA and at most 2 exons; any transcript exceeding either threshold is by default considered a non-fragment)
    2. If so, verify whether they are near enough any valid locus to be considered as a fragment (in general, class codes which constitute the “Intronic”, “Fragmentary” and “No overlap” categories).
    3. If these conditions are met, tag the locus as a fragment. If requested, Mikado will just discard these transcripts (advised).

These steps help Mikado identify and solve fusions, detect correctly the gene loci, and define valid alternative splicing events.
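As an illustration of the sublocus grouping rule in step 4, the clustering criterion can be written as a predicate over two transcripts. This is a simplified Python sketch using our own representation (introns as (start, end) tuples, spans as genomic coordinates), not Mikado’s internal classes:

```python
def sublocus_intersecting(introns_a, span_a, introns_b, span_b):
    """Sublocus rule: multiexonic transcripts must share at least one
    intron (borders included); monoexonic transcripts must overlap by at
    least 1 bp; a monoexonic and a multiexonic transcript never end up in
    the same sublocus."""
    if introns_a and introns_b:            # both multiexonic
        return bool(set(introns_a) & set(introns_b))
    if not introns_a and not introns_b:    # both monoexonic
        return span_a[0] <= span_b[1] and span_b[0] <= span_a[1]
    return False                           # mixed mono/multi: never cluster

# Two multiexonic transcripts sharing the intron (201, 299):
print(sublocus_intersecting([(201, 299)], (1, 500),
                            [(201, 299), (601, 699)], (100, 900)))  # True
```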

Definition of retained introns

When gathering transcripts into loci, Mikado will try to identify and tag transcripts that contain retained intron events. For our purposes, a retained intron event is an exon which:

  • is part of a coding transcript but is not completely coding itself.
  • if it is an internal exon, it completely spans the putative retained intron.
  • if it is a terminal exon, it must start within the exon of the putative retained intron, and terminate within the intron.
  • if it constitutes a monoexonic transcript, at least one of the two ends must reside within the bordering exons.

In addition to this, a transcript might be tagged as having its CDS disrupted by the retained intron event if:

  • the non-coding part of the exon is in the 3’UTR and it begins within the intron
  • the exon is 3’ terminal, coding and it ends within the intron.

Warning

The definition of a retained intron is strictly context dependent, i.e. the same exon will be regarded as a “retained intron” when the transcript is gathered together with other transcripts, but as non-retained if the transcript is considered in isolation. It is therefore normal and expected that the associated metrics and scores will change, for a given transcript, across the various clustering stages.

Identification and breaking of chimeric transcripts

When a transcript contains more than one ORF, Mikado will try to determine whether this is due to a retained intron event or a frameshift (in which case the two ORFs are presumed to be mangled forms of an original, correct ORF for a single protein), or whether instead the transcript is polycistronic (in a prokaryote) or chimeric (in a eukaryote). The latter case is relatively common due to technical artefacts during the sequencing and assembly of RNA-Seq reads.

A chimeric transcript is defined by Mikado as a model with multiple ORFs, where:

  • all the ORFs share the same strand
  • all the ORFs are non-overlapping.

In these situations, Mikado can deal with the putative chimeras in five different ways, from most to least conservative:

  • nosplit: leave the transcript unchanged. The presence of multiple ORFs will affect the scoring.
  • stringent: leave the transcript unchanged, unless the two ORFs both have hits in the protein database and none of the hits is in common.
  • lenient: leave the transcript unchanged, unless either the two ORFs both have hits in the protein database, none of which is in common, or both have no hits in the protein database.
  • permissive: presume the transcript is a chimera, and split it, unless two ORFs share a hit in the protein database.
  • split: presume that every transcript with more than one ORF is incorrect, and split them.

If any BLAST hit spans the two ORFs, then the model will be considered as a non-chimera because there is evidence that the transcript constitutes a single unit. The only case when this information will be disregarded is during the execution of the split mode.

These modes can be controlled directly from the pick command line, or during the initial configuration stage.
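The five modes can be summarised as a single decision over the BLAST evidence for the ORFs. The sketch below is a simplified rendition of the rules above for a transcript with exactly two ORFs (names are ours; this is not Mikado’s actual code):

```python
def should_split(mode, hits_orf1, hits_orf2, spanning_hit=False):
    """Decide whether a two-ORF transcript is treated as a chimera.

    hits_orf1/hits_orf2 are sets of protein-database subject ids hit by
    each ORF; spanning_hit marks a single BLAST hit covering both ORFs."""
    if mode == "split":
        return True                     # always split multi-ORF models
    if mode == "nosplit":
        return False                    # never split; scoring is affected instead
    if spanning_hit:
        return False                    # one hit spans both ORFs: a single unit
    shared = hits_orf1 & hits_orf2
    if mode == "stringent":
        return bool(hits_orf1) and bool(hits_orf2) and not shared
    if mode == "lenient":
        return (bool(hits_orf1) and bool(hits_orf2) and not shared) or \
               (not hits_orf1 and not hits_orf2)
    if mode == "permissive":
        return not shared               # split unless the ORFs share a hit
    raise ValueError("Unknown mode: %s" % mode)

print(should_split("stringent", {"protA"}, {"protB"}))   # True
print(should_split("permissive", {"protA"}, {"protA"}))  # False
```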

Transcript measurements and scoring

In order to determine the best transcript for each locus, Mikado measures each available candidate according to various metrics and assigns a specific score for each of them. Similarly to RAMPART [Rampart], Mikado scores each transcript for each metric by assessing it relative to the other transcripts in the locus. The rescaling equation used for a given metric depends on the type of feature it represents:

  • metrics where higher values represent better transcript assemblies (“maximum”).
  • metrics where lower values represent better transcript assemblies (“minimum”)
  • metrics where values closer to a defined value represent better assemblies (“target”)

To allow for this tripartite scoring system with disparate input values, we employ rescaling equations so that each metric for each transcript is assigned a score between 0 and 1. Optionally, each metric can be assigned a weight, so that its maximum possible score is greater or smaller than 1. Formally, let metric \(m\) be one of the available metrics \(M\), \(t\) a transcript in locus \(L\), \(w_{m}\) the weight assigned to metric \(m\), and \(r_{mt}\) the raw value of metric \(m\) for \(t\). Then the score for metric \(m\) and transcript \(t\), \(s_{mt}\), is derived using one of the following three rescaling equations:

  • If higher values are best:
    \(s_{mt} = w_{m} * (\frac{r_{mt} - min(r_m)}{max(r_m)-min(r_m)})\)
  • If lower values are best:
    \(s_{mt} = w_{m} * (1 - \frac{r_{mt} - min(r_m)}{max(r_m)-min(r_m)})\)
  • If values closer to a target \(v_{m}\) are best:
    \(s_{mt} = w_{m} * (1 - \frac{|r_{mt} - v_{m}|}{max(|r_{m} - v_{m}|)})\)
Finally, the scores for each metric will be summed up to produce a final score for the transcript:
\(s_{t} = \sum_{m \in M} s_{mt}\).
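The three rescaling equations can be rendered directly in Python. The sketch below scores one metric across all transcripts of a locus (function and parameter names are ours, not part of Mikado’s API):

```python
def score_metric(values, rescaling, weight=1.0, target=None):
    """Rescale the raw values of a single metric across the transcripts
    of a locus to [0, weight], following the three equations above.
    rescaling is one of "max", "min" or "target"."""
    lo, hi = min(values), max(values)
    scores = []
    for raw in values:
        if rescaling == "target":
            denom = max(abs(v - target) for v in values)
            s = 1 - abs(raw - target) / denom if denom else 1.0
        elif hi == lo:
            s = 1.0                       # all transcripts tie on this metric
        elif rescaling == "max":          # higher values are best
            s = (raw - lo) / (hi - lo)
        else:                             # "min": lower values are best
            s = 1 - (raw - lo) / (hi - lo)
        scores.append(weight * s)
    return scores

# cDNA lengths scored with "max": the longest model earns the full weight.
print(score_metric([1000, 2000, 3000], "max"))  # [0.0, 0.5, 1.0]
```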

Not all the available metrics will necessarily be used for scoring; the choice of which to employ, and how to score and weight each of them, is left to the experimenter, although Mikado provides some pre-configured scoring files. Values that are guaranteed to be between 0 and 1 (e.g. a percentage value) can be used directly as scores, by setting the use_raw parameter to true for them.

Details on the structure of scoring files can be found in a dedicated section; we also provide a tutorial on how to create your own scoring file.

Important

The scoring algorithm is dependent on the other transcripts in the locus, so each score should not be taken as an absolute measure of the reliability of a transcript, but rather as a measure of its relative goodness compared with the alternatives. Shifting a transcript from one locus to another can have dramatic effects on the scoring of a transcript, even while most or all of the underlying metric values remain unchanged. This is why the score assigned to each transcript changes throughout the Mikado run, as transcripts are moved to subloci, monoloci and finally loci.

Padding transcripts

Mikado has the ability to pad transcripts in a locus, so as to uniformise their starts and ends and to infer the presence of missing exons from neighbouring data. The procedure is similar to the one employed by PASA and functions as follows:

  1. A transcript can function as template for a candidate if:
     • the candidate’s terminal exon falls within an exon of the template
     • the extension would enlarge the candidate by at most “ts_distance” basepairs (not including introns), default 2000 bps
     • the extension would add at most “ts_max_splices” splice sites to the candidate, default 2.
  2. A graph of possible extensions is built for both the 5’ end and the 3’ end of the locus. Transcripts are then divided in extension groups, starting with the outermost (ie the potential template for the group). Links that would cause chains (e.g. A can act as template for B and B can act as template for C, but A cannot act as template for C) are broken.
  3. Create a copy of the transcripts in the locus, for backtracking.
  4. Start expanding each transcript:
     1. Create a copy of the transcript, for backtracking.
     2. Calculate whether the 5’ terminal exon should be enlarged:
        • if the transcript exon terminally overlaps a template exon, enlarge it until the end of the template
        • if the template transcript has multiple exons upstream of the expanded exon, add those to the transcript
        • calculate the number of bases that have been added upstream to the cDNA of the transcript
     3. Calculate whether the 3’ terminal exon should be enlarged:
        • if the transcript exon terminally overlaps a template exon, enlarge it until the end of the template
        • if the template transcript has multiple exons downstream of the expanded exon, add those to the transcript
        • calculate the number of bases that have been added downstream to the cDNA of the transcript
     4. If the transcript is coding:
        1. Calculate the new putative CDS positions in the transcript, using the memoized number of basepairs added downstream and upstream.
        2. Calculate the new CDS, keeping the same frame as the original transcript. If the transcript is incomplete, this might lead to finding the proper start and stop codons.
        3. If an in-frame stop codon is found, the expansion would lead to an invalid transcript: backtrack.
  5. Recalculate metrics and scores.
  6. Check whether we have made any transcript an invalid alternative splicing event; possible common causes include:
     • having created a retained intron
     • having expanded the number or size of the UTRs so that the transcripts are no longer viable
  7. If any of the non-viable transcripts is either the primary transcript or one of the templates, remove the current templates from the locus and restart the analysis.
  8. Discard all the non-viable transcripts that are neither the primary transcript nor templates.
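The eligibility check in step 1 can be sketched in Python as follows. This is an illustrative re-implementation, not Mikado's own code: the function name, the exon representation (sorted 1-based inclusive (start, end) tuples) and the restriction to the 3' end are assumptions made for the example.

```python
def can_pad(candidate_exons, template_exons, ts_distance, ts_max_splices):
    """Check whether the template may pad the candidate's 3' end.
    Both arguments are sorted lists of (start, end) tuples, 1-based
    inclusive. Illustrative sketch only."""
    cand_end = candidate_exons[-1][1]
    # The candidate's terminal exon must fall within an exon of the template.
    if not any(start <= cand_end <= end for start, end in template_exons):
        return False
    added_bases = 0    # exonic basepairs the extension would add
    added_splices = 0  # splice sites the extension would cross
    for start, end in template_exons:
        if end <= cand_end:
            continue
        added_bases += end - max(start, cand_end + 1) + 1
        if start > cand_end:
            # Reaching a wholly downstream template exon crosses one
            # intron, i.e. two splice sites.
            added_splices += 2
    return added_bases <= ts_distance and added_splices <= ts_max_splices
```

For a candidate ending at position 300 inside a template exon (200, 350) with one further template exon (500, 600), the extension adds 151 exonic bases and crosses two splice sites, so it is allowed under limits of 2000 bps and 2 splice sites, but rejected if only one splice site may be crossed.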

When calculating the new ORF, Mikado will use the same codon table selected for the serialisation step.

This option is normally activated, with the parameters:

  • Default maximum splice sites that can be crossed: 2
  • Default maximum basepair distance: 1000

Note

Please consider that the parameters above refer to the expansion on each side of the transcript. They therefore allow a transcript to be expanded by up to 2000 bps in total, ie 1000 bps in each direction.

This option was written with ab initio predictions in mind, but it can also be used fruitfully with transcript assemblies.

Warning

Please note that some of the metrics might become invalid after the padding. In particular, BLASTX results will be invalid as the query sequence will have changed.

The options related to padding can be found under the pick section in the configuration file.
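As a reference, the following sketch shows how these options may appear in the configuration file. The exact nesting and key names are assumptions and should be verified against the output of mikado configure for your version (here they are assumed to sit under pick/alternative_splicing):

```yaml
pick:
  alternative_splicing:
    # enable or disable transcript padding
    pad: true
    # maximum basepairs a transcript may be expanded on each side
    ts_distance: 1000
    # maximum splice sites that may be crossed on each side
    ts_max_splices: 2
```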

Technical details

Most of the selection (ie “pick”) stage of the pipeline relies on the implementation of the objects in the loci submodule. In particular, the library defines an abstract class, “Abstractlocus”, which requires all its children to implement a version of the “is_intersecting” method. Each implementation of the method is specific to its stage: the superlocus class, for instance, requires only overlap between transcripts in its “is_intersecting” method, optionally with a flank and optionally restricting the groups to transcripts that share the same strand; the sublocus class implements a different algorithm; and so on. Scoring is performed by first asking to recalculate the metrics (.calculate_metrics) and subsequently to calculate the scores (.calculate_scores). Mikado will try to cache metrics and scores and avoid recalculating them as much as possible, to make the program faster.

Metrics are an extension of the property construct in Python3. Compared to normal properties, they are distinguished only by three optional descriptive attributes: category, usable_raw, and rtype. The main reason to subclass property is to allow Mikado to be self-aware of which properties will be used for scoring transcripts, and which will not. So, for example, in the following snippet from the Transcript class definition:

@property
def combined_cds(self):
    """This is a list which contains all the non-overlapping CDS
    segments inside the cDNA. The list comprises the segments
    as duples (start,end)."""
    return self.__combined_cds

@combined_cds.setter
def combined_cds(self, combined):
    """
    Setter for combined_cds. It performs some basic checks,
    e.g. that all the members of the list are integer duplexes.

    :param combined: list
    :type combined: list[(int,int)]
    """

    if ((not isinstance(combined, list)) or
            any(self.__wrong_combined_entry(comb) for comb in combined)):
        raise TypeError("Invalid value for combined CDS: {0}".format(combined))
    self.__combined_cds = combined

@Metric
def combined_cds_length(self):
    """This property return the length of the CDS part of the transcript."""
    c_length = sum([c[1] - c[0] + 1 for c in self.combined_cds])
    if len(self.combined_cds) > 0:
        assert c_length > 0
    return c_length

combined_cds_length.category = "CDS"

@Metric
def combined_cds_num(self):
    """This property returns the number of non-overlapping CDS segments
    in the transcript."""
    return len(self.combined_cds)

combined_cds_num.category = "CDS"

@Metric
def has_start_codon(self):
    """Boolean. True if the selected ORF has a start codon.
    :rtype: bool"""
    return self.__has_start_codon

@has_start_codon.setter
def has_start_codon(self, value):
    """Setter. Checks that the argument is boolean.
    :param value: boolean flag
    :type value: bool
    """

    if value not in (None, False, True):
        raise TypeError(
            "Invalid value for has_start_codon: {0}".format(type(value)))
    self.__has_start_codon = value

has_start_codon.category = "CDS"

Mikado will recognize that “combined_cds” is a normal property, while “combined_cds_length”, “combined_cds_num” and “has_start_codon” are Metrics (and as such, we can assign them a “category”; by default, that attribute is None). Please note that in every other regard Metrics behave and are coded like normal properties, including docstrings and setters/deleters.
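The mechanism can be reproduced with a minimal sketch (illustrative, not Mikado's actual code): subclassing property changes nothing functionally, but gives the descriptors a distinct type that can be discovered by introspection, and lets instances accept extra attributes such as category, which plain property objects do not allow.

```python
class Metric(property):
    """A property subclass. The distinct type lets code tell scoring
    metrics apart from ordinary properties; being a Python subclass, its
    instances also accept arbitrary attribute assignment."""
    category = None
    usable_raw = False
    rtype = None


class Transcript:
    def __init__(self, cds_segments):
        # list of non-overlapping (start, end) duples, 1-based inclusive
        self.__combined_cds = cds_segments

    @property
    def combined_cds(self):
        return self.__combined_cds

    @Metric
    def combined_cds_length(self):
        return sum(end - start + 1 for start, end in self.__combined_cds)


# Assigning the category after the class body, as in the snippet above,
# works because Metric is a subclass of property.
Transcript.combined_cds_length.category = "CDS"


def get_available_metrics(cls):
    """Collect the names of all Metric-typed attributes of a class."""
    return sorted(name for name in dir(cls)
                  if isinstance(getattr(cls, name), Metric))
```

Here get_available_metrics(Transcript) returns only ["combined_cds_length"], since combined_cds is a plain property.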

The requirements expression is evaluated using eval.

Warning

While we took pains to ensure that the expression is properly sanitised and inspected before eval, Mikado might still be permeable to clever code-injection attacks. Do not execute Mikado with superuser privileges if you want to guard against such attacks, and always inspect third-party YAML scoring files before execution!

Scoring files

Mikado employs user-defined configuration files to define the desirable features in genes. These files are in TOML, YAML or JSON format (default YAML) and are composed of five sections:

  1. a requirements section, specifying the minimum requirements that a transcript must satisfy to be considered as valid. Any transcript failing these requirements will be scored at 0 and purged.
  2. a cds_requirements section, specifying minimal conditions for a transcript to retain its CDS. If a transcript fails this check, it will be stripped of its coding component. If this section is not provided, the default will be to copy the requirements section above (in practice disabling it).
  3. a not_fragmentary section, specifying the minimum requirements that the primary transcript of a locus has to satisfy in order for the locus not to be considered as a putative fragment.
  4. an as_requirements section, which specifies the minimum requirements for transcripts for them to be considered as possible valid alternative splicing events.
  5. a scoring section, specifying which features Mikado should look for in transcripts, and how each of them will be weighted.

Conditions are specified using a strict set of available operators and the values they have to consider.

Important

Although at the moment Mikado does not offer any method to establish machine-learning based scoring configurations, it is a topic we plan to investigate in the future. Mikado already supports Random Forest Regressors as scorers through Scikit-learn, but we have yet to devise a proper way to create such regressors.

We provide a guide on how to write your own scoring files in a separate tutorial.

Operators

Mikado allows the following operators to express a relationship inside the scoring files:

  • eq: equal to (\(=\)). Valid for comparisons with numbers, boolean values, and strings.
  • ne: different from (\(\neq\)). Valid for comparisons with numbers, boolean values, and strings.
  • lt: less than (\(<\)). Valid for comparisons with numbers.
  • gt: greater than (\(>\)). Valid for comparisons with numbers.
  • le: less or equal than (\(\le\)). Valid for comparisons with numbers.
  • ge: greater or equal than (\(\ge\)). Valid for comparisons with numbers.
  • in: member of (\(\in\)). Valid for comparisons with arrays or sets.
  • not in: not member of (\(\notin\)). Valid for comparisons with arrays or sets.
  • within: value comprised in the range of the two values, inclusive.
  • not within: value not comprised in the range of the two values, inclusive.

Mikado will fail if an operator not present on this list is specified, or if the operator is assigned to compare against the wrong data type (eg. eq with an array).
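The operator names map naturally onto Python's operator module. The following is a sketch of how such an evaluator could be implemented (illustrative only; this is not Mikado's internal code):

```python
import operator

# Mapping from the operator names allowed in scoring files to callables.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "gt": operator.gt,
    "le": operator.le,
    "ge": operator.ge,
    "in": lambda value, container: value in container,
    "not in": lambda value, container: value not in container,
    "within": lambda value, interval: min(interval) <= value <= max(interval),
    "not within": lambda value, interval: not (
        min(interval) <= value <= max(interval)),
}


def evaluate(metric_value, op_name, target):
    """Apply one of the allowed comparison operators; fail loudly on
    unknown operator names, as Mikado does."""
    try:
        op = OPERATORS[op_name]
    except KeyError:
        raise ValueError("Unknown operator: {0}".format(op_name))
    return op(metric_value, target)
```

For example, evaluate(10, "within", [5, 20]) checks that 10 falls in the inclusive range 5..20.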

The “requirements”, “cds_requirements”, “as_requirements” and “not_fragmentary” sections

These sections specify the minimum requirements for a transcript at various stages.

  • A transcript failing to pass the requirements check will be discarded outright (if “purge” is selected) or given a score of 0 otherwise.
  • After passing the requirements section, if the transcript is coding, Mikado will check whether its CDS passes the requirements specified in cds_requirements. If the check fails, the transcript will be stripped of its CDS, and will only be considered further as a non-coding transcript. This check can be used to properly consider transcripts that have a suspicious coding structure - e.g. a single coding exon in a transcript with five or more exons, or a low Coding Potential score coming from a third-party tool.
  • If a transcript has not been selected as the primary transcript of a locus, it has to pass the as_requirements check to be considered as a valid alternative splicing event.
  • Finally, after loci have been defined, the primary transcript of each locus will be checked against the not_fragmentary expression. Any locus failing this check will be marked as a potential fragment and, if in the vicinity of other loci, might be purged out of the final output or clearly marked as a fragment (depending on whether the purge switch is set to true or false, respectively).

It is strongly advised to use lenient parameters in the first requirements section, as failing to do so might result in discarding whole loci. Typically, transcripts filtered at this step should be obvious artefacts, eg monoexonic transcripts produced by RNA-Seq with a total length lower than the library fragment length.

Each of these sections follows the same template, and they are composed by two parts:

  • parameters: a list of the metrics to be considered. Each metric can be considered multiple times, by suffixing it with a “.<id>” construct (eg cdna_length.mono vs. cdna_length.multi to distinguish two uses of the cdna_length metric, once for monoexonic and once for multiexonic transcripts). Any parameter which is not a valid metric name, after removal of the suffix, will cause an error. Parameters have to specify the following:

    • a value that the metric has to be compared against
    • an operator that specifies the target operation. See the operators section.
  • expression: a string array that will be compiled into a valid boolean expression. All the metrics present in the expression string must be present in the parameters section. If an unrecognized metric is present, Mikado will crash immediately, complaining that the scoring file is invalid. Apart from brackets, Mikado accepts only the following boolean operators to chain the metrics:

    • and
    • or
    • not
    • xor

Hint

If no expression is specified, Mikado will construct one by chaining all the provided parameters with the and operator. Most of the time this results in unexpected behaviour, ie Mikado assigning a score of 0 to most transcripts. It is strongly advised to provide a valid expression explicitly.

As an example, the following snippet replicates a typical requirements section found in a scoring file:

requirements:
  expression: [((exon_num.multi and cdna_length.multi and max_intron_length and min_intron_length), or,
    (exon_num.mono and cdna_length.mono))]
  parameters:
    cdna_length.mono: {operator: gt, value: 50}
    cdna_length.multi: {operator: ge, value: 100}
    exon_num.mono: {operator: eq, value: 1}
    exon_num.multi: {operator: gt, value: 1}
    max_intron_length: {operator: le, value: 200000}
    min_intron_length: {operator: ge, value: 5}

In the parameters section, we ask for the following:

  • exon_num.mono: monoexonic transcripts must have one exon (“eq”)
  • exon_num.multi: multiexonic transcripts must have more than one exon (“gt”)
  • cdna_length.mono: monoexonic transcripts must have a length greater than 50 bps (the “.mono” suffix is arbitrary, as long as it is unique for all calls of cdna_length)
  • cdna_length.multi: multiexonic transcripts must have a length greater than or equal to 100 bps (the “.multi” suffix is arbitrary, as long as it is unique for all calls of cdna_length)
  • max_intron_length: multiexonic transcripts should not have any intron longer than 200,000 bps.
  • min_intron_length: multiexonic transcripts should not have any intron smaller than 5 bps.

The expression field will be compiled into the following expression:

(exon_num > 1 and cdna_length >= 100 and max_intron_length <= 200000 and min_intron_length >= 5) or (exon_num == 1 and cdna_length > 50)

Any transcript for which the expression evaluates to false will be assigned a score of 0 outright and discarded, unless the user has chosen to disable the purging of such transcripts.
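The mechanics of this compilation can be reproduced in plain Python (an illustrative sketch, not Mikado's internal implementation; the metric values are invented for the example): each suffixed parameter is first evaluated to a boolean against the underlying metric, then substituted into the expression.

```python
import operator

# Hypothetical metric values for one multiexonic transcript.
metrics = {"exon_num": 3, "cdna_length": 400,
           "max_intron_length": 15000, "min_intron_length": 60}

# The parameters section: each (possibly suffixed) name maps to an
# operator and a comparison value, as in the YAML snippet above.
parameters = {
    "cdna_length.mono": ("gt", 50),
    "cdna_length.multi": ("ge", 100),
    "exon_num.mono": ("eq", 1),
    "exon_num.multi": ("gt", 1),
    "max_intron_length": ("le", 200000),
    "min_intron_length": ("ge", 5),
}

OPS = {"eq": operator.eq, "gt": operator.gt,
       "ge": operator.ge, "le": operator.le}

# Evaluate each parameter against the metric it refers to (suffix
# stripped), keyed by a valid Python identifier.
evaluated = {}
for name, (op, value) in parameters.items():
    metric = name.split(".")[0]
    evaluated[name.replace(".", "_")] = OPS[op](metrics[metric], value)

# Substitute the booleans into the compiled expression.
expression = ("(exon_num_multi and cdna_length_multi and max_intron_length"
              " and min_intron_length) or (exon_num_mono and cdna_length_mono)")
passed = eval(expression, {"__builtins__": {}}, evaluated)  # True here
```

The example transcript (3 exons, 400 bps, introns between 60 and 15,000 bps) satisfies the multiexonic clause, so the expression evaluates to True.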

The scoring section

This section specifies which metrics will be used by Mikado to score the transcripts. Each metric to be used is specified as a subsection of the configuration, and will have the following attributes:

  • rescaling: the only compulsory attribute. It specifies the kind of scoring that will be applied to the metric, and it has to be one of “max”, “min”, or “target”. See the explanation on the scoring algorithm for details.

  • value: compulsory if the chosen rescaling algorithm is “target”. This should be either a number or a boolean value.

  • multiplier: the weight assigned to the metric in terms of scoring. This parameter is optional; if absent, as it is in the majority of cases, Mikado will consider the multiplier to be equal to 1. This is the \(w_{m}\) element in the equations above.

  • filter: it is possible to specify a filter which the metric has to pass in order to be considered for scoring, eg “cdna_length >= 200”. If the transcript fails to pass this filter, the score for this metric only will be set to 0. The filter can also refer to a different metric: it is possible, for example, to assign a score of 0 for “combined_cds” to any transcript whose “cdna_length” is below, say, 150 bps. A “filter” subsection has to specify the following:

    • operator: the operator to apply for the boolean expression. See the relative section.
    • value: value that will be used to assess the metric.
    • metric: optional. The metric that this filter refers to. If omitted, this will be identical to the metric under examination.

Hint

The purpose of the filter section is to allow fine-tuning of the scoring mechanism; ie it makes it possible to penalise transcripts with undesirable qualities (eg a possible retained intron) without discarding them outright. As such, it is a less harsh version of the requirements section and the preferred way of specifying which transcript features Mikado should be wary of.

For example, this is a snippet of a scoring section:

scoring:
    blast_score: {rescaling: max}
    cds_not_maximal: {rescaling: min}
    combined_cds_fraction: {rescaling: target, value: 0.8, multiplier: 2}
    five_utr_length:
        filter: {operator: le, value: 2500}
        rescaling: target
        value: 100
    end_distance_from_junction:
        filter: {operator: lt, value: 55}
        rescaling: min
    non_verified_introns_num:
        rescaling: max
        multiplier: -10
        filter:
            operator: gt
            value: 1
            metric: exon_num

Using this snippet as a guide, Mikado will score transcripts in each locus as follows:

  • Assign a full score (one point, as no multiplier is specified) to transcripts which have the greatest blast_score
  • Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest amount of CDS bases in secondary ORFs (cds_not_maximal)
  • Assign a full score (two points, as a multiplier of 2 is specified) to transcripts that have a total amount of CDS bps approximating 80% of the transcript cDNA length (combined_cds_fraction)
  • Assign a full score (one point, as no multiplier is specified) to transcripts that have a 5’ UTR whose length is nearest to 100 bps (five_utr_length); if the 5’ UTR is longer than 2,500 bps, this score will be 0 (see the filter section)
  • Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest distance between the CDS end and the most downstream exon-exon junction (end_distance_from_junction). If such a distance is greater than 55 bps, assign a score of 0, as it is a probable target for NMD (see the filter section).
  • Assign a maximum penalty (minus 10 points, as a negative multiplier is specified) to the transcript with the highest number of non-verified introns in the locus.
    • Again, we are using a “filter” section to define which transcripts will be exempted from this scoring (in this case, a penalty)
    • However, please note that we are using the keyword metric in this section. This tells Mikado to check a different metric when evaluating the filter. Specifically, in this case we are exempting monoexonic transcripts from the penalty. This makes sense, as a monoexonic transcript has no introns to be confirmed in the first place.

Note

The possibility of using different metrics for the “filter” section is present from Mikado 1.3 onwards.
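The three rescaling modes named above can be summarised with a short sketch. This is an illustrative re-implementation of the min-max rescaling described in the scoring-algorithm section, not Mikado's own code; the function and argument names are invented for the example.

```python
def rescale(values, rescaling, target=None, multiplier=1.0):
    """Rescale one metric across the transcripts of a locus to [0, 1]
    and apply the multiplier. `values` holds one value per transcript."""
    if rescaling == "target":
        # Closest to the target scores 1, furthest scores 0.
        distances = [abs(v - target) for v in values]
        span = max(distances)
        if span == 0:
            scores = [1.0] * len(values)
        else:
            scores = [1 - d / span for d in distances]
    else:
        lo, hi = min(values), max(values)
        if hi == lo:
            # All transcripts are equal: everybody gets full marks.
            scores = [1.0] * len(values)
        elif rescaling == "max":
            scores = [(v - lo) / (hi - lo) for v in values]
        else:  # "min": lowest value scores 1, highest scores 0
            scores = [(hi - v) / (hi - lo) for v in values]
    return [s * multiplier for s in scores]
```

With rescaling "max" and values [10, 20, 30], the transcripts score 0.0, 0.5 and 1.0 respectively; a negative multiplier, as in the non_verified_introns_num example above, turns the score into a penalty.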

Metrics

These are all the metrics available to quantify transcripts. The documentation for this section has been generated using the metrics utility.

Metrics belong to one of the following categories:

  • Descriptive: these metrics merely provide a description of the transcript (eg its ID) and are not used for scoring.
  • cDNA: these metrics refer to basic features of any transcript such as its number of exons, its cDNA length, etc.
  • Intron: these metrics refer to features related to the number of introns and their lengths.
  • CDS: these metrics refer to features related to the CDS assigned to the transcript.
  • UTR: these metrics refer to features related to the UTR of the transcript. In the case in which a transcript has been assigned multiple ORFs, unless otherwise stated the UTR metrics will be derived only considering the selected ORF, not the combination of all of them.
  • Locus: these metrics refer to features of the transcript in relationship to all other transcripts in its locus, eg how many of the introns present in the locus are present in the transcript. These metrics are calculated by Mikado during the picking phase, and as such their value can vary during the different stages as the transcripts are shifted to different groups.
  • External: these metrics are derived from accessory data that is recovered for the transcript during the run time. Examples include data regarding the number of introns confirmed by external programs such as Portcullis, or the BLAST score of the best hits.
  • Attributes: these metrics are extracted at runtime from attributes present in the input files. An example of this could be the TPM or FPKM values assigned to transcripts by rna expression analysis software.

Hint

Starting from version 1 beta8, Mikado allows the use of externally defined metrics for the transcripts. These can be accessed using the keyword “external.<name of the metric>” within the configuration file. See the relevant section for details.

Hint

Starting from version 2, Mikado allows the use of metrics defined in the attributes of the input files. These can be accessed using the keyword “attributes.<name of the metric>” within the scoring file. See the relevant section for details.

Important

Starting from Mikado 1 beta 8, it is possible to use metrics with values between 0 and 1 directly as scores, without rescaling. From Mikado 2, it is also possible to treat values that are between 0 and 100 in the same way; Mikado will automatically convert them internally to be between 0 and 1. It is also possible to instruct Mikado, during scoring, to treat certain values as percentages by adding percentage: true to the scoring parameter.

Metric name Description Category Data type Usable raw
tid ID of the transcript - cannot be an undefined value. Alias of id. Descriptive str False
parent Name of the parent feature of the transcript. Descriptive str False
score Numerical value which summarizes the reliability of the transcript. Descriptive str False
external_scores SPECIAL this Namespace contains all the information regarding external scores for the transcript. If an absent property is not defined in the Namespace, Mikado will set a default value of 0 into the Namespace and return it. External Namespace True
alias This property returns the alias of the transcript, if present, else its ID Descriptive str False
best_bits Metric that returns the best BitS associated with the transcript. External float False
blast_identity This metric will return the alignment identity for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. External float True
blast_query_coverage This metric will return the query coverage for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. External float True
blast_score Interchangeable alias for testing different blast-related scores. Current: best bit score. External float False
blast_target_coverage This metric will return the target coverage for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. External float True
canonical_intron_proportion This metric returns the proportion of canonical introns of the transcript on its total number of introns. Intron float True
cdna_length This property returns the length of the transcript. cDNA int False
cds_disrupted_by_ri This property describes whether the CDS is interrupted within a retained intron. Locus bool True
cds_not_maximal This property returns the length of the CDS excluded from the selected ORF. CDS int False
cds_not_maximal_fraction This property returns the fraction of bases not in the selected ORF compared to the total number of CDS bases in the cDNA. CDS float True
combined_cds_fraction This property returns the percentage of the CDS part of the transcript vs. the cDNA length CDS float True
combined_cds_intron_fraction This property returns the fraction of CDS introns of the transcript vs. the total number of CDS introns in the Locus. If the transcript is by itself, it returns 1. Locus float True
combined_cds_length This property returns the length of the CDS part of the transcript. CDS int False
combined_cds_locus_fraction This metric returns the fraction of CDS bases of the transcript vs. the total of CDS bases in the locus. Locus float True
combined_cds_num This property returns the number of non-overlapping CDS segments in the transcript. CDS int False
combined_cds_num_fraction This property returns the fraction of non-overlapping CDS segments in the transcript vs. the total number of exons CDS float True
combined_utr_fraction This property returns the fraction of the cDNA which is not coding according to any ORF. Complement of combined_cds_fraction UTR float True
combined_utr_length This property returns the length of the UTR part of the transcript. UTR int False
end_distance_from_junction This metric returns the cDNA distance between the stop codon and the last junction in the transcript. In many eukaryotes, this distance cannot exceed 50-55 bps otherwise the transcript becomes a target of NMD. If the transcript is not coding or there is no junction downstream of the stop codon, the metric returns 0. This metric considers the combined CDS end. CDS int False
end_distance_from_tes This property returns the distance of the end of the combined CDS from the transcript end site. If no CDS is defined, it defaults to 0. CDS int False
exon_fraction This property returns the fraction of exons of the transcript which are contained in the sublocus. If the transcript is by itself, it returns 1. Set from outside. Locus float True
exon_num This property returns the number of exons of the transcript. cDNA int False
five_utr_length Returns the length of the 5’ UTR of the selected ORF. UTR float False
five_utr_num This property returns the number of 5’ UTR segments for the selected ORF. UTR int False
five_utr_num_complete This property returns the number of 5’ UTR segments for the selected ORF, considering only those which are complete exons. UTR int False
has_start_codon Boolean. True if the selected ORF has a start codon. CDS bool False
has_stop_codon Boolean. True if the selected ORF has a stop codon. CDS bool False
highest_cds_exon_number This property returns the maximum number of CDS segments among the ORFs; this number can refer to an ORF DIFFERENT from the maximal ORF. CDS int False
highest_cds_exons_num Returns the number of CDS segments in the selected ORF (irrespective of the number of exons involved) CDS int False
intron_fraction This property returns the fraction of introns of the transcript vs. the total number of introns in the Locus. If the transcript is by itself, it returns 1. Set from outside. Locus float True
is_complete Boolean. True if the selected ORF has both start and end. CDS bool False
is_reference Checks whether the transcript has been marked as reference by Mikado prepare External bool False
max_exon_length This metric will return the length of the biggest exon in the transcript. cDNA int False
max_intron_length This property returns the greatest intron length for the transcript. Intron int False
min_exon_length This metric will return the length of the smallest exon in the transcript. cDNA int False
min_intron_length This property returns the smallest intron length for the transcript. Intron int False
non_verified_introns_num This metric returns the number of introns of the transcript which are not validated by external data. External int False
num_introns_greater_than_max This metric returns the number of introns greater than the maximum acceptable intron size indicated in the constructor. Intron int False
num_introns_smaller_than_min This metric returns the number of introns smaller than the mininum acceptable intron size indicated in the constructor. Intron int False
number_internal_orfs This property returns the number of ORFs inside a transcript. CDS int False
only_non_canonical_splicing This metric will return True if the canonical_number is 0 Intron bool False
original_source This property returns the original source assigned to the transcript (before Mikado assigns its own final source value). Descriptive str False
proportion_verified_introns This metric returns, as a fraction, how many of the transcript introns are validated by external data. External float True
proportion_verified_introns_inlocus This metric returns, as a fraction, how many of the verified introns inside the Locus are contained inside the transcript. In loci without any verified introns, this metric will be set to 1. Locus float True
retained_fraction This property returns the fraction of the cDNA which is contained in retained introns. Locus float True
retained_intron_num This property records the number of introns in the transcripts which are marked as being retained. See the corresponding method in the sublocus class. Locus int False
selected_cds_exons_fraction Returns the fraction of CDS segments in the selected ORF (irrespective of the number of exons involved) CDS float True
selected_cds_fraction This property calculates the fraction of the selected CDS vs. the cDNA length. CDS float True
selected_cds_intron_fraction This property returns the fraction of CDS introns of the selected ORF of the transcript vs. the total number of CDS introns in the Locus (considering only the selected ORF). If the transcript is by itself, it should return 1. CDS float True
selected_cds_length This property calculates the length of the CDS selected as best inside the cDNA. CDS int False
selected_cds_locus_fraction This metric returns the fraction of CDS bases of the transcript vs. the total of CDS bases in the locus. Locus float True
selected_cds_num This property calculates the number of CDS exons for the selected ORF CDS int False
selected_cds_number_fraction This property returns the proportion of best possible CDS segments vs. the number of exons. See selected_cds_number. CDS float False
selected_end_distance_from_junction This metric returns the distance between the stop codon and the last junction of the transcript. In many eukaryotes, this distance cannot exceed 50-55 bps, otherwise the transcript becomes a target of NMD. If the transcript is not coding or there is no junction downstream of the stop codon, the metric returns 0. CDS int False
selected_end_distance_from_tes This property returns the distance of the end of the best CDS from the transcript end site. If no CDS is defined, it defaults to 0. CDS int False
selected_start_distance_from_tss This property returns the distance of the start of the best CDS from the transcript start site. If no CDS is defined, it defaults to 0. CDS int False
snowy_blast_score Metric that indicates how good a hit is compared to the competition, in terms of BLAST similarities. As in SnowyOwl, the score for each hit is calculated by taking the coverage of the target and dividing it by (2 * len(self.blast_hits)). IMPORTANT: when splitting transcripts by ORF, a blast hit is added to the new transcript only if it is contained within the new transcript. This WILL screw up a bit the homology score. External float False
source_score This metric returns a score that is assigned to the transcript in virtue of its origin. External float False
start_distance_from_tss This property returns the distance of the start of the combined CDS from the transcript start site. If no CDS is defined, it defaults to 0. CDS int False
suspicious_splicing This metric will return True if the transcript either has canonical introns on both strands (probably a chimeric artifact between two neighbouring loci), or if it has no canonical splicing event but would have if it were assigned to the opposite strand (probably a strand misassignment on the part of the assembler/predictor). Intron bool False
three_utr_length Returns the length of the 3’ UTR of the selected ORF. UTR int False
three_utr_num This property returns the number of 3’ UTR segments (referred to the selected ORF). UTR int False
three_utr_num_complete This property returns the number of 3’ UTR segments for the selected ORF, considering only those which are complete exons. UTR int False
utr_fraction This property calculates the length of the UTR of the selected ORF vs. the cDNA length. UTR float True
utr_length Returns the sum of the 5’+3’ UTR lengths UTR int False
utr_num Returns the number of UTR segments (referred to the selected ORF). UTR int False
utr_num_complete Returns the number of UTR segments which are complete exons (referred to the selected ORF). UTR int False
verified_introns_num This metric returns the number of introns of the transcript which are validated by external data. External int False
External metrics

Starting from version 1 beta 8, Mikado allows loading external metrics into the database, to be used for evaluating transcripts. Metrics of this kind must have a value comprised between 0 and 1. The file can be provided either by specifying it in the configuration file, under “serialise/files/external_scores”, or on the command line with the “--external-scores” parameter of mikado serialise. The external scores file should have the following format:

TID             Metric_one   Metric_two   ...   Metric_N
Transcript_one  value        value        ...   value
Transcript_two  value        value        ...   value
...
Transcript_N    value        value        ...   value

Please note the following:

  • the header is mandatory.
  • the metric names at the head of the table should not contain any spaces or special characters, apart from the underscore (_)
  • the header provides the name of the metric as will be seen by Mikado. As such, it is advised to choose sensible and informative names (e.g. “fraction_covered”) rather than uninformative ones (e.g. the “metric_one” from above)
  • Column names must be unique.
  • The transcript names present in the first column must be present in the FASTA file.
  • The table should be tab-separated.
  • Values can be of any numerical or boolean type. However, only values that are determined at serialisation to be comprised between 0 and 1 (inclusive), or between 0 and 100 (i.e. percentages), can be used as raw values.

A proper way of generating and using external scores would, therefore, be the following:

  • Run Mikado prepare on the input dataset.
  • Run all necessary supplementary analyses (ORF calling and/or homology analysis through DIAMOND or BLAST).
  • Run supplementary analyses to assess the transcripts, e.g. expression analysis. Normalise results so that they can be expressed with values between 0 and 1.
    • Please note that boolean results (e.g. presence or absence) can be expressed with 0 and 1 instead of “False” and “True”. Customarily, in Python 0 stands for False and 1 for True, but you can choose to switch the order if you so desire.
  • Aggregate all results in a text table, like the one above, tab separated.
  • Call mikado serialise, specifying the location of this table either through the configuration file or on the command line invocation.
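The normalisation and aggregation steps above can be sketched in Python as follows. This is an illustrative sketch only: the metric names (“fraction_covered”, “samples_expressed”), the sample count, and the transcript IDs are invented for the example and are not part of Mikado itself.

```python
import csv

# Hypothetical raw results keyed by transcript ID, e.g. from an
# expression/coverage analysis run on the prepared transcripts.
raw = {
    "transcript_one": {"fraction_covered": 95.0, "samples_expressed": 3},
    "transcript_two": {"fraction_covered": 40.0, "samples_expressed": 1},
}

n_samples = 4  # total number of samples in the hypothetical experiment


def normalise(results, n_samples):
    """Rescale every metric into the 0-1 range expected by Mikado."""
    table = {}
    for tid, metrics in results.items():
        table[tid] = {
            "fraction_covered": metrics["fraction_covered"] / 100.0,
            "samples_expressed": metrics["samples_expressed"] / n_samples,
        }
    return table


def write_table(table, handle):
    """Write the tab-separated table, header first, for mikado serialise."""
    metric_names = sorted(next(iter(table.values())))
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["TID"] + metric_names)
    for tid, values in table.items():
        writer.writerow([tid] + [values[name] for name in metric_names])
```

mikado serialise would then be pointed at the file produced by write_table, either through the configuration file or with the “--external-scores” option.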

Given the open-ended nature of the external scores, the Daijin pipeline currently does not offer any system to generate these scores. This might change in the future.

Note

Our laboratory has implemented a novel pipeline, Minos (https://github.com/EI-CoreBioinformatics/minos), for integrating multiple sources of evidence using Mikado. Internally, Minos makes use of the external metrics described here. We recommend having a look at the pipeline for inspiration.

Adding external scores to the scoring file

Once the external metrics have been properly loaded, it is necessary to tell Mikado how to use them. This requires modifying the scoring file itself. The header that we used in the table above provides the names of the metrics as Mikado will see them.

Let us say that we have performed an expression analysis on our transcripts, and we have created and loaded the following three metrics:

  • “fraction_covered”, i.e. the percentage of the transcript covered by at least X reads (where X is decided by the experimenter)
  • “samples_expressed”, i.e. the percentage of samples where the transcript was expressed over a certain threshold (e.g. 1 TPM)
  • “has_coverage_gaps”, i.e. a boolean metric that indicates whether there are windows within the transcript that are lowly covered or not covered at all (e.g. a 100bp stretch with no coverage between two highly covered regions, indicating a possible intron retention or chimera). For this example, a value of “0” indicates that there are no coverage gaps (i.e. it is False that there are gaps), and “1” the opposite (it is True that there are coverage gaps).

We can now use these metrics as normal, by invoking them as “external.” followed by the name of the metric: e.g., “external.fraction_covered”. So, for example, if we wished to prioritise transcripts that are expressed in the highest number of samples and are completely covered by RNA-Seq data upon read realignment, under “scoring” we can add the following:

scoring:
    # [ ... other metrics ... ]
    - external.samples_expressed: {rescaling: max}
    - external.fraction_covered: {rescaling: max}

And if we wanted to consider any primary transcript with coverage gaps as a potential fragment, under the “not_fragmentary” section we could do so:

not_fragmentary:
  expression:
    # other metrics ..
    - and (external.has_coverage_gaps)
    # Finished expression
  parameters:
    # other metrics ...
    external.has_coverage_gaps: {operator: eq, value: 0}  # Please note, again, that here "0" means "no coverage gaps detected".
    # other metrics ...

As external metrics allow Mikado to accept any arbitrary metric for each transcript, they allow the program to assess transcripts in any way the experimenter desires. However, currently we do not provide any way of automating the process.

Note

Also for external metrics, it is necessary to add a suffix to them if they are invoked more than once in an expression (see the tutorial). An invocation of e.g. “external.samples_expressed.mono” and “external.samples_expressed.multi”, to distinguish between monoexonic and multiexonic transcripts, would be perfectly valid and is actually required by Mikado. Notice the double use of the dot (“.”) as a separator. This usage is the reason why the dot cannot be present in the name of the metric itself (so, for example, “has.coverage.gaps” would be an invalid metric name).
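As a sketch, two suffixed invocations of the same external metric might appear in the scoring file as follows. The filters on exon_num are illustrative; consult the tutorial for the exact filter syntax expected by your version of Mikado:

```yaml
scoring:
    # [ ... other metrics ... ]
    - external.samples_expressed.mono: {rescaling: max, filter: {operator: eq, value: 1, metric: exon_num}}
    - external.samples_expressed.multi: {rescaling: max, filter: {operator: gt, value: 1, metric: exon_num}}
```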

Attributes metrics

Starting from version 2, Mikado allows the usage of metrics defined in the attributes of the input files. These metrics behave like the rest of the metrics, but they are gathered at runtime from the input datasets. It is important to note that these metrics must be equivalent across all the inputs, and that they are by default initialised to “0” when a transcript does not have an attribute defining the metric. The default initialisation value can be overridden in the scoring file.

Attribute metrics, along with the required rescaling parameter, can define an rtype parameter which will be used to cast the value of the attribute internally. Currently the types available are: integers (int), reals (float) and booleans (bool). The rtype defines how the value of the attribute will be represented and treated internally in Mikado. For instance, if an attribute has a value of “3.2” but its rtype is defined as int, the value will be cast from a real number to an integer, which in Python truncates it to “3” (as happens for any number between 3 and 4). If the rtype were float, the internal value for this attribute would be “3.2”; if it were bool, the internal value would be “True”, as is the case for any number not equal to “0”. Finally, a percentage boolean parameter indicates that the values are in the 0-100 range and enables a transformation to the 0-1 range, so that they can be used as ‘raw’ scores (see the scoring algorithm section).
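The casts described above can be mirrored with plain Python. This is an illustration of the semantics only, not Mikado’s internal code:

```python
# Interpret a single attribute value under each rtype, using plain
# Python casts to mirror the behaviour described above.
value = "3.2"  # attribute value as read from the GTF/GFF file

as_float = float(value)       # kept as the real number 3.2
as_int = int(float(value))    # truncated to the integer 3
as_bool = bool(float(value))  # True, as for any number not equal to 0

print(as_float, as_int, as_bool)
```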

An example for the usage of these metrics could be:

Chr5    Cufflinks       transcript      26581218        26583874        1000    -       .       gene_id "cufflinks_star_at.23551";transcript_id "cufflinks_star_at.23551.1";exon_number "1";FPKM "0.4343609420";conf_hi "0.577851";frac "0.751684";cov "11.982854";conf_lo "0.293994";percentage_score "42.42"
Chr5    Cufflinks       exon    26581218        26581528        .       -       .       gene_id "cufflinks_star_at.23551";transcript_id "cufflinks_star_at.23551.1";
Chr5    Cufflinks       exon    26583335        26583874        .       -       .       gene_id "cufflinks_star_at.23551";transcript_id "cufflinks_star_at.23551.1";

If the scoring file defines:

scoring:
    # [ ... other metrics ... ]
    - attributes.FPKM: {rescaling: max}
    - attributes.frac: {rescaling: max, use_raw: true}
    - attributes.percentage_score: {rescaling: max, use_raw: true, percentage: true}
    - attributes.cov: {rescaling: max, use_raw: true, multiplier: 1, rtype: int, default: 1}

The same scoring rules defined previously will apply to metrics obtained from the transcript’s attributes.
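To illustrate the “percentage: true” transformation on the GTF example above, the following sketch shows how an attribute string would be prepared as a raw score. The helper function is hypothetical, written for this illustration; it is not Mikado’s internal code:

```python
# Illustrative only: mimic how an attribute value could be prepared as a
# raw score when "percentage: true" is set in the scoring file.
def as_raw_score(value, percentage=False):
    """Cast an attribute string to float, optionally rescaling 0-100 to 0-1."""
    score = float(value)
    if percentage:
        score /= 100.0
    if not 0 <= score <= 1:
        raise ValueError(f"{score} cannot be used as a raw score")
    return score

print(as_raw_score("42.42", percentage=True))  # percentage_score from the GTF above
print(as_raw_score("0.751684"))                # frac from the GTF above
```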

References
[Mikado]Leveraging multiple transcriptome assembly methods for improved gene structure annotation Luca Venturini, Shabhonam Caim, Gemy George Kaithakottil, Daniel Lee Mapleson, David Swarbreck. GigaScience, 2018. doi: 10.1093/gigascience/giy093
[ParsEval]ParsEval: parallel comparison and analysis of gene structure annotations Daniel S Standage and Volker P Brendel. BMC Bioinformatics, 2012, doi:10.1186/1471-2105-13-187
[RGASP]Assessment of transcript reconstruction methods for RNA-seq Tamara Steijger, Josep F Abril, Pär G Engström, Felix Kokocinski, The RGASP Consortium, Tim J Hubbard, Roderic Guigó, Jennifer Harrow & Paul Bertone. Nature Methods, 2013, doi:10.1038/nmeth.2714
[SnowyOwl]SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models Ian Reid, Nicholas O’Toole, Omar Zabaneh, Reza Nourzadeh, Mahmoud Dahdouli, Mostafa Abdellateef, Paul MK Gordon, Jung Soh, Gregory Butler, Christoph W Sensen and Adrian Tsang. BMC Bioinformatics, 2014, doi:10.1186/1471-2105-15-229
[CuffMerge]Identification of novel transcripts in annotated genomes using RNA-Seq Adam Roberts, Harold Pimentel, Cole Trapnell and Lior Pachter. Bioinformatics, 2011, doi:10.1093/bioinformatics/btr355
[Class2]CLASS2: accurate and efficient splice variant annotation from RNA-seq reads Li Song, Sarven Sabunciyan and Liliana Florea. Nucleic Acids Research, 2016, doi:10.1093/nar/gkw158
[PyFaidx]Efficient “pythonic” access to FASTA files using pyfaidx Matthew D Shirley​, Zhaorong Ma, Brent S Pedersen and Sarah J Wheelan. PeerJ PrePrints 3:e1196, 2015. doi:10.7287/peerj.preprints.970v1
[Snake]Snakemake—a scalable bioinformatics workflow engine Johannes Köster and Sven Rahmann. Bioinformatics, 2012, doi:10.1093/bioinformatics/bts480
[Trinity]De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity Brian J Haas, et al. Nature Protocols, 2013. doi:10.1038/nprot.2013.084
[Blastplus]BLAST+: architecture and applications. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BMC Bioinformatics, 2009. doi:10.1186/1471-2105-10-421
[Diamond]Fast and sensitive protein alignment using DIAMOND B Buchfink, C Xie, D H Huson. Nature Methods 12, 59-60 (2015). doi:10.1038/nmeth.3176
[STAR]STAR: ultrafast universal RNA-seq aligner Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson and Thomas R. Gingeras. Bioinformatics, 2012. doi:10.1093/bioinformatics/bts635
[Hisat]HISAT: a fast spliced aligner with low memory requirements Daehwan Kim, Ben Langmead and Stevan L Salzberg. Nature Methods, 2015. doi:10.1038/nmeth.3317
[TopHat2]TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley and Steven L Salzberg. Genome Biology, 2013. doi:10.1186/gb-2013-14-4-r36
[StringTie]StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Mihaela Pertea, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell and Steven L Salzberg. Nature Biotechnology, 2015. doi:10.1038/nbt.3122
[GMAP]GMAP: a genomic mapping and alignment program for mRNA and EST sequences Thomas D. Wu and Colin K. Watanabe. Bioinformatics 2005. doi:10.1093/bioinformatics/bti310
[uORFs]uAUG and uORFs in human and rodent 5′untranslated mRNAs. Michele Iacono, Flavio Mignone and Graziano Pesole. Gene, 2005. doi:10.1016/j.gene.2004.11.041
[PyYaml]Pyyaml library K Simonov. http://pyyaml.org/, 2006.
[Cython]Cython: the best of both worlds. Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn and Kurt Smith. Computing in Science & Engineering, 2011. doi:10.1109/MCSE.2010.118
[Numpy]The NumPy array: A structure for efficient numerical computation. Stefan van der Walt, S. Chris Colbert and Gael Varoquaux. Computing in Science & Engineering, 2011. doi:10.1109/MCSE.2011.37
[Scipy]SciPy: Open Source Scientific Tools for Python. Eric Jones, Travis Oliphant, Pearu Peterson et al. http://www.scipy.org/, 2001.
[NetworkX]Exploring network structure, dynamics, and function using NetworkX Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart. Proceedings of the 7th Python in Science Conference (SciPy2008), 2008.
[BioPython]Biopython: freely available Python tools for computational molecular biology and bioinformatics. PJA Cock, T Antao, JT Chang, BA Chapman, CJ Cox, A Dalke, I Friedberg, T Hamelryck, F Kauff, B Wilczynski and MJL de Hoon. Bioinformatics, 2009. doi:10.1093/bioinformatics/btp163
[DRMAA]Distributed resource management application API Version 2 (DRMAA). P Tröger, R Brobst, D Gruber, M Mamonski and D Templeton. Open Grid Forum, 2012.
[Apollo]Web Apollo: a web-based genomic annotation editing platform Eduardo Lee, Gregg A Helt, Justin T Reese, Monica C Munoz-Torres, Chris P Childers, Robert M Buels, Lincoln Stein, Ian H Holmes, Christine G Elsik and Suzanna E Lewis. Genome Biology, 2013, doi:10.1186/gb-2013-14-8-r93
[Rampart]RAMPART: a workflow management system for de novo genome assembly, Daniel Mapleson, Nizar Drou and David Swarbreck. Bioinformatics, 2015. doi:10.1093/bioinformatics/btv056
[Uniprot]UniProt: a hub for protein information The UniProt Consortium. Nucleic Acids Research, 2014. doi:10.1093/nar/gku989
[Portcullis]Efficient and accurate detection of splice junctions from RNA-seq with Portcullis Daniel Lee Mapleson, Luca Venturini, Gemy George Kaithakottil, David Swarbreck. GigaScience, 2018. doi: 10.1093/gigascience/giy131
[Oases]Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels Marcel H Schulz, Daniel R Zerbino, Martin Vingron and Ewan Birney. Bioinformatics, 2012. doi: 10.1093/bioinformatics/bts094
[Bridger]Bridger: a new framework for de novo transcriptome assembly using RNA-seq data Zheng Chang, Guojun Li, Juntao Liu, Cody Ashby, Deli Liu, Carole L Cramer and Xiuzhen Huang. Genome Biology, 2015. doi:10.1186/s13059-015-0596-2
[EviGeneTobacco]Combining Transcriptome Assemblies from Multiple De Novo Assemblers in the Allo-Tetraploid Plant Nicotiana benthamiana Kenlee Nakasugi, Ross Crowhurst, Julia Bally, Peter Waterhouse. PLoS ONE, 2014. doi:10.1371/journal.pone.0091776
[TransAbyss]De novo assembly and analysis of RNA-seq data Gordon Robertson et al., Nature Methods, 2010. doi:10.1038/nmeth.1517
[Transrate]TransRate: reference-free quality assessment of de novo transcriptome assemblies Richard Smith-Unna, Chris Boursnell, Rob Patro, Julian M. Hibberd and Steven Kelly. Genome Research, 2016. doi:10.1101/gr.196469.115
[RSEMeval]Evaluation of de novo transcriptome assemblies from RNA-Seq data Bo Li, Nathanael Fillmore, Yongsheng Bai, Mike Collins, James A Thomson, Ron Stewart and Colin N Dewey. Genome Biology, 2014. doi:10.1186/s13059-014-0553-5
[Maker]MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects Carson Holt and Mark Yandell. BMC Bioinformatics, 2011. doi:10.1186/1471-2105-12-491
[EVM]Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments Brian J Haas, Steven L Salzberg, Wei Zhu, Mihaela Pertea, Jonathan E Allen, Joshua Orvis, Owen White, C Robin Buell and Jennifer R Wortman. Genome Biology, 2008. doi:10.1186/gb-2008-9-1-r7
[Augustus]WebAUGUSTUS—a web service for training AUGUSTUS and predicting genes in eukaryotes Katharina J. Hoff and Mario Stanke. Nucleic Acids Research, 2013. doi:10.1093/nar/gkt418
[Maker2]MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects Carson Holt and Mark Yandell. BMC Bioinformatics, 2011. doi:10.1186/1471-2105-12-491
[AraPort]Araport11: a complete reannotation of the Arabidopsis thaliana reference genome Chia‐Yi Cheng, Vivek Krishnakumar, Agnes P. Chan, Françoise Thibaud‐Nissen, Seth Schobel and Christopher D. Town. The Plant Journal, 2017, Volume 89, Issue 4, 789-804. doi: 10.1111/tpj.13415
[PYinterval]https://github.com/chaimleib/intervaltree
[BXPython]https://bitbucket.org/james_taylor/bx-python/overview
[Snakeviz]https://jiffyclub.github.io/snakeviz/
[PASA]Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. Nucleic Acids Res, 2003, 31, 5654-5666. doi:10.1093/nar/gkg770
[GffRead]GFF Utilities: GffRead and GffCompare Pertea, G. and Pertea, M. F1000, 2020, 9, ISCB Comm J-304. doi:10.12688/f1000research.23297.2
[Prodigal]Prodigal: prokaryotic gene recognition and translation initiation site identification. Hyatt, D., Chen, GL., LoCascio, P.F. et al. BMC Bioinformatics 11, 119 (2010). doi:10.1186/1471-2105-11-119
[TransDecoder]https://github.com/TransDecoder/TransDecoder/wiki
[Pandas]pandas-dev/pandas: Pandas, The pandas development team. Zenodo, March 2021. doi:10.5281/zenodo.4572994
[PySAM]https://github.com/pysam-developers/pysam