
Mikado: pick your transcript


Authors: Venturini Luca, Caim Shabhonam, Mapleson Daniel, Kaithakottil Gemy George, Swarbreck David
Version: 1.0.1 (April 2017)

Mikado is a lightweight Python3 pipeline to identify the most useful or “best” set of transcripts from multiple transcript assemblies. Our approach leverages transcript assemblies generated by multiple methods to define expressed loci, assign a representative transcript and return a set of gene models, selecting against transcripts that are chimeric, fragmented or have a short or disrupted CDS. Loci are first defined based on overlap criteria, and each transcript therein is scored based on up to 50 available metrics relating to ORF and cDNA size, relative position of the ORF within the transcript, UTR length and presence of multiple ORFs. Mikado can also utilise BLAST data to score transcripts based on protein similarity and to identify and split chimeric transcripts. Optionally, junction confidence data as provided by Portcullis [Portcullis] can be used to improve the assessment. The best-scoring transcripts are selected as the primary transcripts of their respective gene loci; additionally, Mikado can bring back other valid splice variants that are compatible with the primary isoform.

Mikado uses GTF or GFF files as mandatory input. Non-mandatory but highly recommended input data can be generated by obtaining a set of reliable splicing junctions with Portcullis, by locating coding ORFs on the transcripts using Transdecoder, and by obtaining homology information through BLASTX [Blastplus].

Our approach can also incorporate sequences generated by de novo Illumina assemblers or reads generated with long-read technologies such as PacBio.

Our tool was presented at Genome Science 2016, both as a poster and in a talk during the Bioinformatics showcase session.

Citing

We are currently working on our paper, and we will be releasing a pre-print shortly. In the meantime, if you use Mikado please reference our github page: https://github.com/lucventurini/mikado

Availability and License

Open source code available on github: https://github.com/lucventurini/mikado

This documentation is hosted publicly on Read the Docs: https://mikado.readthedocs.org/en/latest/

Mikado is available under the GNU LGPL v3.

Acknowledgements

Mikado has greatly benefitted from publicly available libraries, in particular the [NetworkX] library, Scipy and Numpy ([Scipy], [Numpy]), Intervaltree [PYinterval], and the BX library for its Cython implementation of interval trees. Mikado has also been constantly profiled and optimised using Snakeviz, a tool which proved invaluable during the development process.

Credits

  • Luca Venturini (The software architect and developer)
  • Shabhonam Caim (Primary tester and analytic consultancy)
  • Daniel Mapleson (Developer of Portcullis and of the Daijin pipeline)
  • Gemy Kaithakottil (Tester and analytic consultancy)
  • David Swarbreck (Annotation guru and ideator of the pipeline)

Contents

Introduction


Numerous algorithms have been proposed to analyse RNA-Seq data, both for aligning the reads to a reference genome ([TopHat2], [STAR], [Hisat]) and for assembling them to infer the sequence and structure of the original molecules present in the sample. The latter phase can be performed either by using alignment data against the reference genome ([Cufflinks], [StringTie], [Class2], [Trinity]) or in the absence of such information ([Trinity], [Oases], [Bridger]). Each of these methods has to contend with numerous sources of variability in the RNA-Seq data:
  • alternative splicing events at the same locus
  • extremely variable expression and therefore sequencing depth at different loci
  • presence of orthologous genes with very similar cDNAs.
  • tandem duplicated genes which can be artefactually reconstructed as a single, fused entity

Multiple assessments of these methods have been performed on real and simulated data. In particular, the RGASP consortium promoted a competition among some of the most popular methods in 2012 [RGASP]; the results showed that the accuracy of the methods was dependent on the input data, the expression levels of the genes, and the species under analysis. A different concern regards the completeness of each of the assembly methods, i.e. whether methods that are not necessarily the best across all gene loci might nonetheless outperform other competitors for specific contigs. Recently, approaches have been proposed to integrate multiple RNA-Seq assemblies by inferring their protein-coding content and collapsing them accordingly [EviGeneTobacco], or to determine the best proposed assembly using multiple measures related to how the transcript compares to the input RNA-Seq reads ([Transrate], [RSEMeval]).

A different but related line of research has focused on how to integrate data coming from multiple samples. While some researchers advocate merging the input data and assembling it together [Trinity], others have developed methods to integrate multiple assemblies into a single coherent annotation ([CuffMerge], Stringtie-merge). Another common approach relies on a third-party tool to integrate multiple assemblies together with other data - typically protein alignments and ab initio predictions - to select the best model. The latter approach has been chosen by some of the most popular pipelines for genome annotation in recent years ([EVM], [Maker]).

Our tool, Mikado, contributes to this area of research by proposing a novel algorithm to integrate multiple transcript assemblies, leveraging additional data regarding the position of ORFs (generally derived with Transdecoder [Trinity]), sequence similarity (derived with BLAST [Blastplus]) and reliable splicing junctions (generally derived using Portcullis [Portcullis]). By combining these input data sources, Mikado can identify artefacts - especially gene fusions and small fragments - and retrieve the original transcripts in each locus with high accuracy and sensitivity.

An important caveat is that our approach does not look for the best candidate in terms of the input data, as Transrate [Transrate] or RSEM-Eval [RSEMeval] do. Rather, our approach is more similar to EvidentialGene [EviGeneTobacco] as Mikado will try to find the candidates most likely to represent the isoforms which would be annotated as canonical by a human annotator. Biological events such as intron retentions or trans-splicing that might be present in the sample are explicitly selected against, in order to provide a clean annotation of the genome.

Overview

Mikado analyses an ensemble of RNA-Seq assemblies by looking for overlapping transcripts based on their genomic position. Such clusters of genes, or superloci, are then analysed to locate the gene loci that will form the final annotation. In order, the steps of the pipeline are as follows:

  1. In the first step, mikado prepare will combine assemblies from multiple sources into a single coherent, filtered dataset.
  2. In the second step, mikado serialise will collect data from multiple sources (Portcullis, Transdecoder, BLASTX) and load it into a database for fast integration with the input data.
  3. In the final step, mikado pick will integrate the prepared transcripts with the serialised additional data to choose the best transcripts.
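
The initial clustering into superloci is based on genomic overlap. As a toy illustration (this is not Mikado's actual implementation, which also accounts for strand and flanking distances), overlapping intervals can be merged like so:

```python
# Toy sketch of overlap-based clustering (NOT Mikado's actual algorithm):
# transcripts whose genomic intervals overlap end up in the same cluster.
def cluster_by_overlap(intervals):
    """Merge (start, end) intervals into runs of mutual overlap."""
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current cluster: extend its right boundary.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [tuple(cluster) for cluster in clusters]

# Two overlapping transcripts and one distant one yield two clusters.
print(cluster_by_overlap([(100, 500), (450, 900), (2000, 2500)]))
# [(100, 900), (2000, 2500)]
```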

By far, the most important source of data for Mikado is Transdecoder, as the algorithms for finding and splitting chimeras depend on the detection of the ORFs on the transcript sequences. Please refer to the Algorithms section for more details on how Mikado operates.
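
ORFs are communicated to Mikado as BED12 records in transcript coordinates, where the "thick" columns delimit the coding region. A minimal, purely illustrative parse of one such record (the record itself is made up):

```python
# Made-up BED12 ORF record in transcript coordinates (illustrative only);
# the columns follow the standard BED12 layout.
record = "transcript_1\t0\t1500\tORF_1\t0\t+\t120\t1020\t0\t1\t1500\t0"
fields = record.split("\t")

orf = {
    "transcript": fields[0],        # name of the parent transcript
    "thick_start": int(fields[6]),  # CDS start within the transcript
    "thick_end": int(fields[7]),    # CDS end within the transcript
}
cds_length = orf["thick_end"] - orf["thick_start"]
print(orf["transcript"], cds_length)  # transcript_1 900
```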

Using these multiple data sources, and its distinctive iterative method to identify and disentangle gene loci, Mikado is capable of bringing together methods with very different results, and of interpreting the composition of a locus similarly to how a manual annotator would. This leads to cleaned RNA-Seq annotations that can be used as the basis for the downstream analysis of choice, such as a hint-based ab initio annotation with Augustus [Augustus] or Maker2 [Maker].

Mikado in action: recovering fused gene loci

_images/locus_example.jpeg

Using data from the ORFs, Mikado (in red) is capable of identifying and breaking the artefactual gene fusions produced by the assemblers (in green) in this locus. This allows Mikado to report the correct gene loci, even though no single method retrieved all the loci in isolation and some of the correct genes were not present at all. Coding sections of Mikado models are in dark red, UTR segments in pale red. The reference annotation is coloured in blue, with the same colour scheme - dark for coding regions, pale for UTRs.

The command-line interface

The Mikado suite provides two commands: mikado and daijin. The former provides access to the main functionalities of the suite:

$ mikado --help
usage: Mikado [-h] {configure,prepare,serialise,pick,compare,util} ...

Mikado is a program to analyse RNA-Seq data and determine the best transcript
for each locus in accordance to user-specified criteria.

optional arguments:
  -h, --help            show this help message and exit

Components:
  {configure,prepare,serialise,pick,compare,util}
                        These are the various components of Mikado:
    configure           This utility guides the user through the process of
                        creating a configuration file for Mikado.
    prepare             Mikado prepare analyses an input GTF file and prepares
                        it for the picking analysis by sorting its transcripts
                        and performing some simple consistency checks.
    serialise           Mikado serialise creates the database used by the pick
                        program. It handles Junction and ORF BED12 files as
                        well as BLAST XML results.
    pick                Mikado pick analyses a sorted GTF/GFF files in order
                        to identify its loci and choose the best transcripts
                        according to user-specified criteria. It is dependent
                        on files produced by the "prepare" and "serialise"
                        components.
    compare             Mikado compare produces a detailed comparison of
                        reference and prediction files. It has been directly
                        inspired by Cufflinks's cuffcompare and ParsEval.
    util                Miscellaneous utilities

Each of these subcommands is explained in detail in the Usage section.

daijin instead provides the interface to the Daijin pipeline manager, which handles the workflow from a dataset of multiple reads to the final Mikado picking. This is its interface:

$ daijin --help

usage: A Directed Acyclic pipeline for gene model reconstruction from RNA seq data.
        Basically, a pipeline for driving Mikado. It will first align RNAseq reads against
        a genome using multiple tools, then creates transcript assemblies using multiple tools,
        and find junctions in the alignments using Portcullis.
        This input is then passed into Mikado.
       [-h] {configure,assemble,mikado} ...

optional arguments:
  -h, --help            show this help message and exit

Pipelines:
  {configure,assemble,mikado}
                        These are the pipelines that can be executed via
                        daijin.
    configure           Creates the configuration files for Daijin execution.
    assemble            A pipeline that generates a variety of transcript
                        assemblies using various aligners and assemblers, as
                        well a producing a configuration file suitable for
                        driving Mikado.
    mikado              Using a supplied configuration file that describes all
                        input assemblies to use, it runs the Mikado pipeline,
                        including prepare, BLAST, transdecoder, serialise and
                        pick.

Installation

System requirements

Mikado requires CPython 3.4 or later to function (Python 2 is not supported). Additionally, it requires a functioning installation of one of SQLite, PostgreSQL or MySQL. Mikado has the following Python dependencies:

wheel>=0.28.0
pyyaml
jsonschema
cython>=0.25
numpy
networkx>=1.10
sqlalchemy>=1
sqlalchemy_utils
biopython>=1.66
intervaltree
nose
pyfaidx
scikit-learn>=0.17.0
scipy>=0.15.0
frozendict
python-magic
drmaa
snakemake
docutils!=0.13.1
tabulate
ujson

Mikado has modest system requirements, being capable of analysing human data with less than 4GB of RAM in all stages. It benefits from the availability of multiple processors, as many of the steps of the pipeline are parallelised.

Mikado is at its core a data integrator. The Daijin pipeline has been written to facilitate the creation of a functioning workflow. To function, it requires Snakemake [Snake] and the presence of at least one supported RNA-Seq aligner and one supported transcript assembler. If you are planning to execute it on a cluster, we do support job submission on SLURM, LSF and PBS clusters, either in the presence or absence of DRMAA.

Download

Mikado is available on PyPI, so it is possible to install it with

pip3 install mikado

The source for the latest release on PyPI can be obtained with

pip3 download mikado

As the package contains some core Cython components, it might be necessary to download and compile the source code instead of relying on the wheel.

Alternatively, Mikado can be installed from source by obtaining it from our GitHub repository. Either download the latest release or download the latest stable development snapshot with

git clone https://github.com/lucventurini/mikado.git; cd mikado

If you desire, the unstable development version can be obtained with the command

git checkout development

in the Git directory. Please note that at the time of this writing development proceeds quite rapidly.

Building and installing from source

If you wish to install Mikado from source, this can be achieved with

python3 setup.py bdist_wheel

Followed by

pip3 install dist/*whl

Note

If you want to update an existing installation of Mikado, the command to execute is pip3 install -U dist/*whl

Testing the installed module

It is possible to test whether Mikado has been built successfully by opening a Python3 interactive session and typing:

>>> import Mikado
>>> Mikado.test()

This will run all the tests included in the suite. Although code coverage is not perfect yet, it is at 70% for the whole package and over 80% for the core components.

Tutorial

This tutorial will guide you through a simple analysis with Mikado, using a small amount of data from an experiment on Arabidopsis thaliana. RNA-Seq data was obtained from study PRJEB7093 on ENA, aligned with STAR [STAR] against the TAIR10 reference genome, and assembled with four different programs. For this small example, we are going to focus on a small genomic region: Chr5, from 26,575,364 to 26,614,205.

During this analysis, you will require the following files:

All of this data can also be found in the sample_data directory of the Mikado source.

You will also require the following software:

  • a functioning installation of SQLite.
  • a functioning version of BLAST+ [Blastplus].
Creating the configuration file for Mikado

In the first step, we need to create a configuration file to drive Mikado. To do so, we will first create a tab-delimited file describing our assemblies (class.gtf, cufflinks.gtf, stringtie.gtf, trinity.gff3):

class.gtf   cl      True
cufflinks.gtf       cuff    True
stringtie.gtf       st      True    1
trinity.gff3        tr      False   -0.5

In this file, the fields define the following:

  1. The file location and name (if no folder is specified, Mikado will look for each file in the current working directory)
  2. An alias associated with the file, which has to be unique
  3. A binary flag (True / False) indicating whether the assembly is strand-specific or not
  4. An optional fourth field which defines a score adjustment for the sample. All transcripts associated with the label will have their score corrected by the value in this field. E.g., in this example all Stringtie models will receive an additional point, and all Trinity models will be penalised by half a point. Class and Cufflinks models have no bonus or malus associated with them.
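
As a sketch (this is not Mikado's own parser), the list file above can be read like this, recording the optional score adjustment:

```python
import csv
import io

# Sketch of parsing the tab-delimited list file (not Mikado's own code).
# The text mirrors the example above; a missing fourth column means
# a score adjustment of 0.
list_text = (
    "class.gtf\tcl\tTrue\n"
    "cufflinks.gtf\tcuff\tTrue\n"
    "stringtie.gtf\tst\tTrue\t1\n"
    "trinity.gff3\ttr\tFalse\t-0.5\n"
)

samples = {}
for row in csv.reader(io.StringIO(list_text), delimiter="\t"):
    path, alias, stranded = row[0], row[1], row[2] == "True"
    adjustment = float(row[3]) if len(row) > 3 else 0.0
    samples[alias] = {"path": path, "stranded": stranded, "score": adjustment}

print(samples["st"]["score"], samples["tr"]["score"], samples["cl"]["score"])
# 1.0 -0.5 0.0
```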

Then, we will decompress the chromosome FASTA file:

gunzip chr5.fas.gz  # This will create the file chr5.fas

Finally, we will create the configuration file itself using mikado configure:

mikado configure --list list.txt --reference chr5.fas --mode permissive --scoring plants.yaml  --copy-scoring plants.yaml --junctions junctions.bed -bt uniprot_sprot_plants.fasta configuration.yaml

This will create a configuration.yaml file with the parameters that were specified on the command line. This is a simplified configuration file, containing all the parameters necessary for the Mikado run. It will also copy the plants.yaml file from the Mikado installation to your current working directory. The command-line parameters instruct Mikado as follows:

  • --list list.txt: instructs Mikado to read the file we just created to understand where the input files are and how to treat them.
  • --reference chr5.fas: indicates the location of the genome file.
  • --mode permissive: the mode in which Mikado will treat cases of chimeras. See the documentation for details.
  • --junctions junctions.bed: instructs Mikado to consider this file as the source of reliable splicing junctions.
  • -bt uniprot_sprot_plants.fasta: indicates the BLAST database which will be used for deriving homology information.

Hint

The --copy-scoring argument is usually not necessary; however, it allows you to easily inspect the scoring file we are going to use during this run.

Mikado prepare

The subsequent step involves running mikado prepare to create a sorted, non-redundant GTF with all the input assemblies. As we have already created a configuration file with all the details regarding the input files, this will require us only to issue the command:

mikado prepare --json-conf configuration.yaml

This command will create three files:

  1. mikado_prepared.gtf: one of the two main output files. This is a sorted, non-redundant GTF containing the transcripts from the four input GTFs

  2. mikado_prepared.fasta: a FASTA file of the transcripts present in mikado_prepared.gtf.

  3. prepare.log: the log of this step. This should look like the following, minus the timestamps:

    2016-08-10 13:53:58,443 - prepare - prepare.py:67 - INFO - setup - MainProcess - Command line: /usr/users/ga002/venturil/py351/bin/mikado prepare --json-conf configuration.yaml
    2016-08-10 13:53:58,967 - prepare - prepare.py:206 - INFO - perform_check - MainProcess - Finished to analyse 95 transcripts (93 retained)
    2016-08-10 13:53:58,967 - prepare - prepare.py:405 - INFO - prepare - MainProcess - Finished
    

At the end of this phase, you should have 93 candidate transcripts, as 2 were redundant.

BLAST of the candidate transcripts

Although it is not strictly necessary, Mikado benefits from integrating homology data from BLAST. Mikado requires this data to be provided either in XML or ASN format (in the latter case, blast_formatter will be used to convert it in-memory to XML).

To create this file, we will proceed as follows:

  1. Uncompress the SwissProt database:

    gzip -dc uniprot_sprot_plants.fasta.gz > uniprot_sprot_plants.fasta
    
  2. Prepare the database for the BLAST:

    makeblastdb -in uniprot_sprot_plants.fasta -dbtype prot -parse_seqids > blast_prepare.log
    
  3. Execute the BLAST, asking for XML output, and compress it to limit space usage.

    blastx -max_target_seqs 5 -num_threads 10 -query mikado_prepared.fasta -outfmt 5 -db uniprot_sprot_plants.fasta -evalue 0.000001 2> blast.log | sed '/^$/d' | gzip -c - > mikado.blast.xml.gz
    

This will produce the mikado.blast.xml.gz file, which contains the homology information for the run.

Mikado serialise

This step involves running mikado serialise to create a SQLite database with all the information that mikado needs to perform its analysis. As most of the parameters are already specified inside the configuration file, the command line is quite simple:

mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz --orfs mikado.bed --blast_targets uniprot_sprot_plants.fasta

After mikado serialise has run, it will have created two files:

  1. mikado.db, the SQLite database that will be used by pick to perform its analysis.
  2. serialise.log, the log of the run.

If you inspect the SQLite database mikado.db, you will see it contains seven different tables:

$ sqlite3 mikado.db
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
chrom      hit        hsp        junctions  orf        query      target

These tables contain the information coming, in order, from the genome FAI, the BLAST XML, the junctions BED file, the ORFs BED file, and finally the input transcripts and the proteins. For more details on the database structure, please refer to the section on this step in this documentation.
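
The same table listing can be obtained programmatically; the query below works on any SQLite database, demonstrated here on a throwaway in-memory example rather than on mikado.db itself:

```python
import sqlite3

# List the tables of an SQLite database via sqlite_master; shown on an
# in-memory database with two dummy tables, but the identical query
# works on mikado.db.
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE junctions (id INT); CREATE TABLE orf (id INT);")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['junctions', 'orf']
```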

Mikado pick

Finally, during this step mikado pick will integrate the data present in the database with the positional and structural data present in the GTF file to select the best transcript models. The command line to be issued is the following:

mikado pick --json-conf configuration.yaml --subloci_out mikado.subloci.gff3

At this step, we have to specify only some parameters for pick to function:

  • json-conf: the configuration file. This is the only compulsory option.
  • subloci_out: the partial results concerning the subloci step during the selection process will be written to mikado.subloci.gff3.

mikado pick will produce the following output files:

  • mikado.loci.gff3, mikado.loci.metrics.tsv, mikado.loci.scores.tsv: the proper output files. These contain the location of the selected transcripts, their metrics, and their scores. Please see this section for details.
  • mikado.subloci.gff3, mikado.subloci.metrics.tsv, mikado.subloci.scores.tsv: these files contain the same type of information as those above, but for the subloci stage. As such, all the transcripts in the input files are represented, not just those that are going to be selected as the best.
  • mikado_pick.log: the log file for this operation.
Comparing files with the reference

Finally, we can compare our outputs to the original reference annotation and see how well they fare. To do so, we will use Mikado compare. The first step is to index the reference annotation, to make the comparisons faster:

mikado compare -r reference.gff3 --index

This will create a new file, reference.gff3.midx. If you inspect it with e.g. zless, you will notice that it is in essence a compressed JSON file describing the locations and components of each gene in the annotation. Now that we have indexed the reference, we can perform the comparisons we are interested in:
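
Since the .midx index is essentially gzip-compressed JSON, it can also be read programmatically; the sketch below uses a toy payload, as the structure of the real index is not reproduced here:

```python
import gzip
import json

# Round-trip a toy gzip-compressed JSON file, the same container format
# as the .midx index (the payload below is invented for illustration).
with gzip.open("toy.midx", "wt") as handle:
    json.dump({"GeneA": {"chrom": "Chr5"}}, handle)

with gzip.open("toy.midx", "rt") as handle:
    index = json.load(handle)

print(index["GeneA"]["chrom"])  # Chr5
```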

  1. Reference vs. the input transcripts:
     mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log;
  2. Reference vs. the subloci stage:
     mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log;
  3. Reference vs. the final output:
     mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log;

Each of these comparisons will produce three files:

  • a tmap file, detailing the best match in the reference for each of the query transcripts;
  • a refmap file, detailing the best match among the query transcripts for each of the reference transcripts;
  • a stats file, summarising the comparisons.

The stats file for the input GTF should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log
7 reference RNAs in 5 genes
93 predicted RNAs in  64 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 95.97  29.39  45.00
            Exon level (stringent): 68.09  18.60  29.22
              Exon level (lenient): 90.91  31.25  46.51
                      Intron level: 94.74  45.57  61.54
                Intron chain level: 16.67  1.59  2.90
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 14.29  1.08  2.00
  Transcript level (>=80% base F1): 14.29  1.08  2.00
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 20.00  1.56  2.90
        Gene level (>=80% base F1): 20.00  1.56  2.90

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 1
             Matched intron chains: 1
   Matching monoexonic transcripts: 0
    Matched monoexonic transcripts: 0
        Total matching transcripts: 1
         Total matched transcripts: 1

          Missed exons (stringent): 15/47  (31.91%)
           Novel exons (stringent): 140/172  (81.40%)
            Missed exons (lenient): 4/44  (9.09%)
             Novel exons (lenient): 88/128  (68.75%)
                    Missed introns: 2/38  (5.26%)
                     Novel introns: 43/79  (54.43%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 24/93  (25.81%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 21/64  (32.81%)
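
In these reports, F1 is the harmonic mean of sensitivity (Sn) and precision (Pr); the base-level figures above can be verified with a couple of lines of Python:

```python
# F1 as the harmonic mean of sensitivity and precision.
def f1(sn, pr):
    return 2 * sn * pr / (sn + pr) if (sn + pr) else 0.0

# Base level above: Sn 95.97, Pr 29.39 -> F1 ~ 45.00 (up to rounding).
print(round(f1(95.97, 29.39), 1))  # 45.0
```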

For the subloci file, where we still have all the transcripts but we have split obvious chimeras, it should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log
7 reference RNAs in 5 genes
105 predicted RNAs in  26 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 95.96  29.24  44.83
            Exon level (stringent): 70.21  19.08  30.00
              Exon level (lenient): 88.89  32.00  47.06
                      Intron level: 94.74  46.75  62.61
                Intron chain level: 33.33  3.17  5.80
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 28.57  9.52  14.29
  Transcript level (>=80% base F1): 42.86  11.43  18.05
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 40.00  7.69  12.90
        Gene level (>=80% base F1): 60.00  11.54  19.35

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 3
             Matched intron chains: 2
   Matching monoexonic transcripts: 9
    Matched monoexonic transcripts: 1
        Total matching transcripts: 12
         Total matched transcripts: 3

          Missed exons (stringent): 14/47  (29.79%)
           Novel exons (stringent): 140/173  (80.92%)
            Missed exons (lenient): 5/45  (11.11%)
             Novel exons (lenient): 85/125  (68.00%)
                    Missed introns: 2/38  (5.26%)
                     Novel introns: 41/77  (53.25%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 24/105  (22.86%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 13/26  (50.00%)

A marked improvement can already be seen - we now have 105 transcripts instead of 93, and the total number of matched reference transcripts has increased from 1 to 3. Precision is still poor, however, as we have not discarded any transcript yet. Moreover, there is redundancy - 9 transcripts match the same monoexonic gene, and 3 transcripts match 2 intron chains in the reference. Finally, the comparison against the proper output (mikado.loci.gff3) should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log
7 reference RNAs in 5 genes
15 predicted RNAs in  8 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 85.74  64.73  73.77
            Exon level (stringent): 63.83  42.86  51.28
              Exon level (lenient): 80.00  52.94  63.72
                      Intron level: 89.47  59.65  71.58
                Intron chain level: 33.33  14.29  20.00
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 28.57  13.33  18.18
  Transcript level (>=80% base F1): 42.86  20.00  27.27
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 40.00  25.00  30.77
        Gene level (>=80% base F1): 60.00  37.50  46.15

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 2
             Matched intron chains: 2
   Matching monoexonic transcripts: 1
    Matched monoexonic transcripts: 1
        Total matching transcripts: 3
         Total matched transcripts: 3

          Missed exons (stringent): 17/47  (36.17%)
           Novel exons (stringent): 40/70  (57.14%)
            Missed exons (lenient): 9/45  (20.00%)
             Novel exons (lenient): 32/68  (47.06%)
                    Missed introns: 4/38  (10.53%)
                     Novel introns: 23/57  (40.35%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 6/15  (40.00%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 2/8  (25.00%)

After selecting the best transcripts in each locus, Mikado has discarded most of the incorrect transcripts while retaining most of the correct information; this can be seen in the increase in precision at e.g. the base level (from ~29% to ~65%). The number of genes has also decreased, as Mikado has discarded many loci whose transcripts are just UTR fragments of neighbouring correct genes.

Analysing the tutorial data with Snakemake

The workflow described in this tutorial can be executed automatically using Snakemake [Snake] with this Snakefile. Just execute:

snakemake

in the directory where you have downloaded all of the tutorial files. In graph representation, this is what the pipeline looks like:

Tutorial/snakemake_dag.svg

Tutorial for Daijin

This tutorial will guide you through the task of configuring and running the whole Daijin pipeline on an A. thaliana dataset, using one aligner (HISAT2) and two assemblers (Stringtie and CLASS2) as the chosen methods. A modern desktop computer with a multicore processor and 4GB of RAM or more should suffice to execute the pipeline.

Overview

The tutorial will guide you through the following tasks:

  1. Configuring Daijin to analyse the chosen data
  2. Running the alignment and assemblies using daijin assemble
  3. Running the Mikado analysis using daijin mikado
  4. Comparing the results against the reference transcriptome
Required software

Mikado should be installed and configured properly (see our installation instructions). Additionally, you should have the following software tools at your disposal (in brackets, the version used at the time of writing):

  • BLAST+ (v2.3.0)
  • TransDecoder (v3.0.0)
  • Portcullis (v0.17.2)
  • HISAT2 (v2.0.4)
  • Stringtie (v1.2.4)
  • CLASS2 (v2.12)
  • SAMtools (v1.2)
Input data

Throughout this tutorial, we will use data coming from release 32 of EnsEMBL and from the PRJEB7093 experiment on ENA. In particular, we will need:

Uncompress the genome, the annotation GFF3 and the UniProt DAT files using gunzip. The protein database must be converted into FASTA format, and we should remove the A. thaliana proteins to avoid biasing our results. To do so, use the script remove_from_embl.py (included in Mikado; please include the quotes in the command, or the script will fail!):

remove_from_embl.py -o "Arabidopsis thaliana" uniprot_sprot_plants.dat uniprot_sprot_plants.fasta

Hint

It is neither necessary nor advisable to remove proteins from the database in a real experiment. We are doing so here only to avoid biasing the Mikado results. You can convert the DAT files to FASTA using the same script but using “NULL” in place of “Arabidopsis thaliana”.
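The organism filter works on the species recorded for each protein entry. As a simplified, hypothetical illustration of the idea (the real script parses the UniProt DAT format; this sketch instead reads the OS= field of UniProt-style FASTA headers):

```python
# Hypothetical sketch of organism-based filtering; not the actual
# remove_from_embl.py logic, which operates on UniProt DAT records.

def filter_fasta(text, exclude_organism):
    """Return (header, sequence) pairs whose OS= organism field differs
    from the excluded species."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line, []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    kept = []
    for hdr, sequence in records:
        # UniProt FASTA headers embed the species as "... OS=<species> OX=<taxid> ..."
        organism = hdr.split("OS=")[-1].split(" OX=")[0] if "OS=" in hdr else ""
        if organism != exclude_organism:
            kept.append((hdr, sequence))
    return kept

fasta = """>sp|P1|A_ARATH Protein OS=Arabidopsis thaliana OX=3702
MKV
>sp|P2|B_ORYSJ Protein OS=Oryza sativa OX=39947
MGL"""
print([h.split("|")[1] for h, _ in filter_fasta(fasta, "Arabidopsis thaliana")])  # → ['P2']
```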

It is possible to get a feel for the annotation of this species - the size of its genes and transcripts, the average number of exons per transcript, etc. - by using mikado util stats; just issue the following command:

mikado util stats Arabidopsis_thaliana.TAIR10.32.gff3 Arabidopsis_thaliana.TAIR10.32.gff3.stats

The file Arabidopsis_thaliana.TAIR10.32.gff3.stats contains the following information:

Stat Total Average Mode Min 1% 5% 10% 25% Median 75% 90% 95% 99% Max
Number of genes 33602 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of genes (coding) 27397 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of monoexonic genes 11079 NA NA NA NA NA NA NA NA NA NA NA NA NA
Transcripts per gene 41671 1.24 1 1 1 1 1 1 1 1 2 2 4 10
Coding transcripts per gene 35359 1.05 1 0 0 0 0 1 1 1 2 2 4 10
CDNA lengths 64867032 1,556.65 72 22 73 255 463 835 1,361 1,963 2,835 3,614 5,517 16,347
CDNA lengths (mRNAs) 54248365 1,534.22 357 22 144 371 555 897 1,383 1,919 2,664 3,268 4,826 16,347
CDS lengths 43498974 1,043.87 0 0 0 0 0 386 903 1,449 2,154 2,724 4,238 16,182
CDS lengths (mRNAs) NA 1,043.87 0 0 0 0 0 386 903 1,449 2,154 2,724 4,238 16,182
CDS/cDNA ratio NA 78.44 100.0 2 35 48 56 68 80 92 100 100 100 100
Monoexonic transcripts 11494 1,377.97 72 22 71 77 136 465 972 1,841 3,110 4,335 5,949 15,195
MonoCDS transcripts 7367 862.50 357 22 99 138 192 357 675 1,206 1,791 2,136 2,822 6,885
Exons per transcript 217183 5.21 1 1 1 1 1 1 3 7 12 15 24 79
Exons per transcript (mRNAs) 2699 5.86 1 1 1 1 1 2 4 8 13 16 25 79
Exon lengths NA 298.67 72 1 32 51 63 89 150 314 615 990 2,400 15,195
Exon lengths (mRNAs) NA 261.83 72 1 32 51 63 89 147 300 559 857 1,736 7,761
Intron lengths NA 165.84 87 8 67 74 78 86 100 168 334 461 849 57,631
Intron lengths (mRNAs) NA 164.69 87 8 67 75 78 86 100 168 333 458 838 11,602
CDS exons per transcript 2308 4.73 1 0 0 0 0 1 3 7 11 14 24 78
CDS exons per transcript (mRNAs) 2308 5.57 1 1 1 1 1 2 4 8 12 15 24 78
CDS exon lengths 43498974 220.92 72 1 19 45 58 81 127 228 458 735 1,572 7,761
CDS Intron lengths 25183403 155.90 84 7 66 73 77 84 98 155 307 427 789 10,233
5’UTR exon number 35359 0.98 1 0 0 0 0 1 1 1 2 2 3 12
3’UTR exon number 35359 0.87 1 0 0 0 0 1 1 1 1 2 2 15
5’UTR length 4119344 116.50 0 0 0 0 0 11 81 163 280 378 609 3,214
3’UTR length 6630047 187.51 0 0 0 0 0 77 181 257 348 434 755 3,164
Stop distance from junction NA 7.62 0 0 0 0 0 0 0 0 0 10 213 2,286
Intergenic distances NA 1,429.33 254;260 -57,916 -1,251 -42 68 279 822 1,899 3,619 4,996 8,734 72,645
Intergenic distances (coding) NA 2,163.73 187 -10,136 -571 -34 70 284 866 2,143 4,500 6,878 20,134 490,690

From this summary it is quite apparent that the A. thaliana genome preferentially encodes multiexonic transcripts, with a median CDS/cDNA ratio of approximately 80%. Each transcript has on average 5 exons (median 3), and intron lengths are generally short - 165 bps on average and 100 as median, with a maximum value of ~11,000 bps for mRNAs. It is also a compact genome, with a median intergenic distance under 1 kbps and a modal distance of only 250 bps.

Step 1: configuring Daijin

The first task is to create a configuration file for Daijin using daijin configure. On the command line, we have to configure the following:

  • name of the configuration file (daijin.yaml)
  • number of threads per process (5)
  • reference genome and the name of the species (without spaces or non-ASCII/special characters)
  • reads to use, with their strandedness and the name of the sample
  • aligners (HISAT2) and assemblers (Stringtie, CLASS2) to use
  • output directory (athaliana)
  • the scoring file to use for Mikado pick; we will ask to copy it in-place to have a look at it (plants.yaml)
  • the protein database for homology searches for Mikado (uniprot_sprot_plants.fasta)
  • flank: as A. thaliana has quite a compact genome, we should decrease the maximum distance for grouping transcripts together. We will decrease it from the default of 1 kbps to 250 bps.
  • (optional) the scheduler to use for the cluster (we will presume SLURM, but we also support LSF and PBS)
  • (optional) name of the cluster configuration file, which will have to be edited manually.

This can all be specified with the following command line:

daijin configure \
  -o daijin.yaml \
  --flank 250 \
  --threads 5 \
  --genome Arabidopsis_thaliana.TAIR10.dna.toplevel.fa --name "Athaliana" \
  -od athaliana \
  -r1 ERR588044_1.fastq.gz -r2 ERR588044_2.fastq.gz --strandedness fr-secondstrand -s ERR588044 \
  -al hisat -as stringtie class \
  --scoring plants.yaml --copy-scoring plants.yaml \
  --prot-db uniprot_sprot_plants.fasta \
  --scheduler SLURM \
  -c slurm.yaml;

This will produce a file, daijin.yaml, which should specify the same parameters as this preformed example.

If you are executing this program on a cluster, you will have to modify the “load” section and specify, for each program, the operations to be performed to make the tool available to the submitted script (e.g. “source mikado-1.0.0” to make Mikado available). For example, this is what our load section looks like after editing:

load:
    #  Commands to use to load/select the versions of the programs to use. Leave an empty
    #  string if no loading is necessary.
    blast: 'source blast-2.3.0'
    class: 'source class-2.12'
    cufflinks: ''
    gmap: ''
    hisat: 'source HISAT-2.0.4'
    mikado: 'source mikado-devel'
    portcullis: 'source portcullis-0.17.2'
    samtools: 'source samtools-1.2'
    star: ''
    stringtie: 'source stringtie-1.2.4'
    tophat: ''
    transdecoder: 'source transdecoder-3.0.0'
    trinity: ''

Also, if you are operating on a cluster, you might want to edit the cluster configuration file, slurm.yaml. This is how it appears:

__default__:
    threads: 4
    memory: 10000
    queue: tgac-medium
asm_trinitygg:
    memory: 20000

The most important parameter to modify is the queue to use - set it according to the configuration of your own cluster.
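For instance, assuming your cluster exposes a partition named batch (a placeholder; use whatever queue exists on your system), the edited file would look like:

```yaml
__default__:
    threads: 4
    memory: 10000
    queue: batch        # placeholder - replace with a queue available on your cluster
asm_trinitygg:
    memory: 20000       # Trinity assemblies are given more memory than the default
```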

Step 2: running the assemble part

Now that we have created a proper configuration file, it is time to launch Daijin assemble and inspect the results. Issue the command:

daijin assemble -c slurm.yaml daijin.yaml

After checking that the configuration file is valid, Daijin will start the alignment and assembly of the dataset. On a normal desktop computer, this should take ~2 hours. Before launching the pipeline, you can obtain a graphical representation of the steps with:

daijin assemble -c slurm.yaml --dag daijin.yaml | dot -Tsvg > assemble.svg

You can also ask Daijin to display the steps to be executed, inclusive of their command lines, by issuing the following command:

daijin assemble -c slurm.yaml --dryrun daijin.yaml

When Daijin is finished, have a look inside the folder athaliana/3-assemblies/output/; you will find the following two GTF files:

  • class-0-hisat-ERR588044-0.gtf
  • stringtie-0-hisat-ERR588044-0.gtf

These are standard GTF files reporting the assembled transcripts for each method. We can get a feel for how they compare with our reference annotation by, again, using mikado util stats. Conveniently, Daijin has already performed this analysis for us, and the files will be present in the same folder:

  • class-0-hisat-ERR588044-0.gtf.stats
  • stringtie-0-hisat-ERR588044-0.gtf.stats

These are the statistics for our CLASS2 run:

Stat Total Average Mode Min 1% 5% 10% 25% Median 75% 90% 95% 99% Max
Number of genes 29029 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of genes (coding) 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of monoexonic genes 3476 NA NA NA NA NA NA NA NA NA NA NA NA NA
Transcripts per gene 35276 1.20 1 1 1 1 1 1 1 1 2 2 4 20
Coding transcripts per gene 0 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDNA lengths 34375468 974.47 202 86 140 202 244 387 746 1,318 1,997 2,524 3,776 20,077
CDNA lengths (mRNAs) 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS lengths 0 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDS lengths (mRNAs) NA 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDS/cDNA ratio NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Monoexonic transcripts 3476 627.98 201 86 100 118 148 225 432 862 1,366 1,690 2,367 6,497
MonoCDS transcripts 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Exons per transcript 166661 4.72 2 1 1 1 2 2 3 6 10 13 19 39
Exons per transcript (mRNAs) 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Exon lengths NA 206.26 101 2 30 50 62 87 123 215 412 652 1,356 16,199
Exon lengths (mRNAs) NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Intron lengths NA 203.58 87 20 68 75 79 86 101 176 353 501 1,441 9,973
Intron lengths (mRNAs) NA NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS exons per transcript 0 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDS exons per transcript (mRNAs) 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS exon lengths 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS Intron lengths 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
5’UTR exon number 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
3’UTR exon number 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
5’UTR length 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
3’UTR length 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Stop distance from junction NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Intergenic distances NA 2,489.09 -200 -21,156 -4,688 -1,088 -505 -122 600 2,526 6,343 10,431 26,002 787,832
Intergenic distances (coding) NA NA NA NA NA NA NA NA NA NA NA NA NA NA

and these for our Stringtie run:

Stat Total Average Mode Min 1% 5% 10% 25% Median 75% 90% 95% 99% Max
Number of genes 40566 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of genes (coding) 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Number of monoexonic genes 11898 NA NA NA NA NA NA NA NA NA NA NA NA NA
Transcripts per gene 55055 1.33 1 1 1 1 1 1 1 1 2 3 5 15
Coding transcripts per gene 0 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDNA lengths 71035478 1,290.26 200 200 210 248 308 535 1,053 1,753 2,556 3,180 4,709 23,560
CDNA lengths (mRNAs) 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS lengths 0 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDS lengths (mRNAs) NA 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDS/cDNA ratio NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Monoexonic transcripts 12251 635.23 200 200 203 215 229 292 452 794 1,295 1,660 2,502 15,487
MonoCDS transcripts 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Exons per transcript 280409 5.09 1 1 1 1 1 2 4 7 11 15 22 57
Exons per transcript (mRNAs) 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Exon lengths NA 253.33 72 2 27 49 62 87 145 293 548 811 1,585 19,610
Exon lengths (mRNAs) NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Intron lengths NA 181.30 87 20 69 75 79 86 100 167 335 467 967 9,958
Intron lengths (mRNAs) NA NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS exons per transcript 0 0.00 0 0 0 0 0 0 0 0 0 0 0 0
CDS exons per transcript (mRNAs) 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS exon lengths 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
CDS Intron lengths 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
5’UTR exon number 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
3’UTR exon number 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
5’UTR length 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
3’UTR length 0.0 NA NA NA NA NA NA NA NA NA NA NA NA NA
Stop distance from junction NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Intergenic distances NA 1,153.56 54 -21,384 -7,051 -3,616 -2,547 -1,094 272 1,729 4,661 7,504 19,125 574,775
Intergenic distances (coding) NA NA NA NA NA NA NA NA NA NA NA NA NA NA

From this quick analysis, it looks like both assemblers have reconstructed reasonable transcripts, on average. We can try to gauge how similar either assembly is to known models by using Mikado compare. To do so, issue the following commands:

# First index the reference GFF3
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 --index;
# Compare the CLASS2 assembly against the reference
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/3-assemblies/output/class-0-hisat-ERR588044-0.gtf -o athaliana/3-assemblies/output/class-0-hisat-ERR588044-0.compare -l athaliana/3-assemblies/output/class-0-hisat-ERR588044-0.log
# Compare the Stringtie assembly against the reference
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.gtf -o athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare -l athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare.log

The analysis will produce TMAP, REFMAP and STATS files for each of the two assemblies. This is the STATS file for the CLASS2 assembly:

41671 reference RNAs in 33602 genes
35276 predicted RNAs in  29029 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 40.73  80.36  54.06
            Exon level (stringent): 43.80  56.22  49.24
              Exon level (lenient): 65.81  74.31  69.80
                      Intron level: 68.00  86.79  76.25
                Intron chain level: 28.37  26.88  27.61
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 2.30  2.70  2.48
  Transcript level (>=80% base F1): 16.33  19.19  17.64
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 2.77  3.20  2.97
        Gene level (>=80% base F1): 19.43  22.37  20.80

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 8549
             Matched intron chains: 8575
   Matching monoexonic transcripts: 870
    Matched monoexonic transcripts: 883
        Total matching transcripts: 9419
         Total matched transcripts: 9458

          Missed exons (stringent): 95133/169282  (56.20%)
           Novel exons (stringent): 57739/131888  (43.78%)
            Missed exons (lenient): 52381/153221  (34.19%)
             Novel exons (lenient): 34867/135707  (25.69%)
                    Missed introns: 40932/127896  (32.00%)
                     Novel introns: 13234/100198  (13.21%)

                Missed transcripts: 15490/41671  (37.17%)
                 Novel transcripts: 99/35276  (0.28%)
                      Missed genes: 14849/33602  (44.19%)
                       Novel genes: 70/29029  (0.24%)
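The Sn/Pr/F1 columns in these reports follow the usual definitions: F1 is the harmonic mean of sensitivity and precision. A quick check in Python against the base-level figures of the CLASS2 report above:

```python
def f1(sn, pr):
    """Harmonic mean of sensitivity (Sn) and precision (Pr), both as percentages."""
    return 2 * sn * pr / (sn + pr) if (sn + pr) else 0.0

# Base level of the CLASS2 comparison: Sn = 40.73, Pr = 80.36
print(round(f1(40.73, 80.36), 2))  # → 54.06
```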

and this is the corresponding report for Stringtie:

mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.gtf -o athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare -l athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare.log
41671 reference RNAs in 33602 genes
55055 predicted RNAs in  40566 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 55.21  59.42  57.23
            Exon level (stringent): 45.31  38.81  41.81
              Exon level (lenient): 68.12  57.04  62.09
                      Intron level: 72.94  64.23  68.31
                Intron chain level: 42.54  29.94  35.15
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 21.26  18.02  19.50
  Transcript level (>=80% base F1): 34.77  32.56  33.63
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 24.76  22.49  23.57
        Gene level (>=80% base F1): 40.20  40.28  40.24

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 12817
             Matched intron chains: 12854
   Matching monoexonic transcripts: 1827
    Matched monoexonic transcripts: 1836
        Total matching transcripts: 14644
         Total matched transcripts: 14690

          Missed exons (stringent): 92576/169282  (54.69%)
           Novel exons (stringent): 120959/197665  (61.19%)
            Missed exons (lenient): 49435/155043  (31.88%)
             Novel exons (lenient): 79549/185157  (42.96%)
                    Missed introns: 34606/127896  (27.06%)
                     Novel introns: 51958/145248  (35.77%)

                Missed transcripts: 12822/41671  (30.77%)
                 Novel transcripts: 155/55055  (0.28%)
                      Missed genes: 12288/33602  (36.57%)
                       Novel genes: 138/40566  (0.34%)

These comparisons suggest the following:

  1. Most of the transcripts reconstructed by both assemblers are close enough to known genes not to be categorised as completely novel. Only 70 genes for CLASS2 and 138 genes for Stringtie fall in non-annotated genomic loci.
  2. CLASS2 has performed worse than Stringtie for this particular species and sample - the number of reconstructed transcripts is lower, and so is the F1 at the transcript and gene level.
  3. Although the number of novel loci is low for both CLASS2 and Stringtie, the high number of predicted loci relative to the reference suggests that both assemblies include fragmented gene models and artefactual loci.
Step 3: running the Mikado steps

Now that we have created the input assemblies, it is time to run Mikado. First of all, let us have a look at the plants.yaml file, which we have conveniently copied to our current directory:

requirements:
  #expression: [(combined_cds_fraction.ncrna or combined_cds_fraction.coding), and, ((exon_num.multi and (cdna_length.multi, or,  combined_cds_length.multi) and max_intron_length, and, (min_intron_length and proportion_verified_introns_inlocus), or exon_num.mono and (combined_cds_length.mono or cdna_length.mono))]
  expression: [(combined_cds_fraction.ncrna or combined_cds_fraction.coding), and, ((exon_num.multi and  (cdna_length.multi, or,  combined_cds_length.multi) and max_intron_length, and, min_intron_length and proportion_verified_introns_inlocus ) or (exon_num.mono and (combined_cds_length.mono or cdna_length.mono))) ]
  parameters:
    combined_cds_fraction.ncrna: {operator: eq, value: 0}
    combined_cds_fraction.coding: {operator: gt, value: 0.35}
    cdna_length.mono: {operator: gt, value: 200}
    cdna_length.multi: {operator: ge, value: 100}
    combined_cds_length.mono: {operator: gt, value: 100}
    combined_cds_length.multi: {operator: gt, value: 50}
    exon_num.mono: {operator: eq, value: 1}
    exon_num.multi: {operator: gt, value: 1}
    max_intron_length: {operator: le, value: 20000}
    min_intron_length: {operator: ge, value: 5}
    proportion_verified_introns_inlocus: {operator: gt, value: 0}
as_requirements:
  expression: [cdna_length and three_utr_length and five_utr_length and utr_length and suspicious_splicing]
  parameters:
    cdna_length: {operator: ge, value: 200}
    utr_length: {operator: le, value: 2500}
    five_utr_length: {operator: le, value: 2500}
    three_utr_length: {operator: le, value: 2500}
    suspicious_splicing: {operator: ne, value: true}
not_fragmentary:
 expression: [((exon_num.multi and (cdna_length.multi or combined_cds_length.multi)), or, (exon_num.mono and combined_cds_length.mono))]
 parameters:
   is_complete: {operator: eq, value: true}
   exon_num.multi: {operator: gt, value: 1}
   cdna_length.multi: {operator: ge, value: 200}
   combined_cds_length.multi: {operator: gt, value: 150}
   exon_num.mono: {operator: eq, value: 1}
   combined_cds_length.mono: {operator: gt, value: 200}
#   expression: [combined_cds_length]
#   parameters:
#     combined_cds_length: {operator: gt, value: 300}
scoring:
  blast_score: {rescaling: max}
  # snowy_blast_score: {rescaling: max}
  cdna_length: {rescaling: max}
  cds_not_maximal: {rescaling: min}
  cds_not_maximal_fraction: {rescaling: min}
  # exon_fraction: {rescaling: max}
  exon_num: {
    rescaling: max,
    filter: {
    operator: ge,
    value: 3}
  }
  five_utr_length:
    filter: {operator: le, value: 2500}
    rescaling: target
    value: 100
  five_utr_num:
    filter: {operator: lt, value: 4}
    rescaling: target
    value: 2
  end_distance_from_junction:
    filter: {operator: lt, value: 55}
    rescaling: min
  highest_cds_exon_number: {rescaling: max}
  intron_fraction: {rescaling: max}
  is_complete: {rescaling: target, value: true}
  number_internal_orfs: {rescaling: target, value: 1}
  # proportion_verified_introns: {rescaling: max}
  non_verified_introns_num: {rescaling: min}
  proportion_verified_introns_inlocus: {rescaling: max}
  retained_fraction: {rescaling: min}
  retained_intron_num: {rescaling: min}
  selected_cds_fraction: {rescaling: target, value: 0.8}
  selected_cds_intron_fraction: {rescaling: max}
  selected_cds_length: {rescaling: max}
  selected_cds_num: {rescaling: max}
  three_utr_length:
    filter: {operator: le, value: 2500}
    rescaling: target
    value: 200
  three_utr_num:
    filter: {operator: lt, value: 3}
    rescaling: target
    value: 1
  combined_cds_locus_fraction: {rescaling: max}

With this file, we are telling Mikado to reward multiexonic transcripts (“exon_num: {rescaling: max, filter: {operator: ge, value: 3}}”), transcripts with a CDS/cDNA ratio of ~80% (“selected_cds_fraction: {rescaling: target, value: 0.8}”), and transcripts with good homology to known proteins (“blast_score: {rescaling: max}”). We are also rewarding long cDNAs (“cdna_length: {rescaling: max}”) carrying a single, complete ORF (“number_internal_orfs: {rescaling: target, value: 1}”; “is_complete: {rescaling: target, value: true}”; “cds_not_maximal: {rescaling: min}”). This set of rules is tailored to plant genomes such as that of A. thaliana.
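How do the rescaling modes in the scoring section translate into numbers? The snippet below is an illustrative sketch - not Mikado's actual code - of the scheme described in the Mikado documentation: within a locus, “max” gives the top score to the highest raw metric value, “min” to the lowest, and “target” to the value closest to the requested optimum, scaling the other transcripts linearly in between.

```python
# Illustrative sketch of the three rescaling modes used in scoring files.
def rescale(values, mode, target=None):
    lo, hi = min(values), max(values)
    if mode == "max":
        # highest raw value scores 1, lowest scores 0
        return [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in values]
    if mode == "min":
        # lowest raw value scores 1, highest scores 0
        return [(hi - v) / (hi - lo) if hi > lo else 1.0 for v in values]
    if mode == "target":
        # value closest to the target scores 1, the farthest scores 0
        dev = [abs(v - target) for v in values]
        worst = max(dev)
        return [1 - d / worst if worst else 1.0 for d in dev]
    raise ValueError(f"Unknown rescaling mode: {mode}")

# selected_cds_fraction with "rescaling: target, value: 0.8":
# 0.8 scores 1.0, while the farthest value (0.5) scores 0.0
print(rescale([0.5, 0.8, 1.0], "target", target=0.8))
```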

Before launching daijin mikado, we can have a look at the BLAST and TransDecoder sections. It is advisable to modify the number of chunks for BLASTX to a high number, e.g. 100, if you are using a cluster. Please note that we are now going to use a different configuration file, athaliana/mikado.yaml, which Daijin created as the last step of the assembly pipeline.
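For instance, the BLAST section of athaliana/mikado.yaml could be edited along these lines (the section and key names reflect the Daijin configuration at the time of writing; verify them against your own generated file):

```yaml
blastx:
    chunks: 100   # split the input FASTA into 100 chunks, one BLASTX job each
```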

Issue the command:

daijin mikado -c slurm.yaml athaliana/mikado.yaml

This part of the pipeline should be quicker than the previous stage. After the pipeline is finished, Daijin will have created the final output files in athaliana/5-mikado/pick/. As we requested only the permissive mode, there is a single output - athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3. These are basic statistics on this annotation:

File genes monoexonic_genes transcripts transcripts_per_gene transcript_len_mean monoexonic_transcripts exons exons_per_transcript exon_len_mean
mikado-permissive.loci.stats 5633 2915 6054 1.07 1642.08 3095 11850 1.96 838.92

Mikado has created an annotation that is in between those produced by Stringtie and CLASS2. Compared to CLASS2, the Mikado assembly is probably more comprehensive, given the higher number of genes. Half the genes are monoexonic, an improvement compared to CLASS2, and at the same time the number of genes is much lower than in Stringtie, with a higher average cDNA length. These statistics suggest that the Mikado filtered annotation has been able to retain most of the real genes, while discarding many of the fragments present in either assembly. We can verify this by comparing the Mikado results against the reference annotation:

mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/mikado_prepared.compare -l athaliana/5-mikado/mikado_prepared.compare.log

A cursory look at the STATS file confirms the impression: the Mikado annotation has removed many of the spurious transcripts and is considerably more precise than either of the input assemblies, while retaining most of their sensitivity:

Command line:
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare -l athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.log
41671 reference RNAs in 33602 genes
30806 predicted RNAs in  24730 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 53.43  87.95  66.48
            Exon level (stringent): 45.44  58.77  51.26
              Exon level (lenient): 68.71  86.45  76.57
                      Intron level: 73.13  93.43  82.05
                Intron chain level: 43.73  53.48  48.11
      Transcript level (stringent): 0.01  0.01  0.01
  Transcript level (>=95% base F1): 18.87  25.49  21.69
  Transcript level (>=80% base F1): 33.71  45.78  38.83
         Gene level (100% base F1): 0.01  0.02  0.01
        Gene level (>=95% base F1): 22.36  30.35  25.75
        Gene level (>=80% base F1): 39.52  53.95  45.62

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 13173
             Matched intron chains: 13213
   Matching monoexonic transcripts: 1702
    Matched monoexonic transcripts: 1710
        Total matching transcripts: 14875
         Total matched transcripts: 14923

          Missed exons (stringent): 92353/169282  (54.56%)
           Novel exons (stringent): 53962/130891  (41.23%)
            Missed exons (lenient): 48471/154914  (31.29%)
             Novel exons (lenient): 16682/123125  (13.55%)
                    Missed introns: 34362/127896  (26.87%)
                     Novel introns: 6576/100110  (6.57%)

                Missed transcripts: 13373/41671  (32.09%)
                 Novel transcripts: 132/30806  (0.43%)
                      Missed genes: 12824/33602  (38.16%)
                       Novel genes: 120/24730  (0.49%)

Moreover, Mikado models have an ORF assigned to them. We can ask Mikado compare to consider only the coding component of transcripts with the following command line:

mikado compare -eu -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu -l athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu.log

The statistics file looks as follows:

Command line:
mikado compare -eu -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu -l athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu.log
41671 reference RNAs in 33602 genes
30806 predicted RNAs in  24730 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 53.00  93.07  67.54
            Exon level (stringent): 60.90  78.70  68.67
              Exon level (lenient): 69.57  88.37  77.85
                      Intron level: 73.44  94.86  82.79
                Intron chain level: 46.39  57.65  51.41
      Transcript level (stringent): 25.84  34.31  29.48
  Transcript level (>=95% base F1): 35.32  47.28  40.43
  Transcript level (>=80% base F1): 39.45  53.03  45.25
         Gene level (100% base F1): 26.78  36.40  30.86
        Gene level (>=95% base F1): 37.63  51.16  43.37
        Gene level (>=80% base F1): 42.29  57.54  48.75

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 13988
             Matched intron chains: 14097
   Matching monoexonic transcripts: 2687
    Matched monoexonic transcripts: 2693
        Total matching transcripts: 16675
         Total matched transcripts: 16790

          Missed exons (stringent): 61443/157150  (39.10%)
           Novel exons (stringent): 25902/121609  (21.30%)
            Missed exons (lenient): 44733/146988  (30.43%)
             Novel exons (lenient): 13456/115711  (11.63%)
                    Missed introns: 32026/120580  (26.56%)
                     Novel introns: 4795/93349  (5.14%)

                Missed transcripts: 13627/41671  (32.70%)
                 Novel transcripts: 141/30806  (0.46%)
                      Missed genes: 13060/33602  (38.87%)
                       Novel genes: 127/24730  (0.51%)

The similarity is considerably higher, suggesting that for many models the difference between the Mikado annotation and the reference lies in the UTR component.

We suggest visualising the assemblies with one of the many tools currently available, such as WebApollo [Apollo]. Mikado files are GFF3-compliant and can be loaded directly into WebApollo or similar tools. GTF files can be converted into proper GFF3 files using the convert utility:

mikado util convert <input gtf> <output GFF3>

Usage

Mikado is composed of four different programs (configure, prepare, serialise, pick) which have to be executed serially to go from an ensemble of different assemblies to the final dataset. In addition to these core programs, Mikado provides a utility to compare annotations, similarly to CuffCompare and ParsEval (compare), and various other minor utilities to perform operations such as extracting regions from a GFF, convert between different gene annotation formats, etc.

Mikado pipeline stages
Mikado configure

This utility prepares the configuration file that will be used throughout the pipeline stages. While the most important options can be set at runtime through the command line, many algorithmic details can be accessed and modified only through the file produced by this command.

Usage

This command will generate a configuration file (in either JSON or YAML format), with the correct configuration for the parameters set on the command line. See the in-depth section on the structure of the configuration file for details.

Command line parameters:

  • full: By default, Mikado configure will output a stripped-down configuration file, with only some of the fields explicitly present. Use this flag to show all the available fields.

Hint

If a parameter is not explicitly present, its value will be set to the available default.

  • external: an external JSON/YAML file from which Mikado should recover parameters. If something is specified both on the command line and in the external configuration file, command line options take precedence.
  • reference: Reference genome FASTA file. Required for the correct functioning of Mikado.
  • gff: list of GFFs/GTFs of the assemblies to use for Mikado, comma separated.
  • labels: optional list of labels to be assigned to each assembly. The label will be also the “source” field in the output of Mikado prepare.
  • strand-specific-assemblies: list of strand specific assemblies among those specified with the gff flag.
  • strand-specific: flag. If set, all assemblies will be treated as strand-specific.
  • list: as an alternative to specifying all the information on the command line, it is possible to give Mikado a tab-separated file with the following contents:
    1. Location of the file
    2. label for the file
    3. whether that assembly is strand-specific or not (write True/False)
    4. Optionally, a bonus/malus to be associated to transcripts coming from that assembly.
  • -j, json: flag. If present, the output file will be in JSON rather than YAML format.
  • mode: in which mode Mikado pick will run. See the relative section for further details.
  • scoring: scoring file to be used for selection. At the moment, four configuration files are present:
    • plants
    • human
    • worm
    • insects
  • copy-scoring: this option specifies a path to which the chosen scoring file will be copied, which is useful for customisation on new species. The configuration file will account for this and list the location of the copy as the scoring file to be used for the run.
  • junctions: reliable junctions derived from the RNA-Seq alignments. Usually obtained through Portcullis, but we also support e.g. junctions from TopHat2. This file must be in BED12 format.
  • blast_targets: FASTA database of blast targets, for the serialisation stage.
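For example, a list file describing the two assemblies produced earlier in this tutorial could look like this (columns are tab-separated; the labels are arbitrary, and the optional fourth column assigns a score bonus to transcripts from that assembly):

```text
class-0-hisat-ERR588044-0.gtf	cl	True
stringtie-0-hisat-ERR588044-0.gtf	st	True	1
```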

Usage:

$ mikado configure --help
usage: Mikado configure [-h] [--full]
                        [--scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml}]
                        [--copy-scoring COPY_SCORING] [--strand-specific]
                        [--no-files | --gff GFF | --list LIST]
                        [--reference REFERENCE] [--junctions JUNCTIONS]
                        [-bt BLAST_TARGETS]
                        [--strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES]
                        [--labels LABELS] [--external EXTERNAL]
                        [--mode {nosplit,stringent,lenient,permissive,split}]
                        [-j]
                        [out]

This utility guides the user through the process of creating a configuration file for Mikado.

positional arguments:
  out

optional arguments:
  -h, --help            show this help message and exit
  --full
  --strand-specific     Boolean flag indicating whether all the assemblies are strand-specific.
  --no-files            Remove all files-specific options from the printed configuration file.
                                               Invoking the "--gff" option will disable this flag.
  --gff GFF             Input GFF/GTF file(s), separated by comma
  --list LIST           List of the inputs, one by line, in the form:
                        <file1>  <label>  <strandedness (true/false)>
  --reference REFERENCE
                        Fasta genomic reference.
  --strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
                        List of strand-specific assemblies among the inputs.
  --labels LABELS       Labels to attach to the IDs of the transcripts of the input files,
                                separated by comma.
  --external EXTERNAL   External configuration file to overwrite/add values from.
                            Parameters specified on the command line will take precedence over those present in the configuration file.
  --mode {nosplit,stringent,lenient,permissive,split}
                        Mode in which Mikado will treat transcripts with multiple ORFs.
                        - nosplit: keep the transcripts whole.
                        - stringent: split multi-orf transcripts if two consecutive ORFs have both BLAST hits
                                     and none of those hits is against the same target.
                        - lenient: split multi-orf transcripts as in stringent, and additionally, also when
                                   either of the ORFs lacks a BLAST hit (but not both).
                        - permissive: like lenient, but also split when both ORFs lack BLAST hits
                        - split: split multi-orf transcripts regardless of what BLAST data is available.
  -j, --json            Output will be in JSON instead of YAML format.

Options related to the scoring system:
  --scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml}
                        Available scoring files.
  --copy-scoring COPY_SCORING
                        File into which to copy the selected scoring file, for modification.

Options related to the serialisation step:
  --junctions JUNCTIONS
  -bt BLAST_TARGETS, --blast_targets BLAST_TARGETS

Anatomy of the configuration file

The guide here describes all the fields of the configuration file. However, the configuration created by default by mikado configure is much simpler.

Database settings

This section deals with the database settings that will be necessary for the serialisation and picking phases of the pipeline. By default, Mikado will use a SQLite database, but it currently also supports MySQL and PostgreSQL through SQLAlchemy. Fields:

  • db: name of the database to use. In case the database is SQLite, this will be the database file, otherwise it will be the database name.
  • dbtype: type of the database; one of sqlite, mysql or postgresql.
  • dbhost: host where the database is located. Required with MySQL and PostgreSQL.
  • dbuser: User of the database. Required with MySQL and PostgreSQL.
  • dbpasswd: Database password. Required with MySQL and PostgreSQL.
  • dbport: Port to access the database. It defaults to the normal ports for the selected database.
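These fields map onto a standard SQLAlchemy-style connection URL, roughly as sketched below (an illustration with a made-up helper, not Mikado code):

```python
def db_url(db, dbtype="sqlite", dbhost="localhost",
           dbuser="", dbpasswd="", dbport=0):
    """Assemble an SQLAlchemy-style connection URL from the db_settings fields.

    For SQLite, `db` is a file path; for MySQL/PostgreSQL it is the
    database name on the server. A dbport of 0 means "use the default".
    """
    if dbtype == "sqlite":
        return "sqlite:///{}".format(db)
    port = dbport or {"mysql": 3306, "postgresql": 5432}[dbtype]
    return "{}://{}:{}@{}:{}/{}".format(dbtype, dbuser, dbpasswd,
                                        dbhost, port, db)
```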
db_settings:
  #  Settings related to DB connection. Parameters:
  #  db: the DB to connect to. Required. Default: mikado.db
  #  dbtype: Type of DB to use. Choices: sqlite, postgresql, mysql. Default: sqlite.
  #  dbhost: Host of the database. Unused if dbtype is sqlite. Default: localhost
  #  dbuser: DB user. Default:
  #  dbpasswd: DB password for the user. Default:
  #  dbport: Integer. It indicates the default port for the DB.
  db: /usr/users/ga002/venturil/workspace/mikado/docs/mikado.db
  dbhost: localhost
  dbpasswd: ''
  dbport: 0
  dbtype: sqlite
  dbuser: ''
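For instance, to point Mikado at a MySQL server rather than SQLite, the section could look as follows (host, user and database names are placeholders):

```yaml
db_settings:
  db: mikado          # database name on the server, not a file path
  dbtype: mysql
  dbhost: localhost
  dbuser: mikado_user
  dbpasswd: ''
  dbport: 3306        # standard MySQL port
```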

Reference settings

This section of the configuration file deals with the reference genome. It specifies three fields:

  • genome: the genome FASTA file. Required.
  • genome_fai: FAI index of the genome. Used by Mikado serialise, it can be inferred if left null.
  • transcriptome: optional annotation file for the genome. Mikado currently ignores this field, but it is used by Daijin to guide some of the RNA-Seq assemblies.
reference:
  #  Options related to the reference genome.
  genome: ''
  genome_fai: ''
  transcriptome: ''

Settings for the prepare stage

This section of the configuration file deals with the prepare stage of Mikado. It specifies the input files, their labels, and which of them are strand specific. The available fields are the following:

  • canonical: this field specifies the splice site donors and acceptors that are considered canonical for the species. By default, Mikado uses the canonical splice site pair (GT/AG) and the two semi-canonical pairs (GC/AG and AT/AC). Type: Array of two-element arrays, composed of two-letter strings.
  • keep_redundant: if set to false (default), Mikado will only keep one copy of transcripts that are completely identical.
  • lenient: boolean value. If set to false, transcripts that either only have non-canonical splice sites or have a mixture of canonical junctions on both strands will be removed from the output. Otherwise, they will be left in, properly tagged.
  • minimum_length: minimum length of the transcripts to be kept.
  • procs: number of processors to be used.
  • strand_specific: boolean. If set to true, all input assemblies will be treated as strand-specific, therefore keeping the strand of monoexonic fragments as it was. Multiexonic transcripts will not have their strand reversed even if doing that would mean making some or all non-canonical junctions canonical.
  • strip_cds: boolean. If set to true, the CDS features will be stripped off the input transcripts. This might be necessary for eg transcripts obtained through alignment with GMAP [GMAP].
  • files: this sub-section is the most important, as it contains among other things the locations and labels of the input files. Fields:
    • gff: array of the input files, in GFF or GTF format. Please note that only CDS/exon/UTR features will be considered from these files.
    • labels: optional array of the labels to be assigned to the input files. If non-empty, it must have the same order and length as the gff array, and be composed of unique elements. The labels will be used in two ways: as a prefix for the transcripts coming from the corresponding GFF, and as the source field assigned to the transcript. The latter might be of relevance during the picking stage.
    • log: name of the log file.
    • out: name of the GTF output file.
    • out_fasta: name of the corresponding output FASTA file.
    • output_dir: output directory. It will be created if it does not exist already.
    • strand_specific_assemblies: array of the names of the GFF/GTF files that are strand specific. All the file names in this array must also appear in the gff array.
prepare:
  #  Options related to the input data preparation.
  #  - files: options relative to the input/output files.
  #  - procs: Number of processes to use.
  #  - strip_cds: whether to remove the CDS from the predictions during preparation.
  #  - lenient: if set to True, invalid transcripts will be only flagged and not removed.
  #  EXPERIMENTAL.
  #  - strand_specific: if set to True, transcripts will be assumed to be in the correct
  #  orientation, no strand flipping or removal
  #  - strand_specific_assemblies: array of input predictions which are to be considered
  #  as strand-specific.
  #    Predictions not in this list will be considered as non-strand-specific.
  #  - canonical: canonical splice sites, to infer the correct orientation.
  canonical:
  - - GT
    - AG
  - - GC
    - AG
  - - AT
    - AC
  files:
    #  Options related to the input and output files.
    #  - out: output GTF file
    #  - out_fasta: output transcript FASTA file
    #  - gff: array of input predictions for this step.
    #  - log: output log. Default: prepare.log
    #  - labels: labels to be associated with the input GFFs. Default: None.
    gff: []
    labels: []
    log: prepare.log
    out: mikado_prepared.gtf
    out_fasta: mikado_prepared.fasta
    output_dir: .
    strand_specific_assemblies: []
  keep_redundant: false
  lenient: false
  minimum_length: 200
  procs: 1
  single: false
  strand_specific: false
  strip_cds: false

Settings for the serialisation stage

This section of the configuration file deals with the serialisation stage of Mikado. It specifies the location of the ORF BED12 files from TransDecoder, the location of the XML files from BLAST, the location of portcullis junctions, and other details important at run time. It has the following fields:

  • discard_definition: boolean. This is used to specify whether we will use the ID or the definition of the sequences when parsing BLAST results. This is important when BLAST data might have a mock, local identifier for the sequence (“lcl|1”) rather than its original ID. :warning: Deprecated since v1 beta 10.
  • force: whether the database should be truncated and rebuilt, or just updated.
  • max_objects: this parameter is quite important when running with a SQLite database. SQLite does not support caching on the disk before committing the changes, so that every change has to be kept in memory. This can become a problem for RAM quite quickly. On the other hand, committing is an expensive operation, and it makes sense to minimise calls as much as possible. This parameter specifies the maximum number of objects Mikado will keep in memory before committing them to the database. The default number, 100,000, should ensure that Mikado runs with less than 1GB memory. Increase it to potentially increase speed at the price of greater memory usage; for example, increasing it to 1,000,000 will cause Mikado to use ~6GB of RAM at its peak usage.
  • max_regression: this parameter is a float between 0 and 1. TransDecoder will sometimes output open ORFs even in the presence of an in-frame start codon. Mikado can try to “regress” along the ORF until it finds one such start codon. This parameter specifies how far Mikado is allowed to regress, as a fraction of the cDNA length.
  • max_target_seqs: equivalent to the BLAST+ parameter of the same name - it indicates the maximum number of discrete hits that can be assigned to one sequence in the database.
  • procs: number of processors to use. Most important for serialising BLAST+ files.
  • single_thread: boolean, if set to true it will forcibly disable multi-threading. Useful mostly for debugging purposes.
  • files: this sub-section specifies the location of the input files for serialise. It contains the following fields:
    • junctions: array of locations of reliable junction files. These must be in BED12 format.
    • log: log file.
    • orfs: array of files with the locations of the ORFs on the cDNAs, as created by e.g. TransDecoder [Trinity].
    • output_dir: output directory where the log file and the SQLite database will be written to (if SQLite has been chosen as the database type)
    • transcripts: input transcripts. This should be set to be equal to the output of Mikado prepare, ie the “out_fasta” field of the prepare section of the configuration file.
    • xml: this array indicates the location of the BLAST output file. Elements of the array can be:
      • BLAST+ XML files (optionally compressed with gzip)
      • BLAST+ ASN files (optionally compressed with gzip), which will be converted in-memory using blast_formatter
      • a folder containing files of the above types.
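As a worked example of the max_regression arithmetic, the search window can be computed as below (an illustrative sketch with a made-up function name, not Mikado's own code):

```python
def regression_window(cdna_length, max_regression):
    """Length of the leading cDNA stretch searched for an in-frame start codon."""
    if not 0 <= max_regression <= 1:
        raise ValueError("max_regression must be between 0 and 1")
    return int(cdna_length * max_regression)

# With max_regression = 0.1 on a 300 bp cDNA, the search is limited
# to the first 30 bp, as in the configuration comments below.
```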
serialise:
  #  Options related to serialisation
  #  - force: whether to drop and reload everything into the DB
  #  - files: options related to input files
  #  - max_objects: Maximum number of objects to keep in memory while loading data
  #  into the database
  #  - max_regression: if the ORF lacks a valid start site, this percentage indicates
  #  how far
  #    along the sequence Mikado should look for a good start site. Eg. with a value
  #  of 0.1,
  #    on a 300bp sequence with an open ORF Mikado would look for an alternative in-frame
  #  start codon
  #    in the first 30 bps (10% of the cDNA).
  #  - max_target_seqs: equivalently to BLAST, it indicates the maximum number of
  #  targets to keep
  #    per blasted sequence.
  #  - discard_definition: Boolean. Used to indicate whether Mikado should use the
  #  definition
  #    rather than the ID for BLAST sequences. Necessary as in some instances BLAST
  #  XMLs will have
  #    a mock identifier rather than the original sequence ID (eg lcl|1). Default:
  #  false.
  #  - procs: Number of processors to use. Default: 1.
  #  - single_thread: if true, Mikado prepare will force the usage of a single thread
  #  in this step.
  files:
    blast_targets:
    - ''
    junctions: []
    log: serialise.log
    orfs:
    - ''
    output_dir: .
    transcripts: mikado_prepared.fasta
    xml:
    - ''
  force: false
  max_objects: 100000
  max_regression: 0
  max_target_seqs: 100000
  procs: 1
  single_thread: false

Hint

The most expensive operation in a “Mikado serialise” run is by far the serialisation of the BLAST files. Splitting the input files into multiple chunks, and analysing them separately, allows Mikado to parallelise the analysis of the BLAST results. If a single monolithic XML/ASN file is produced, by contrast, Mikado will be quite slow as it will have to parse it all.
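A minimal way to produce such chunks, before running BLAST on each one, is sketched below (illustrative code; the chunking scheme and function name are not a Mikado convention):

```python
from itertools import cycle

def split_fasta(records, n_chunks):
    """Distribute (header, sequence) records round-robin across n_chunks lists,
    so that each chunk can be BLASTed, and later serialised, independently."""
    chunks = [[] for _ in range(n_chunks)]
    for chunk, record in zip(cycle(chunks), records):
        chunk.append(record)
    return chunks
```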

Settings for the pick stage

This section of the configuration file deals with the picking stage of Mikado. It specifies details on how to handle BLAST and ORF data, which alternative splicing events are considered as valid during the final stages of the picking, and other important algorithmic details. The section comprises the following subsections:

  • alternative_splicing: Options related to which AS events are considered as valid for the primary transcript in a locus.
  • chimera_split: Options related to how to handle transcripts with multiple valid ORFs.
  • files: Input and output files.
  • orf_loading: Options related to how to decide which ORFs to load onto each transcript.
  • output_format: options related to how to format the names of the transcripts, the source field of the GFFs, etc.
  • run_options: Generic options related either to the general algorithm or to the number of resources requested.
  • scoring_file: This value specifies the scoring file to be used for Mikado. These can be found in Mikado.configuration.scoring_files.

Hint

It is possible to ask for the scoring file to be copied in-place for customisation when calling mikado configure.

  • source_score: in this section, it is possible to specify boni/mali to be assigned to specific labels. Eg, it might be possible to assign a bonus of 1 to any transcript coming from PacBio reads, or a malus to any transcript coming from a given assembler. Example of such a configuration:
pick:
    source_score:
        Cufflinks: 0
        Trinity: 0
        PacBio: 2
        Stringtie: 1

In this example, we asked Mikado to consider Stringtie transcripts as more trustworthy than the rest (1 additional point), and PacBio transcripts even more so (2 additional points).

Each subsection of the pick configuration will be explained in its own right.

Parameters regarding the alternative splicing

After selecting the best model for each locus, Mikado will backtrack and try to select valid alternative splicing events. This section deals with how Mikado will operate the selection. In order to be considered as valid potential AS events, transcripts have to satisfy the minimum requirements specified in the scoring file. These are the available parameters:

  • report: boolean. Whether to calculate and report possible alternative splicing events at all. By default this is set to true; setting this parameter to false will inactivate all the options in this section.
  • keep_retained_introns: boolean. It specifies whether transcripts with retained introns will be retained. A retained intron is defined as an exon at least partly non-coding, whose non-coding part falls within the intron of another transcript (so, retained intron events which yield a valid ORF will not be excluded). By default, such transcripts will be excluded.
  • min_cdna_overlap: minimum cDNA overlap between the primary transcript and the AS candidate. By default, this is set to 0 and we rely only on the class code and the CDS overlap. It must be a number between 0 and 1.
  • min_cds_overlap: minimum CDS overlap between the primary transcript and the AS candidate. By default this is set to 0.6, ie 60%. It must be a number between 0 and 1.
  • min_score_perc: Minimum percentage of the score of the primary transcript that any candidate AS must have to be considered. By default, this is set to 0.6 (60%). It must be a number between 0 and 1.
  • only_confirmed_introns: boolean. This parameter determines whether to consider only transcripts whose introns are confirmed in the dataset of reliable junctions, or whether to consider all possible candidate transcripts.
  • redundant_ccodes: any candidate AS will be compared against all the transcripts already retained in the locus. If any of these comparisons returns one of the class codes specified in this array, the transcript will be ignored. Default class codes: =, _, m, c, n, C
  • valid_ccodes: any candidate AS will be compared against the primary transcript to determine the type of AS event. If the class code is one of those specified in this array, the transcript will be considered further. Valid class codes are within the categories “Alternative splicing”, “Extension” with junction F1 lower than 100%, and Overlap (with the exclusion of “m”). Default class codes: j, J, g, G, h.
  • pad: boolean option. If set to true, Mikado will try to pad transcripts so that their ends are the same. Disabled by default.
  • ts_max_splices: numerical. When padding is activated, at most how many splice junctions can the extended exon cross?
  • ts_distance: numerical. When padding is activated, by at most how many base pairs can an exon be extended?
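Putting these thresholds together, the admission test for an AS candidate can be sketched as follows (illustrative code, not the Mikado source; the defaults mirror the sample configuration below):

```python
def valid_as_candidate(score, best_score, class_code,
                       cdna_overlap, cds_overlap,
                       min_score_perc=0.5,
                       min_cdna_overlap=0.5, min_cds_overlap=0.75,
                       valid_ccodes=("j", "J", "C", "G", "g", "h")):
    """Return True if a candidate passes the thresholds for AS reporting.

    The comparison class code, score fraction, and cDNA/CDS overlaps are
    all computed against the primary transcript of the locus.
    """
    return (class_code in valid_ccodes
            and score >= best_score * min_score_perc
            and cdna_overlap >= min_cdna_overlap
            and cds_overlap >= min_cds_overlap)
```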

Warning

The transcript does not need to be a valid AS event against all transcripts in the locus, only against the primary transcript.

alternative_splicing:
      #  Parameters related to alternative splicing reporting.
      #  - report: whether to report at all or not the AS events.
      #  - min_cds_overlap: minimum overlap between the CDS of the primary transcript
      #  and any AS event. Default: 60%.
      #  - min_cdna_overlap: minimum overlap between the CDNA of the primary transcript
      #  and any AS event.
      #  Default: 0% i.e. disabled, we check for the CDS overlap.
      #  - keep_retained_introns: Whether to consider as valid AS events where one intron
      #  is retained compared to the primary or any other valid AS. Default: false.
      #  - max_isoforms: Maximum number of isoforms per locus. 1 implies no AS reported.
      #  Default: 3
      #  - valid_ccodes: Valid class codes for AS events. Valid codes are in categories
      #  Alternative splicing, Extension (with junction F1 lower than 100%),
      #  and Overlap (exluding m). Default: j, J, g, G, C, h
      #  - max_utr_length: Maximum length of the UTR for AS events. Default: 10e6 (i.e.
      #  no limit)
      #  - max_fiveutr_length: Maximum length of the 5UTR for AS events. Default:
      #  10e6 (i.e. no limit)
      #  - max_threeutr_length: Maximum length of the 3'UTR for AS events. Default:
      #  10e6 (i.e. no limit)
      #  - min_score_perc: Minimum score threshold for subsequent AS events.
      #   Only transcripts with a score at least (best) * value are retained.
      #  - only_confirmed_introns: bring back AS events only when their introns are
      #  either
      #   present in the primary transcript or in the set of confirmed introns.
      #  - pad: boolean switch. If true, Mikado will pad all the transcript in a gene
      #  so that their ends are the same
      #  - ts_distance: if padding, this is the maximum distance in base-pairs between
      #  the starts of transcripts
      #    to be considered to be padded together.
      #  - ts_max_splices: if padding, this is the maximum amount of splicing junctions
      #  that the transcript to pad
      #   is allowed to cross. If padding would lead to cross more than this number,
      #  the transcript will not be padded.
      keep_retained_introns: false
      max_isoforms: 5
      min_cdna_overlap: 0.5
      min_cds_overlap: 0.75
      min_score_perc: 0.5
      only_confirmed_introns: true
      pad: false
      redundant_ccodes:
      - c
      - m
      - _
      - '='
      - n
      report: true
      ts_distance: 300
      ts_max_splices: 1
      valid_ccodes:
      - j
      - J
      - C
      - G
      - g
      - h

Parameters regarding the clustering of transcripts in loci

Note

New in version 1 beta 10.

This section influences how Mikado clusters transcripts in its multi-stage selection. The available parameters are:

  • flank: numerical. When constructing Superloci, Mikado will use this value as the maximum distance between transcripts for them to be integrated within the same superlocus.
  • cds_only: boolean. If set to true, during the picking stage Mikado will consider only the primary ORF to evaluate whether two transcripts intersect. Transcripts which e.g. share introns in their UTRs but have completely unrelated CDSs will be clustered separately. Disabled by default.
  • purge: boolean. If true, any transcript failing the specified requirements will be purged out. Otherwise, they will be assigned a score of 0 and might potentially appear in the final output, if no other transcript is present in the locus.
  • simple_overlap_for_monoexonic: boolean. During the second clustering, by default monoexonic transcripts are clustered together even if they have only a very slight overlap with another transcript. Manually setting this flag to false will cause Mikado to cluster monoexonic transcripts only if they have a minimum amount of cDNA and CDS overlap with the other transcripts in the holder.
  • min_cdna_overlap: numerical, between 0 and 1. Minimum cDNA overlap between two multiexonic transcripts for them to be considered as intersecting, if all other conditions fail.
  • min_cds_overlap: numerical, between 0 and 1. Minimum CDS overlap between two multiexonic transcripts for them to be considered as intersecting, if all other conditions fail.
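For the two overlap thresholds, a common way to compute the overlap fraction between two spans is shown below (an illustrative sketch; Mikado's exact definition may differ):

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Overlap between two intervals, as a fraction of the shorter span."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return shared / min(a_end - a_start, b_end - b_start)
```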

clustering:
    #  Parameters related to the clustering of transcripts into loci.
    #  - cds_only: boolean, it specifies whether to cluster transcripts only according
    #  to their CDS (if present).
    #  - min_cds_overlap: minimal CDS overlap for the second clustering.
    #  - min_cdna_overlap: minimal cDNA overlap for the second clustering.
    #  - flank: maximum distance for transcripts to be clustered within the same superlocus.
    #  - remove_overlapping_fragments: boolean, it specifies whether to remove putative
    #  fragments.
    #  - purge: boolean, it specifies whether to remove transcripts which fail the
    #  minimum requirements check - or whether to ignore those requirements altogether.
    #  - simple_overlap_for_monoexonic: boolean. If set to true (default), then any
    #  overlap mean inclusion
    #  in a locus for or against a monoexonic transcript. If set to false, normal controls
    #  for the percentage
    #  of overlap will apply.
    #  - max_distance_for_fragments: maximum distance from a valid locus for another
    #  to be considered a fragment.
    cds_only: false
    flank: 200
    min_cdna_overlap: 0.2
    min_cds_overlap: 0.2
    purge: true
    simple_overlap_for_monoexonic: true

Parameters regarding the detection of putative fragments

This section determines how Mikado treats potential fragments in the output. Available options:

  • remove: boolean, default true. If set to true, fragments will be excluded from the final output; otherwise, they will be printed out, but properly tagged.
  • max_distance: numerical. For non-overlapping fragments, this value determines the maximum distance from the valid gene. Eg. with the default setting of 2000, a putative fragment at the distance of 1000 will be tagged and dealt with as a fragment; an identical model at a distance of 3000 will be considered as a valid gene and left untouched.
  • valid_class_codes: valid class codes for potential fragments. Only Class Codes in the categories Overlap, Intronic, Fragment, with the addition of “_”, are considered as valid choices.
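The worked example above can be condensed into a small predicate (illustrative only, with made-up names; defaults follow this section):

```python
def is_removed_fragment(distance, class_code,
                        remove=True, max_distance=2000,
                        valid_class_codes=("p", "P", "x", "X", "i", "m", "_")):
    """Would a putative fragment at `distance` from a valid gene be dropped?"""
    is_fragment = class_code in valid_class_codes and distance <= max_distance
    return is_fragment and remove

# A fragment-like model 1000 bp away is dropped; an identical model
# 3000 bp away is treated as a valid gene and left untouched.
```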
fragments:
  #  Parameters related to the handling of fragments.
  #  - remove: boolean. Whether to remove fragments or leave them, properly tagged.
  #  - max_distance: maximum distance of a putative fragment from a valid gene.
  #  - valid_class_codes: which class codes will be considered as fragments. Default:
  #  (p, P, x, X, i, m, _). Choices: _ plus any class code with category
  #  Intronic, Fragment, or Overlap.
  max_distance: 2000
  remove: true
  valid_class_codes:
  - p
  - P
  - x
  - X
  - i
  - m
  - _

Parameters regarding assignment of ORFs to transcripts

This section of the configuration file deals with how to determine valid ORFs for a transcript from those present in the database. The parameters to control the behaviour of Mikado are the following:

  • minimal_orf_length: minimal length of the primary ORF to be loaded onto the transcript. By default, this is set at 50 bps (not amino acids).
  • minimal_secondary_orf_length: minimal length of any ORF that can be assigned to the transcript after the first. This value should be set higher than minimal_orf_length, in order to avoid loading uORFs [uORFs] into the transcript, leading to spurious breakdowns of the UTRs. Default: 200 bps.
  • strand_specific: boolean. If set to true, only ORFs on the plus strand (ie the same as the cDNA) will be considered. If set to false, monoexonic transcripts might have their strand flipped.
pick:
    orf_loading:
      #  Parameters related to ORF loading.
      #  - minimal_secondary_orf_length: Minimum length of a *secondary* ORF
      #    to be loaded after the first, in bp. Default: 200 bps
      #  - minimal_orf_length: Minimum length in bps of an ORF to be loaded,
      #    as the primary ORF, onto a transcript. Default: 50 bps
      #  - strand_specific: Boolean flag. If set to true, monoexonic transcripts
      #    will not have their ORF reversed even if they would have an ORF on the opposite
      #  strand.
      minimal_orf_length: 50
      minimal_secondary_orf_length: 200
      strand_specific: true

Parameters regarding splitting of chimeras

This section of the configuration file specifies how to deal with transcripts presenting multiple ORFs, ie putative chimeras (see the section above for parameters related to which ORFs can be loaded). Those are identified as transcripts with more than one ORF, where:

  • all the ORFs share the same strand
  • all the ORFs are non-overlapping, ie they do not share any bp

In these situations, Mikado can try to deal with the chimeras in five different ways, in decreasingly conservative fashion:

  • nosplit: leave the transcript unchanged. The presence of multiple ORFs will affect the scoring.
  • stringent: leave the transcript unchanged, unless the two ORFs both have hits in the protein database and none of the hits is in common.
  • lenient: leave the transcript unchanged, unless either both ORFs have hits in the protein database, none of which is in common, or exactly one of the two ORFs lacks any hit.
  • permissive: presume the transcript is a chimera, and split it, unless two ORFs share a hit in the protein database.
  • split: presume that every transcript with more than one ORF is incorrect, and split them.

If any BLAST hit spans the two ORFs, then the model will be considered as a non-chimera because there is evidence that the transcript constitutes a single unit. The only case when this information will be disregarded is during the execution of the split mode.

These modes can be controlled directly from the pick command line.
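The five modes can be summarised as a decision function on two pieces of BLAST evidence, whether each ORF has hits and whether any hit is shared between (or spans) the ORFs (an illustrative sketch of the logic above, not the Mikado implementation):

```python
def should_split(mode, orf1_has_hits, orf2_has_hits, share_a_hit):
    """Decide whether a two-ORF transcript is split under each Mikado mode."""
    if mode == "nosplit":
        return False
    if mode == "split":
        return True
    if share_a_hit:           # a common (or spanning) hit vetoes the split
        return False
    if mode == "stringent":   # both ORFs must have (disjoint) hits
        return orf1_has_hits and orf2_has_hits
    if mode == "lenient":     # at least one ORF must have a hit
        return orf1_has_hits or orf2_has_hits
    if mode == "permissive":  # split even when neither ORF has a hit
        return True
    raise ValueError("Unknown mode: {}".format(mode))
```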

The behaviour, and when to trigger the check, is controlled by the following parameters:

  • execute: boolean. If set to false, Mikado will operate in the nosplit mode. If set to true, the choice of the mode will be determined by the other parameters.

  • blast_check: boolean. Whether to execute the check on the BLAST hits. If set to false, Mikado will operate in the split mode, unless execute is set to false (execute takes precedence over the other parameters).

  • blast_params: this section contains the settings for the permissive, lenient and stringent modes.
    • evalue: maximum evalue of a hit to be assigned to the transcript and therefore be considered.
    • hsp_evalue: maximum evalue of a hsp inside a hit to be considered for the analysis.
    • leniency: one of LENIENT, PERMISSIVE, STRINGENT. See above for definitions.
    • max_target_seqs: integer. When loading BLAST hits from the database, only the first N will be considered for analysis.
    • minimal_hsp_overlap: number between 0 and 1. This indicates the overlap that must exist between the HSP and the ORF for the former to be considered for the split.
    • min_overlap_duplication: in the case of tandem duplicated genes, a chimera will have two ORFs that share the same hits, but possibly in a peculiar way - the HSPs will fall on the same region of the target sequence. This parameter controls how much overlap counts as a duplication. The default value is 0.9 (90%).
pick:
    chimera_split:
      #  Parameters related to the splitting of transcripts in the presence of
      #  two or more ORFs. Parameters:
      #  - execute: whether to split multi-ORF transcripts at all. Boolean.
      #  - blast_check: whether to use BLAST information to take a decision. See blast_params
      #  for details.
      #  - blast_params: Parameters related to which BLAST data we want to analyse.
      blast_check: true
      blast_params:
        #  Parameters for the BLAST check prior to splitting.
        #  - evalue: Minimum evalue for the whole hit. Default: 1e-6
        #  - hsp_evalue: Minimum evalue for any HSP hit (some might be discarded even
        #  if the whole hit is valid). Default: 1e-6
        #  - leniency: One of STRINGENT, LENIENT, PERMISSIVE. Default: LENIENT
        #  - max_target_seqs: maximum number of hits to consider. Default: 3
        #  - minimal_hsp_overlap: minimum overlap of the ORF with the HSP (*not* reciprocal).
        #  Default: 0.8, i.e. 80%
        #  - min_overlap_duplication: minimum overlap (in %) for two ORFs to consider
        #  them as target duplications.
        #    This means that if two ORFs have no HSPs in common, but the coverage of
        #  their disjoint HSPs covers more
        #    Than this % of the length of the *target*, they represent most probably
        #  a duplicated gene.
        evalue: 1.0e-06
        hsp_evalue: 1.0e-06
        leniency: LENIENT
        max_target_seqs: 3
        min_overlap_duplication: 0.8
        minimal_hsp_overlap: 0.9
      execute: true
Parameters regarding input and output files

The “files” and “output_format” sections deal respectively with input files for the pick stage and with some basic settings for the GFF output. Options:

  • input: input GTF file for the run. It should be the file generated by the prepare stage, i.e. the output of Mikado prepare.
  • loci_out: main output file. It contains the winning transcripts, separated in their own gene loci, in GFF3 format. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
  • log: name of the log file. Default: mikado_pick.log
  • monoloci_out: this optional output file will contain the transcripts that have been passed to the monoloci phase. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
  • subloci_out: this optional output file will contain the transcripts that have been passed to the subloci phase. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
  • output_format: this section specifies some details on the output format.
    • id_prefix: prefix for all the final Mikado models. The ID will be <prefix>.<chromosome>G<progressive ID>.
    • report_all_orfs: some Mikado models will have more than one ORF (unless pick is operating in the split mode). If this option is set to true, Mikado will report the transcript multiple times, once per ORF, using different progressive IDs (<model name>.orf<progressive ID>). By default, this option is set to false, and only the primary ORF is reported.
    • source: prefix for the source field in the output files. Loci GFF3 will have “<prefix>_loci”, subloci GFF3s will have “<prefix>_subloci”, and monoloci will have “<prefix>_monoloci”.
pick:
   files:
     #  Input and output files for Mikado pick.
     #  - gff: input GTF/GFF3 file. Default: mikado_prepared.gtf
     #  - loci_out: output GFF3 file from Mikado pick. Default: mikado.loci.gff3
     #  - subloci_out: optional GFF file with the intermediate subloci. Default: no
     #  output
     #  - monoloci_out: optional GFF file with the intermediate monoloci. Default:
     #  no output
     #  - log: log file for this step.
     input: mikado_prepared.gtf
     loci_out: mikado.loci.gff3
     log: mikado_pick.log
     monoloci_out: ''
     output_dir: .
     subloci_out: ''
   output_format:
     #  Parameters related to the output format.
     #    - source: prefix for the source field in the mikado output.
     #    - id_prefix: prefix for the ID of the genes/transcripts in the output
     id_prefix: mikado
     report_all_orfs: false
     source: Mikado
Generic parameters on the pick run

This section deals with other parameters necessary for the run, such as the number of processors to use, but also more important algorithmic parameters such as how to recognise fragments. Parameters:

  • consider_truncated_for_retained: normally, Mikado considers as retained introns only events in which a partially coding exon on the 3’ side becomes non-coding in the middle of a CDS intron of another transcript in the locus. If this option is set to true, Mikado will also consider as retained intron events the cases where the CDS of a transcript simply ends within a CDS intron of another model. Useful e.g. when dealing with CDS-only models.
  • exclude_cds: whether to remove CDS/UTR information from the Mikado output. Default: false.
  • intron_range: tuple that indicates the range of lengths in which most introns should fall. Transcripts with introns either shorter or longer than this interval will potentially be penalised, depending on the scoring scheme. For the paper, this parameter was set to the interval containing 98% of the introns of the reference annotation (i.e. cutting out the 1st and 99th percentiles).
  • preload: boolean. In certain cases, e.g. when the database is quite small, it might make sense to preload it in memory rather than relying on SQL queries. Set to false by default.
  • shm: boolean. In certain cases, especially when disk access is a severely limiting factor, it might make sense to copy a SQLite database into RAM before querying. If this parameter is set to true, Mikado will copy the SQLite database into a temporary file in RAM, and query it from there.
  • shm_db: string. If shm is set to true and this string is non-empty, Mikado will copy the database in memory to a file with this name and leave it there for other Mikado runs. The file will have to be removed manually.
  • procs: number of processors to use. Default: 1.
  • single_thread: boolean. If set to true, Mikado will completely disable multiprocessing. Useful mostly for debugging reasons.
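The intron_range interval described above can be derived from a reference annotation with the standard library; a minimal sketch (the helper name is ours, the 1st/99th percentile cut-offs are those used for the paper):

```python
import statistics

def derive_intron_range(intron_lengths):
    """Return the (1st percentile, 99th percentile) interval of the given
    intron lengths, mirroring how the intron_range tuple was chosen for
    the paper. Hypothetical helper, not part of Mikado.
    """
    # statistics.quantiles with n=100 yields the 99 percentile cut points.
    cuts = statistics.quantiles(intron_lengths, n=100)
    return int(cuts[0]), int(cuts[98])
```

The resulting tuple can then be pasted into the intron_range field of the configuration file.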

Warning

the shared-memory options are available only on Linux platforms.

run_options:
  #  Generic run options.
  #  - shm: boolean flag. If set and the DB is sqlite, it will be copied onto the
  #  /dev/shm faux partition
  #  - shm_db: String. It indicates a DB that has to be copied onto SHM and left
  #  there for
  #    concurrent Mikado runs.
  #  - shm_shared: boolean flag. If set, the database loaded onto SHM will be shared
  #  and should not be
  #    deleted at the end of the run (see shm_db).
  #    for faster access. Default: false
  #  - exclude_cds: boolean flag. If set, the CDS information will not be printed
  #  in Mikado output. Default: false
  #  - procs: number of processes to use. Default: 1
  #  - preload: boolean flag. If set, the whole database will be preloaded into
  #  memory for faster access. Useful when
  #    using SQLite databases.
  #  - single_thread: boolean flag. If set, multithreading will be disabled - useful
  #  for profiling and debugging.
  #  - consider_truncated_for_retained: boolean. Normally, Mikado considers only
  #  exons which span a whole intron as possible retained intron events. If this
  #  flag is set to true, also terminal exons will be considered.
  #  - remove_overlapping_fragments: DEPRECATED, see clustering.
  #  - purge: DEPRECATED, see clustering.
  consider_truncated_for_retained: false
  exclude_cds: false
  intron_range:
  - 60
  - 900
  preload: false
  procs: 1
  shm: false
  shm_db: ''
  single_thread: false
Miscellanea

It is possible to set high-level settings for the logs in the log_settings section:

  • log_level: level of the logging for Mikado. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL. By default, Mikado will be quiet and output log messages of severity WARNING or greater.
  • sql_level: level of the logging for messages regarding the database connection (through SQLAlchemy). By default, SQLAlchemy will be set in quiet mode and asked to output only messages of severity WARNING or greater.

Warning

Mikado and SQLAlchemy can be extremely verbose when asked to output DEBUG or INFO messages, to the point of slowing down the program significantly due to the amount of writing to disk. Please set the level to DEBUG only when there is a real problem to debug, not otherwise.

log_settings:
  #  Settings related to the logs. Keys:
  #  - sql_level: verbosity for SQL calls. Default: WARNING.
  #    In decreasing order: DEBUG, INFO, WARNING, ERROR, CRITICAL
  #  - log_level: verbosity. Default: WARNING.
  #    In decreasing order: DEBUG, INFO, WARNING, ERROR, CRITICAL
  log_level: WARNING
  sql_level: WARNING

It is also possible to set the type of multiprocessing method that should be used by Python3. The possible choices are “fork”, “spawn”, and “forkserver”.

multiprocessing_method: spawn
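These names map directly onto Python's own multiprocessing start methods; validating such a setting can be sketched as follows (illustrative only, not Mikado's actual code):

```python
import multiprocessing as mp

def choose_start_method(name):
    """Return a multiprocessing context for the given start method name.

    CPython accepts "fork", "spawn" and "forkserver"; mp.get_context
    raises ValueError for anything else. Hypothetical helper.
    """
    return mp.get_context(name)
```

Worker pools created from the returned context (e.g. `choose_start_method("spawn").Pool(4)`) will then use the requested start method regardless of the platform default.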
Technical details

The configuration file obeys a specific JSON schema which can be found at Mikado/configuration/configuration_blueprint.json. Every time a Mikado utility is launched, it checks the configuration file against the schema to validate it. The schema contains non-standard “Comment” and “SimpleComment” string arrays which are used at runtime to generate the comment strings in the YAML output.
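The way those string arrays become comments in the YAML output can be sketched as follows (the node structure below is our assumption based on the description above, not the actual blueprint):

```python
def yaml_comment_lines(schema_node, key="Comment"):
    """Render a schema node's "Comment" string array as the '#'-prefixed
    lines seen in the commented YAML output. Hypothetical helper."""
    return ["#  " + line for line in schema_node.get(key, [])]

# A made-up schema node mimicking the non-standard "Comment" array.
node = {
    "type": "object",
    "Comment": [
        "Settings related to the logs. Keys:",
        "- log_level: verbosity. Default: WARNING.",
    ],
}
print("\n".join(yaml_comment_lines(node)))
```

Nodes without a "Comment" (or "SimpleComment") array simply yield no comment lines.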

Mikado prepare

This is the first executive step of the Mikado pipeline. It will accomplish the following goals:

  1. Collect annotations from disparate annotation files
  2. Remove redundant assemblies, i.e. assemblies that are identical across the various input files.
  3. Determine the strand of the transcript junctions
  4. Ensure uniqueness of the transcript names
  5. Order the transcripts by locus
  6. Extract the transcript sequences.
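Step 2, the redundancy removal, amounts to keying each transcript by its exact coordinates; a simplified sketch (not Mikado's actual implementation):

```python
def remove_redundant(transcripts):
    """Keep one transcript per identical structure.

    Each transcript is a (name, chrom, strand, exons) tuple, with exons a
    tuple of (start, end) pairs; transcripts are redundant only when they
    match base-per-base. Here the first copy seen is kept, whereas Mikado
    picks one of the identical copies at random.
    """
    kept = {}
    for name, chrom, strand, exons in transcripts:
        kept.setdefault((chrom, strand, tuple(exons)), name)
    return sorted(kept.values())
```

Note that a transcript whose intron chain is merely contained within another's is not redundant under this definition.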
Usage

Mikado prepare allows overriding some of the parameters present in the configuration file through command line options, e.g. the input files. Nonetheless, in the interest of reproducibility we advise configuring everything through the configuration file and supplying it to Mikado prepare without further modifications.

Available parameters:

  • json-conf: the most important parameter. This is the configuration file created through Mikado configure.
  • fasta: reference genome. Required, either through the command line or through the configuration file.
  • out: Output GTF file, with the collapsed transcripts.
  • out_fasta: Output FASTA file of the collapsed transcripts.
  • start-method: multiprocessing start method.
  • verbose, quiet: flags to set the verbosity of Mikado prepare. It is generally not advised to turn the verbose mode on, unless there is a problem to debug, given the verbosity of the output.
  • strand-specific: If set, all assemblies will be treated as strand-specific.
  • strand-specific-assemblies: comma-separated list of strand specific assemblies.
  • list: as an alternative to specifying all the information on the command line, it is possible to give Mikado a tab-separated file with the following contents:
    1. Location of the file
    2. label for the file
    3. whether that assembly is strand-specific or not (write True/False)
    4. Optionally, a bonus/malus to be associated with transcripts coming from that assembly.
  • log: log file. Optional, by default Mikado will print to standard error.
  • lenient: flag. If set, transcripts without any canonical splice site will be output as well. By default, they would be discarded.
  • single: flag that disables multiprocessing. Mostly useful for debug purposes.
  • strip-cds: some aligners (e.g. GMAP) will try to calculate a CDS on the fly for alignments. Use this flag to discard such CDS sections and retain only the cDNA information.
  • minimum_length: minimum length of the transcripts to be kept.
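A list file, for instance, might look like the following (file names and labels are invented for illustration; columns are tab-separated):

```
assemblies/cufflinks.gtf	cuf	False
assemblies/stringtie.gtf	st	True	1
assemblies/trinity.gff3	tr	False	-0.5
```

The fourth column is optional and, when present, applies the given bonus or malus to every transcript coming from that assembly.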

Command line usage:

$ mikado prepare --help
usage: Mikado prepare [-h] [--fasta FASTA] [-v | -q]
                      [--start-method {fork,spawn,forkserver}]
                      [-s | -sa STRAND_SPECIFIC_ASSEMBLIES] [--list LIST]
                      [-l LOG] [--lenient] [-m MINIMUM_LENGTH] [-p PROCS]
                      [-scds] [--labels LABELS] [--single] [-od OUTPUT_DIR]
                      [-o OUT] [-of OUT_FASTA] [--json-conf JSON_CONF]
                      [gff [gff ...]]

Mikado prepare analyses an input GTF file and prepares it for the picking
analysis by sorting its transcripts and performing some simple consistency
checks.

positional arguments:
  gff                   Input GFF/GTF file(s).

optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA         Genome FASTA file. Required.
  -v, --verbose
  -q, --quiet
  --start-method {fork,spawn,forkserver}
                        Multiprocessing start method.
  -s, --strand-specific
                        Flag. If set, monoexonic transcripts will be left on
                        their strand rather than being moved to the unknown
                        strand.
  -sa STRAND_SPECIFIC_ASSEMBLIES, --strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
                        Comma-delimited list of strand specific assemblies.
  --list LIST           Tab-delimited file containing rows with the following
                        format <file> <label> <strandedness>
  -l LOG, --log LOG     Log file. Optional.
  --lenient             Flag. If set, transcripts with only non-canonical
                        splices will be output as well.
  -m MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
                        Minimum length for transcripts. Default: 200 bps.
  -p PROCS, --procs PROCS
                        Number of processors to use (default 1)
  -scds, --strip_cds    Boolean flag. If set, ignores any CDS/UTR segment.
  --labels LABELS       Labels to attach to the IDs of the transcripts of the
                        input files, separated by comma.
  --single              Disable multi-threading. Useful for debugging.
  -od OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
  -o OUT, --out OUT     Output file. Default: mikado_prepared.gtf.
  -of OUT_FASTA, --out_fasta OUT_FASTA
                        Output file. Default: mikado_prepared.fasta.
  --json-conf JSON_CONF
                        Configuration file.
Collection of transcripts from the annotation files

Different assemblers produce data in different formats, typically GFF or GTF, and not necessarily in the same order (if any order is present). Mikado will serialise the transcripts from these files and port them all into a standard GTF format. Moreover, it will ensure that each transcript ID appears only once across the input files. The optional labels provided for each file will be attached to the transcript names as prefixes, and used as the source field in the output GTF, to ensure the uniqueness of each transcript name. If two or more transcripts are found to be identical, only one will be retained, chosen at random among all the possibilities. In addition, Mikado prepare will sort the transcripts by coordinate, irrespective of strand, so that they are suitably ordered for the divide-et-impera algorithm of Mikado pick.

Warning

To be considered identical, two transcripts must match down to the last base pair. A simple match or containment of the intron chain will not suffice. This is because using the cDNA data alone it is difficult to understand whether the longer form(s) is the correct assembly rather than a chimera or a trans-splice event.

Check on strand correctness

During its run, Mikado prepare will also check the correctness of the transcripts. In particular:

  • Unless the assembly is marked as strand-specific, any monoexonic transcript will have its strand removed.
  • If a transcript contains canonical splice junctions on both strands, it will be completely removed
  • If a transcript contains only non-canonical splice junctions, it will be removed unless the “lenient” option is specified either at the command line or in the configuration file.

The pairs of splice donors and acceptors considered canonical can be specified in the configuration file. By default, Mikado will treat as canonical both the properly canonical splicing event (GT-AG) and the semi-canonical events (GC-AG, AT-AC). Any other pair will be considered non-canonical.

Warning

Mikado will check the strand of each junction inside a transcript independently. Therefore, if a transcript with 9 junctions on the plus strand is found to have a non-canonical splicing junction which happens to be the reverse of a canonical one (e.g. CT-AC), Mikado will deem this junction misassigned and flip it to the minus strand. In this example, the transcript will therefore be considered an error, as it contains both + and - junctions, and discarded.
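The per-junction strand check can be sketched as follows (a simplified illustration of the logic, not Mikado's code):

```python
# Canonical and semi-canonical donor-acceptor pairs on the plus strand.
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def revcomp(seq):
    """Reverse complement of a DNA string (uppercase ACGT only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def junction_strand(donor, acceptor):
    """Assign a strand to a junction from its terminal dinucleotides.

    Returns "+" for a canonical pair, "-" when the pair is the reverse
    complement of a canonical one (e.g. CT-AC for GT-AG), and None for a
    truly non-canonical junction. Illustrative sketch only.
    """
    if (donor, acceptor) in CANONICAL:
        return "+"
    if (revcomp(acceptor), revcomp(donor)) in CANONICAL:
        return "-"
    return None
```

A transcript whose junctions come back with both "+" and "-" assignments is exactly the mixed-strand error case described in the warning above.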

Output files

Mikado prepare will produce two files:

  • a sorted GTF file, containing all the transcripts surviving the checks
  • a FASTA file of the transcripts, in the proper cDNA orientation.

Warning

Contrary to other tools such as e.g. gffread from Cufflinks [Cufflinks], Mikado prepare will not try to calculate the loci for the transcripts. This task will be performed later in the pipeline. As such, the GTF file is formally incorrect, as multiple transcripts in the same locus but coming from different assemblies will not share the same gene_id, but will rather keep their original one. Moreover, if two gene_ids were identical but discrete in the input files (i.e. located on different sections of the genome), this error will not be corrected. If you desire to use this GTF file for any purpose, please use a tool like gffread to calculate the loci appropriately.

Mikado serialise

Mikado integrates data from multiple sources to select the best transcripts. During this step, these sources are brought together inside a common database, simplifying the process of retrieving them at runtime. Currently, Mikado integrates three different types of data:

  1. reliable junctions, as detected by Portcullis, in BED12 format
  2. ORF data, currently identified using TransDecoder, although any program capable of producing them in BED12 format is suitable
  3. BLASTX [Blastplus] data in XML format

After serialisation in the database, these sources will be available for any subsequent Mikado pick run. Having the data present in the database makes it possible to run Mikado with multiple configurations and little overhead in terms of pre-loading data; this feature is exploited directly by The Daijin pipeline for driving Mikado, where it is possible to run Mikado using multiple modes.

Mikado serialise can use three different SQL databases as backends - SQLite, MySQL and PostgreSQL - thanks to SQLAlchemy. This step, together with the creation of the TransDecoder and BLAST data, is the most time-consuming of the pipeline. In particular, although Mikado serialise will try to analyse the XML data in a parallelised fashion if so instructed, the insertion of the data into the database will still happen in a single thread and will therefore be of limited speed. If using SQLite as the database (the default option), it is possible to decrease the runtime by modifying the “max_objects” parameter, at the cost, however, of increased RAM usage.

Transdecoder ORFs

When Mikado analyses ORFs produced by TransDecoder or an equivalent program, it additionally performs the following checks:

  1. Check the congruence between the length of the transcript in the BED12 file and that found in the FASTA file
  2. Check that the ORF does not contain internal stop codons
  3. Check that the CDS length is valid, i.e. a multiple of 3, if the ORF is complete
  4. Optionally, if the ORF is open on the 5’ side, Mikado can try to find an internal start codon. See this section for details.
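The first three checks can be sketched as follows (a hypothetical validator over the BED12 fields, not the actual Mikado/TransDecoder implementation):

```python
def validate_orf(cdna, bed_len, cds_start, cds_end, has_start, has_stop):
    """Run the consistency checks described above on one ORF.

    cdna: transcript sequence from the FASTA file; bed_len: transcript
    length claimed by the BED12 line; cds_start/cds_end: 0-based,
    half-open CDS interval; has_start/has_stop: ORF completeness flags.
    Returns a list of problems (empty if the ORF passes). Hypothetical
    helper for illustration.
    """
    problems = []
    # 1. Length congruence between BED12 and FASTA.
    if bed_len != len(cdna):
        problems.append("BED12/FASTA length mismatch")
    cds = cdna[cds_start:cds_end]
    # 2. No internal stop codons (the final codon is allowed to be a stop).
    internal = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    if any(codon in ("TAA", "TAG", "TGA") for codon in internal):
        problems.append("internal stop codon")
    # 3. A complete ORF must have a CDS length that is a multiple of 3.
    if has_start and has_stop and len(cds) % 3:
        problems.append("complete ORF length not a multiple of 3")
    return problems
```

Check 4 (regressing along an ORF open at the 5’ end to find an internal start codon) is governed by the max-regression parameter described below.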
Usage

Mikado serialise allows overriding some of the parameters present in the configuration file through command line options, e.g. the input files. Nonetheless, in the interest of reproducibility we advise configuring everything through the configuration file and supplying it to Mikado serialise without further modifications.

Available parameters:

  • Parameters related to performance:
    • start-method: one of fork, spawn, forkserver. It determines the multiprocessing start method. By default, Mikado will use the default for the system (fork on UNIX, spawn on Windows).
    • procs: Number of processors to use.
    • single-thread: flag. If set, Mikado will disable all multithreading.
    • max_objects: Maximum number of objects to keep in memory before committing to the database. See this section of the configuration for details.
  • Basic input data and settings:
    • output-dir: directory where the SQLite database and the log will be written to.
    • transcripts: the FASTA file of the transcripts present in the GTF file considered by Mikado. Normally this should be the FASTA output of Mikado prepare.
    • genome_fai: FAIDX file of the genome FASTA. If not given, serialise will derive it from the “reference: genome” field of the configuration.
    • force: flag. If set, and the database is already present, it will be truncated rather than updated.
    • json-conf: this is the configuration file created with Mikado configure.
    • db: if the database is specified on the command line, mikado serialise will interpret it as a SQLite database. This will overwrite any setting present in the configuration file.
  • Parameters related to logging:
    • log: log file. It defaults to serialise.log.
    • log_level: verbosity of the logging. Please be advised that excessive verbosity can negatively impact the performance of the program - the debug mode is extremely verbose.
  • Parameters related to reliable junctions:
    • junctions: a BED12 file of reliable junctions. This can be obtained using Portcullis. Please see the relevant sidebar.
  • Parameters related to the treatment of ORF data:
    • orfs: ORF BED12 files, separated by comma.
    • max-regression: A percentage, expressed as a number between 0 and 1, which indicates how far Mikado can regress along the ORF to find a valid start codon. See the relevant section of the configuration for details.
  • Parameters related to BLAST data:
    • blast_targets: BLAST FASTA database.
    • discard-definition: Flag. Depending on how the database has been created, sometimes BLAST will substitute the ID of the sequence with “lcl|” ids. Mikado circumvents this by looking for the definition field in the XML file. Using this flag will disable this behaviour and force Mikado to use the ID - with the potential of having a mismatch between the sequences in the BLAST DB and the sequences in the BLAST files.
    • xml: BLAST files to parse. This can be one of the following:
      • A list of XML BLAST files, optionally compressed with GZip or BZip2, comma separated (suffix .xml)
      • A list of ASN BLAST files, optionally compressed with GZip or BZip2, comma separated (suffix .asn)
      • A list of folders, comma separated, where it is possible to find files of the former 2 types
      • A mixture of the three above types.
    • max-target-seqs: maximum number of BLAST targets that can be loaded per sequence, for each BLAST alignment. Please note that if you align against multiple databases, this threshold will be applied once per file.

Hint

Mikado will parallelise only the reading of multiple XML files. As such, this part of the pipeline performs less well than the other steps.

Warning

It is advised to set the start method to spawn even on UNIX. See the dedicated sidebar for details.

Usage:

$ mikado serialise --help
usage: Mikado serialise [-h] [--start-method {fork,spawn,forkserver}]
                        [--orfs ORFS] [--transcripts TRANSCRIPTS]
                        [-mr MAX_REGRESSION]
                        [--max_target_seqs MAX_TARGET_SEQS]
                        [--blast_targets BLAST_TARGETS] [--discard-definition]
                        [--xml XML] [-p PROCS] [--single-thread]
                        [--genome_fai GENOME_FAI] [--junctions JUNCTIONS]
                        [-mo MAX_OBJECTS] [-f] --json-conf JSON_CONF
                        [-l [LOG]] [-od OUTPUT_DIR]
                        [-lv {DEBUG,INFO,WARN,ERROR}]
                        [db]

Mikado serialise creates the database used by the pick program. It handles
Junction and ORF BED12 files as well as BLAST XML results.

optional arguments:
  -h, --help            show this help message and exit
  --start-method {fork,spawn,forkserver}
                        Multiprocessing start method.
  -od OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory

  --orfs ORFS           ORF BED file(s), separated by commas
  --transcripts TRANSCRIPTS
                        Transcript FASTA file(s) used for ORF calling and
                        BLAST queries, separated by commas. If multiple files
                        are given, they must be in the same order of the ORF
                        files. E.g. valid command lines are:
                        --transcript_fasta all_seqs1.fasta --orfs all_orfs.bed
                        --transcript_fasta seq1.fasta,seq2.fasta --orfs
                        orfs1.bed,orf2.bed --transcript_fasta all_seqs.fasta
                        --orfs orfs1.bed,orf2.bed These are invalid instead: #
                        Inverted order --transcript_fasta
                        seq1.fasta,seq2.fasta --orfs orfs2.bed,orf1.bed #Two
                        transcript files, one ORF file --transcript_fasta
                        seq1.fasta,seq2.fasta --orfs all_orfs.bed
  -mr MAX_REGRESSION, --max-regression MAX_REGRESSION
                        Amount of sequence in the ORF (in %) to backtrack in
                        order to find a valid START codon, if one is absent.
                        Default: None

  --max_target_seqs MAX_TARGET_SEQS
                        Maximum number of target sequences.
  --blast_targets BLAST_TARGETS
                        Target sequences
  --discard-definition  Flag. If set, the sequences IDs instead of their
                        definition will be used for serialisation.
  --xml XML             XML file(s) to parse. They can be provided in three
                        ways: - a comma-separated list - as a base folder -
                        using bash-like name expansion (*,?, etc.). In this
                        case, you have to enclose the filename pattern in
                        double quotes. Multiple folders/file patterns can be
                        given, separated by a comma.
  -p PROCS, --procs PROCS
                        Number of threads to use for analysing the BLAST
                        files. This number should not be higher than the total
                        number of XML files.
  --single-thread       Force serialise to run with a single thread,
                        irrespective of other configuration options.

  --genome_fai GENOME_FAI
  --junctions JUNCTIONS

  -mo MAX_OBJECTS, --max-objects MAX_OBJECTS
                        Maximum number of objects to cache in memory before
                        committing to the database. Default: 100,000 i.e.
                        approximately 450MB RAM usage for Drosophila.
  -f, --force           Flag. If set, an existing database will be deleted
                        (sqlite) or dropped (MySQL/PostGreSQL) before
                        beginning the serialisation.
  --json-conf JSON_CONF
  -l [LOG], --log [LOG]
                        Optional log file. Default: stderr
  -lv {DEBUG,INFO,WARN,ERROR}, --log_level {DEBUG,INFO,WARN,ERROR}
                        Log level. Default: INFO
  db                    Optional output database. Default: derived from
                        json_conf
Technical details

The schema of the database is quite simple, as it is composed of only 7 discrete tables in two groups. The first group, chrom and junctions, serialises the information pertaining to the reliable junctions - i.e. information which is relative not to the transcripts but rather to their genomic locations. The second group serialises the data regarding ORFs and BLAST files. The need for a database is driven mainly by the latter, as querying a relational database is faster than retrieving the information from the XML files themselves at runtime.
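The chrom/junctions group can be sketched with the standard library's sqlite3 module (the column set below is illustrative, not Mikado's exact schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two tables: one for chromosome names/lengths, one for the reliable
# junctions that reference them. Column names are our assumption.
conn.executescript("""
CREATE TABLE chrom (
    chrom_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    length INTEGER
);
CREATE TABLE junctions (
    junction_id INTEGER PRIMARY KEY,
    chrom_id INTEGER REFERENCES chrom (chrom_id),
    junction_start INTEGER,
    junction_end INTEGER,
    strand TEXT
);
""")
conn.execute("INSERT INTO chrom (name, length) VALUES ('Chr5', 26975502)")
conn.execute(
    "INSERT INTO junctions (chrom_id, junction_start, junction_end, strand) "
    "SELECT chrom_id, 26585224, 26585344, '+' FROM chrom WHERE name = 'Chr5'"
)
row = conn.execute(
    "SELECT c.name, j.junction_start FROM junctions j "
    "JOIN chrom c USING (chrom_id)"
).fetchone()
print(row)  # ('Chr5', 26585224)
```

Keeping the genomic coordinates in their own tables is what lets pick look up junctions by location without touching the transcript-level (ORF/BLAST) tables.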

Database schema used by Mikado.

_images/database_schema.png
Mikado pick

This is the final stage of the pipeline, in which Mikado identifies gene loci and selects the best transcripts.

Input files

mikado pick requires as input files the following:

  1. A sorted GTF file with unique transcript names, derived through the prepare stage.
  2. A database containing all the external data to be integrated with the transcript structure, derived through the serialisation stage.
  3. A scoring file specifying the minimum requirements for transcripts and the relevant metrics for scoring. See the section on scoring files for details.
Output files

mikado pick will produce three kinds of output files: a GFF3 file, a metrics file, and a scores file. This triad will be produced for the loci level, and optionally also for the subloci and monoloci levels.

GFF3 files

This output file is a standard-compliant GFF, with the addition of the superloci to indicate the original spans. An example with two superloci, one on the negative and one on the positive strand, follows:

##gff-version 3
##sequence-region Chr5 1 26975502
Chr5    Mikado_loci     superlocus      26584796        26601707        .       +       .       ID=Mikado_superlocus:Chr5+:26584796-26601707;Name=superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     gene    26584796        26587912        23      +       .       ID=mikado.Chr5G1;Name=mikado.Chr5G1;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26584796        26587912        24      +       .       ID=mikado.Chr5G1.2;Parent=mikado.Chr5G1;Name=mikado.Chr5G1.2;alias=st_Stringtie_STAR.21710.1;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;ccode=j;cov=25.165945;primary=False
Chr5    Mikado_loci     exon    26584796        26584879        .       +       .       ID=mikado.Chr5G1.2.exon1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     five_prime_UTR  26584796        26584879        .       +       .       ID=mikado.Chr5G1.2.five_prime_UTR1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26585220        26585273        .       +       .       ID=mikado.Chr5G1.2.exon2;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     five_prime_UTR  26585220        26585222        .       +       .       ID=mikado.Chr5G1.2.five_prime_UTR2;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26585223        26585273        .       +       0       ID=mikado.Chr5G1.2.CDS1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26585345        26585889        .       +       0       ID=mikado.Chr5G1.2.CDS2;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26585345        26585889        .       +       .       ID=mikado.Chr5G1.2.exon3;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26585982        26586102        .       +       1       ID=mikado.Chr5G1.2.CDS3;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26585982        26586102        .       +       .       ID=mikado.Chr5G1.2.exon4;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586217        26586294        .       +       0       ID=mikado.Chr5G1.2.CDS4;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586217        26586294        .       +       .       ID=mikado.Chr5G1.2.exon5;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586420        26586524        .       +       0       ID=mikado.Chr5G1.2.CDS5;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586420        26586524        .       +       .       ID=mikado.Chr5G1.2.exon6;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586638        26586850        .       +       0       ID=mikado.Chr5G1.2.CDS6;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586638        26586850        .       +       .       ID=mikado.Chr5G1.2.exon7;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26586934        26586996        .       +       0       ID=mikado.Chr5G1.2.CDS7;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26586934        26586996        .       +       .       ID=mikado.Chr5G1.2.exon8;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26587084        26587202        .       +       0       ID=mikado.Chr5G1.2.CDS8;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26587084        26587202        .       +       .       ID=mikado.Chr5G1.2.exon9;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26587287        26587345        .       +       1       ID=mikado.Chr5G1.2.CDS9;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26587287        26587345        .       +       .       ID=mikado.Chr5G1.2.exon10;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     CDS     26587427        26587755        .       +       2       ID=mikado.Chr5G1.2.CDS10;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     exon    26587427        26587912        .       +       .       ID=mikado.Chr5G1.2.exon11;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     three_prime_UTR 26587756        26587912        .       +       .       ID=mikado.Chr5G1.2.three_prime_UTR1;Parent=mikado.Chr5G1.2
Chr5    Mikado_loci     mRNA    26584930        26587912        23      +       .       ID=mikado.Chr5G1.1;Parent=mikado.Chr5G1;Name=mikado.Chr5G1.1;alias=st_Stringtie_STAR.21710.3;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;cov=2.207630;primary=True
Chr5    Mikado_loci     exon    26584930        26585023        .       +       .       ID=mikado.Chr5G1.1.exon1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     five_prime_UTR  26584930        26585023        .       +       .       ID=mikado.Chr5G1.1.five_prime_UTR1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26585220        26585273        .       +       .       ID=mikado.Chr5G1.1.exon2;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     five_prime_UTR  26585220        26585222        .       +       .       ID=mikado.Chr5G1.1.five_prime_UTR2;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26585223        26585273        .       +       0       ID=mikado.Chr5G1.1.CDS1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26585345        26585889        .       +       0       ID=mikado.Chr5G1.1.CDS2;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26585345        26585889        .       +       .       ID=mikado.Chr5G1.1.exon3;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26585982        26586102        .       +       1       ID=mikado.Chr5G1.1.CDS3;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26585982        26586102        .       +       .       ID=mikado.Chr5G1.1.exon4;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586217        26586294        .       +       0       ID=mikado.Chr5G1.1.CDS4;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586217        26586294        .       +       .       ID=mikado.Chr5G1.1.exon5;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586420        26586524        .       +       0       ID=mikado.Chr5G1.1.CDS5;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586420        26586524        .       +       .       ID=mikado.Chr5G1.1.exon6;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586638        26586850        .       +       0       ID=mikado.Chr5G1.1.CDS6;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586638        26586850        .       +       .       ID=mikado.Chr5G1.1.exon7;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26586934        26586996        .       +       0       ID=mikado.Chr5G1.1.CDS7;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26586934        26586996        .       +       .       ID=mikado.Chr5G1.1.exon8;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26587084        26587202        .       +       0       ID=mikado.Chr5G1.1.CDS8;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26587084        26587202        .       +       .       ID=mikado.Chr5G1.1.exon9;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26587287        26587345        .       +       1       ID=mikado.Chr5G1.1.CDS9;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26587287        26587345        .       +       .       ID=mikado.Chr5G1.1.exon10;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     CDS     26587427        26587755        .       +       2       ID=mikado.Chr5G1.1.CDS10;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     exon    26587427        26587912        .       +       .       ID=mikado.Chr5G1.1.exon11;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     three_prime_UTR 26587756        26587912        .       +       .       ID=mikado.Chr5G1.1.three_prime_UTR1;Parent=mikado.Chr5G1.1
Chr5    Mikado_loci     gene    26588402        26592561        20      +       .       ID=mikado.Chr5G2;Name=mikado.Chr5G2;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26588402        26592561        24      +       .       ID=mikado.Chr5G2.2;Parent=mikado.Chr5G2;Name=mikado.Chr5G2.2;alias=st_Stringtie_STAR.21710.9.split1;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;ccode=j;cov=0.000000;primary=False
Chr5    Mikado_loci     exon    26588402        26588625        .       +       .       ID=mikado.Chr5G2.2.exon1;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     five_prime_UTR  26588402        26588625        .       +       .       ID=mikado.Chr5G2.2.five_prime_UTR1;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26589203        26589279        .       +       .       ID=mikado.Chr5G2.2.exon2;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     five_prime_UTR  26589203        26589237        .       +       .       ID=mikado.Chr5G2.2.five_prime_UTR2;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26589238        26589279        .       +       0       ID=mikado.Chr5G2.2.CDS1;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26589386        26590167        .       +       0       ID=mikado.Chr5G2.2.CDS2;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26589386        26590167        .       +       .       ID=mikado.Chr5G2.2.exon3;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590261        26590393        .       +       1       ID=mikado.Chr5G2.2.CDS3;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590261        26590393        .       +       .       ID=mikado.Chr5G2.2.exon4;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590495        26590566        .       +       0       ID=mikado.Chr5G2.2.CDS4;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590495        26590566        .       +       .       ID=mikado.Chr5G2.2.exon5;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590641        26590739        .       +       0       ID=mikado.Chr5G2.2.CDS5;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590641        26590739        .       +       .       ID=mikado.Chr5G2.2.exon6;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26590880        26591092        .       +       0       ID=mikado.Chr5G2.2.CDS6;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26590880        26591092        .       +       .       ID=mikado.Chr5G2.2.exon7;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591174        26591236        .       +       0       ID=mikado.Chr5G2.2.CDS7;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591174        26591236        .       +       .       ID=mikado.Chr5G2.2.exon8;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591324        26591442        .       +       0       ID=mikado.Chr5G2.2.CDS8;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591324        26591442        .       +       .       ID=mikado.Chr5G2.2.exon9;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591520        26591578        .       +       1       ID=mikado.Chr5G2.2.CDS9;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591520        26591578        .       +       .       ID=mikado.Chr5G2.2.exon10;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26591681        26592002        .       +       2       ID=mikado.Chr5G2.2.CDS10;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26591681        26592002        .       +       .       ID=mikado.Chr5G2.2.exon11;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     CDS     26592528        26592561        .       +       1       ID=mikado.Chr5G2.2.CDS11;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     exon    26592528        26592561        .       +       .       ID=mikado.Chr5G2.2.exon12;Parent=mikado.Chr5G2.2
Chr5    Mikado_loci     mRNA    26588402        26592561        20      +       .       ID=mikado.Chr5G2.1;Parent=mikado.Chr5G2;Name=mikado.Chr5G2.1;alias=st_Stringtie_STAR.21710.6.split3;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5    Mikado_loci     exon    26588402        26588625        .       +       .       ID=mikado.Chr5G2.1.exon1;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     five_prime_UTR  26588402        26588625        .       +       .       ID=mikado.Chr5G2.1.five_prime_UTR1;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26589196        26589279        .       +       .       ID=mikado.Chr5G2.1.exon2;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     five_prime_UTR  26589196        26589237        .       +       .       ID=mikado.Chr5G2.1.five_prime_UTR2;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26589238        26589279        .       +       0       ID=mikado.Chr5G2.1.CDS1;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26589386        26590167        .       +       0       ID=mikado.Chr5G2.1.CDS2;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26589386        26590167        .       +       .       ID=mikado.Chr5G2.1.exon3;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590261        26590393        .       +       1       ID=mikado.Chr5G2.1.CDS3;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590261        26590393        .       +       .       ID=mikado.Chr5G2.1.exon4;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590495        26590566        .       +       0       ID=mikado.Chr5G2.1.CDS4;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590495        26590566        .       +       .       ID=mikado.Chr5G2.1.exon5;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590641        26590739        .       +       0       ID=mikado.Chr5G2.1.CDS5;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590641        26590739        .       +       .       ID=mikado.Chr5G2.1.exon6;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26590880        26591092        .       +       0       ID=mikado.Chr5G2.1.CDS6;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26590880        26591092        .       +       .       ID=mikado.Chr5G2.1.exon7;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591174        26591236        .       +       0       ID=mikado.Chr5G2.1.CDS7;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591174        26591236        .       +       .       ID=mikado.Chr5G2.1.exon8;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591324        26591442        .       +       0       ID=mikado.Chr5G2.1.CDS8;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591324        26591442        .       +       .       ID=mikado.Chr5G2.1.exon9;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591520        26591578        .       +       1       ID=mikado.Chr5G2.1.CDS9;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591520        26591578        .       +       .       ID=mikado.Chr5G2.1.exon10;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26591681        26592002        .       +       2       ID=mikado.Chr5G2.1.CDS10;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26591681        26592002        .       +       .       ID=mikado.Chr5G2.1.exon11;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     CDS     26592528        26592561        .       +       1       ID=mikado.Chr5G2.1.CDS11;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     exon    26592528        26592561        .       +       .       ID=mikado.Chr5G2.1.exon12;Parent=mikado.Chr5G2.1
Chr5    Mikado_loci     gene    26592649        26595691        19      +       .       ID=mikado.Chr5G3;Name=mikado.Chr5G3;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26592720        26595691        19      +       .       ID=mikado.Chr5G3.1;Parent=mikado.Chr5G3;Name=mikado.Chr5G3.1;alias=st_Stringtie_STAR.21710.7.split2;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20;canonical_number=20;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5    Mikado_loci     CDS     26592720        26593365        .       +       0       ID=mikado.Chr5G3.1.CDS1;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26592720        26593365        .       +       .       ID=mikado.Chr5G3.1.exon1;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26593449        26593836        .       +       2       ID=mikado.Chr5G3.1.CDS2;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26593449        26593836        .       +       .       ID=mikado.Chr5G3.1.exon2;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26593930        26594062        .       +       1       ID=mikado.Chr5G3.1.CDS3;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26593930        26594062        .       +       .       ID=mikado.Chr5G3.1.exon3;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594172        26594243        .       +       0       ID=mikado.Chr5G3.1.CDS4;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594172        26594243        .       +       .       ID=mikado.Chr5G3.1.exon4;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594318        26594416        .       +       0       ID=mikado.Chr5G3.1.CDS5;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594318        26594416        .       +       .       ID=mikado.Chr5G3.1.exon5;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594569        26594772        .       +       0       ID=mikado.Chr5G3.1.CDS6;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594569        26594772        .       +       .       ID=mikado.Chr5G3.1.exon6;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26594860        26594922        .       +       0       ID=mikado.Chr5G3.1.CDS7;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26594860        26594922        .       +       .       ID=mikado.Chr5G3.1.exon7;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26595003        26595121        .       +       0       ID=mikado.Chr5G3.1.CDS8;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26595003        26595121        .       +       .       ID=mikado.Chr5G3.1.exon8;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26595210        26595268        .       +       1       ID=mikado.Chr5G3.1.CDS9;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26595210        26595268        .       +       .       ID=mikado.Chr5G3.1.exon9;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     CDS     26595366        26595691        .       +       2       ID=mikado.Chr5G3.1.CDS10;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     exon    26595366        26595691        .       +       .       ID=mikado.Chr5G3.1.exon10;Parent=mikado.Chr5G3.1
Chr5    Mikado_loci     mRNA    26592649        26595268        21      +       .       ID=mikado.Chr5G3.2;Parent=mikado.Chr5G3;Name=mikado.Chr5G3.2;abundance=2.390309;alias=cl_Chr5.6283;canonical_junctions=1,2,3,4,5,6,7,8;canonical_number=8;canonical_proportion=1.0;ccode=j;primary=False
Chr5    Mikado_loci     exon    26592649        26593365        .       +       .       ID=mikado.Chr5G3.2.exon1;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     five_prime_UTR  26592649        26592719        .       +       .       ID=mikado.Chr5G3.2.five_prime_UTR1;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26592720        26593365        .       +       0       ID=mikado.Chr5G3.2.CDS1;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26593449        26593836        .       +       2       ID=mikado.Chr5G3.2.CDS2;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26593449        26593836        .       +       .       ID=mikado.Chr5G3.2.exon2;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26593930        26594095        .       +       1       ID=mikado.Chr5G3.2.CDS3;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26593930        26594095        .       +       .       ID=mikado.Chr5G3.2.exon3;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594172        26594243        .       +       0       ID=mikado.Chr5G3.2.CDS4;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594172        26594243        .       +       .       ID=mikado.Chr5G3.2.exon4;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594318        26594416        .       +       0       ID=mikado.Chr5G3.2.CDS5;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594318        26594416        .       +       .       ID=mikado.Chr5G3.2.exon5;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594569        26594772        .       +       0       ID=mikado.Chr5G3.2.CDS6;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594569        26594772        .       +       .       ID=mikado.Chr5G3.2.exon6;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26594860        26594922        .       +       0       ID=mikado.Chr5G3.2.CDS7;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26594860        26594922        .       +       .       ID=mikado.Chr5G3.2.exon7;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26595003        26595121        .       +       0       ID=mikado.Chr5G3.2.CDS8;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26595003        26595121        .       +       .       ID=mikado.Chr5G3.2.exon8;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     CDS     26595210        26595268        .       +       1       ID=mikado.Chr5G3.2.CDS9;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     exon    26595210        26595268        .       +       .       ID=mikado.Chr5G3.2.exon9;Parent=mikado.Chr5G3.2
Chr5    Mikado_loci     gene    26596207        26598231        20      +       .       ID=mikado.Chr5G4;Name=mikado.Chr5G4;multiexonic=False;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26596207        26598231        20      +       .       ID=mikado.Chr5G4.1;Parent=mikado.Chr5G4;Name=mikado.Chr5G4.1;alias=st_Stringtie_STAR.21710.6.split3;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5    Mikado_loci     CDS     26596207        26598192        .       +       0       ID=mikado.Chr5G4.1.CDS1;Parent=mikado.Chr5G4.1
Chr5    Mikado_loci     exon    26596207        26598231        .       +       .       ID=mikado.Chr5G4.1.exon1;Parent=mikado.Chr5G4.1
Chr5    Mikado_loci     three_prime_UTR 26598193        26598231        .       +       .       ID=mikado.Chr5G4.1.three_prime_UTR1;Parent=mikado.Chr5G4.1
Chr5    Mikado_loci     gene    26599417        26601137        20      +       .       ID=mikado.Chr5G5;Name=mikado.Chr5G5;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5    Mikado_loci     mRNA    26599417        26601137        20      +       .       ID=mikado.Chr5G5.1;Parent=mikado.Chr5G5;Name=mikado.Chr5G5.1;abundance=0.371780;alias=cl_Chr5.6286;canonical_junctions=1,2,3,4,5,6;canonical_number=6;canonical_proportion=1.0;primary=True
Chr5    Mikado_loci     exon    26599417        26599654        .       +       .       ID=mikado.Chr5G5.1.exon1;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     five_prime_UTR  26599417        26599612        .       +       .       ID=mikado.Chr5G5.1.five_prime_UTR1;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26599613        26599654        .       +       0       ID=mikado.Chr5G5.1.CDS1;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26599767        26600053        .       +       0       ID=mikado.Chr5G5.1.CDS2;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26599767        26600053        .       +       .       ID=mikado.Chr5G5.1.exon2;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600151        26600244        .       +       1       ID=mikado.Chr5G5.1.CDS3;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600151        26600244        .       +       .       ID=mikado.Chr5G5.1.exon3;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600314        26600394        .       +       0       ID=mikado.Chr5G5.1.CDS4;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600314        26600394        .       +       .       ID=mikado.Chr5G5.1.exon4;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600497        26600616        .       +       0       ID=mikado.Chr5G5.1.CDS5;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600497        26600616        .       +       .       ID=mikado.Chr5G5.1.exon5;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600696        26600908        .       +       0       ID=mikado.Chr5G5.1.CDS6;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600696        26600908        .       +       .       ID=mikado.Chr5G5.1.exon6;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     CDS     26600987        26601085        .       +       0       ID=mikado.Chr5G5.1.CDS7;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     exon    26600987        26601137        .       +       .       ID=mikado.Chr5G5.1.exon7;Parent=mikado.Chr5G5.1
Chr5    Mikado_loci     three_prime_UTR 26601086        26601137        .       +       .       ID=mikado.Chr5G5.1.three_prime_UTR1;Parent=mikado.Chr5G5.1
###
Chr5    Mikado_loci     superlocus      26575364        26579730        .       -       .       ID=Mikado_superlocus:Chr5-:26575364-26579730;Name=superlocus:Chr5-:26575364-26579730
Chr5    Mikado_loci     ncRNA_gene      26575364        26579730        18      -       .       ID=mikado.Chr5G6;Name=mikado.Chr5G6;multiexonic=True;superlocus=Mikado_superlocus:Chr5-:26575364-26579730
Chr5    Mikado_loci     ncRNA   26575711        26579730        18      -       .       ID=mikado.Chr5G6.1;Parent=mikado.Chr5G6;Name=cl_Chr5.6271;abundance=1.141582;alias=cl_Chr5.6271;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;primary=True
Chr5    Mikado_loci     exon    26575711        26575797        .       -       .       ID=mikado.Chr5G6.1.exon1;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26575885        26575944        .       -       .       ID=mikado.Chr5G6.1.exon2;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26576035        26576134        .       -       .       ID=mikado.Chr5G6.1.exon3;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26576261        26577069        .       -       .       ID=mikado.Chr5G6.1.exon4;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26577163        26577288        .       -       .       ID=mikado.Chr5G6.1.exon5;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26577378        26577449        .       -       .       ID=mikado.Chr5G6.1.exon6;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26577856        26577937        .       -       .       ID=mikado.Chr5G6.1.exon7;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26578239        26578792        .       -       .       ID=mikado.Chr5G6.1.exon8;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26579079        26579161        .       -       .       ID=mikado.Chr5G6.1.exon9;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26579301        26579395        .       -       .       ID=mikado.Chr5G6.1.exon10;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     exon    26579602        26579730        .       -       .       ID=mikado.Chr5G6.1.exon11;Parent=mikado.Chr5G6.1
Chr5    Mikado_loci     ncRNA   26578496        26579563        13      -       .       ID=mikado.Chr5G6.3;Parent=mikado.Chr5G6;Name=tr_c73_g1_i1.mrna1.160;alias=tr_c73_g1_i1.mrna1.160;canonical_junctions=1;canonical_number=1;canonical_proportion=1.0;ccode=j;gene_name=c73_g1_i1;primary=False
Chr5    Mikado_loci     exon    26578496        26578518        .       -       .       ID=mikado.Chr5G6.3.exon1;Parent=mikado.Chr5G6.3
Chr5    Mikado_loci     exon    26579301        26579563        .       -       .       ID=mikado.Chr5G6.3.exon2;Parent=mikado.Chr5G6.3
Chr5    Mikado_loci     ncRNA   26575364        26578163        16      -       .       ID=mikado.Chr5G6.2;Parent=mikado.Chr5G6;Name=cuff_cufflinks_star_at.23553.1;alias=cuff_cufflinks_star_at.23553.1;fpkm=2.9700103727;canonical_junctions=1,2,3,4,5,6,7,8;canonical_number=8;canonical_proportion=1.0;ccode=j;conf_hi=3.260618;conf_lo=2.679403;cov=81.895309;frac=0.732092;primary=False
Chr5    Mikado_loci     exon    26575364        26575410        .       -       .       ID=mikado.Chr5G6.2.exon1;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26575495        26575620        .       -       .       ID=mikado.Chr5G6.2.exon2;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26575711        26575797        .       -       .       ID=mikado.Chr5G6.2.exon3;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26575885        26575944        .       -       .       ID=mikado.Chr5G6.2.exon4;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26576035        26576134        .       -       .       ID=mikado.Chr5G6.2.exon5;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26576261        26577069        .       -       .       ID=mikado.Chr5G6.2.exon6;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26577163        26577288        .       -       .       ID=mikado.Chr5G6.2.exon7;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26577378        26577449        .       -       .       ID=mikado.Chr5G6.2.exon8;Parent=mikado.Chr5G6.2
Chr5    Mikado_loci     exon    26577856        26578163        .       -       .       ID=mikado.Chr5G6.2.exon9;Parent=mikado.Chr5G6.2
###
Things to note:
  • Multiple RNAs of the same gene are identified by progressive enumeration after a “.” (e.g. mikado.Chr5G5.1, mikado.Chr5G5.2, etc.).
  • All RNAs retain their original name in the “alias” attribute. If a transcript was split due to the presence of multiple ORFs, its alias will end with “.split<progressive ID>”.
  • RNAs carry the boolean attribute “primary”, which marks them as either the primary transcript of the gene or an alternative splicing isoform.
  • Non-primary RNAs have the additional “ccode” attribute, which reports the class code assigned to them when they were compared against the primary transcript.
  • Multiexonic RNAs carry the attributes “canonical_junctions”, “canonical_number” and “canonical_proportion”. These properties are calculated by Mikado during the prepare stage.
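The attributes listed above all live in the ninth column of the GFF3. As a minimal sketch (plain Python, not part of Mikado itself), they can be recovered from any of the lines shown above:

```python
def parse_attributes(attribute_column):
    """Split a GFF3 attribute column ("key=value;key=value;...") into a dict."""
    attributes = {}
    for field in attribute_column.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attributes[key] = value
    return attributes

# Ninth column of the mikado.Chr5G2.2 mRNA line shown above (abridged):
column = ("ID=mikado.Chr5G2.2;Parent=mikado.Chr5G2;"
          "alias=st_Stringtie_STAR.21710.9.split1;ccode=j;primary=False")
attrs = parse_attributes(column)
print(attrs["alias"])    # st_Stringtie_STAR.21710.9.split1 (note the ".split1" suffix)
print(attrs["primary"])  # False: an alternative splicing isoform
```

Note that GFF3 values are plain strings: “primary=False” must be interpreted by the consumer, not by the parser.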
Metrics files

These are tabular files that report the raw values of every metric for each transcript. This is the section of the metrics file corresponding to the GFF3 file above:

tid parent  score   best_bits       blast_score     canonical_intron_proportion     cdna_length     cds_not_maximal cds_not_maximal_fraction        combined_cds_fraction   combined_cds_intron_fraction    combined_cds_length     combined_cds_num        combined_cds_num_fraction       combined_utr_fraction   combined_utr_length     end_distance_from_junction      end_distance_from_tes   exon_fraction   exon_num        five_utr_length five_utr_num    five_utr_num_complete   has_start_codon has_stop_codon  highest_cds_exon_number highest_cds_exons_num   intron_fraction is_complete     max_intron_length       min_intron_length       non_verified_introns_num        num_introns_greater_than_max    num_introns_smaller_than_min    number_internal_orfs    proportion_verified_introns     proportion_verified_introns_inlocus     retained_fraction       retained_intron_num     selected_cds_exons_fraction     selected_cds_fraction   selected_cds_intron_fraction    selected_cds_length     selected_cds_num        selected_cds_number_fraction    selected_end_distance_from_junction     selected_end_distance_from_tes  selected_start_distance_from_tss        snowy_blast_score       source_score    start_distance_from_tss three_utr_length        three_utr_num   three_utr_num_complete  utr_fraction    utr_length      utr_num utr_num_complete        verified_introns_num
mikado.Chr5G1.2     mikado.Chr5G1   19.0    1086.25 1086.25 1.0     1927    0       0.0     0.87    1.0     1683    10      0.91    0.13    244     0       157     0.92    11      87      2       1       True    True    10      10      0.91    True    340     71      10      0       0       1       0.0     0       0.0     0       0.91    0.87    1.0     1683    10      0.91    0       157     87      13.78   0       87      157     1       0       0.13    244     3       1       0
mikado.Chr5G1.1     mikado.Chr5G1   21.89   1086.63 1086.63 1.0     1937    0       0.0     0.87    1.0     1683    10      0.91    0.13    254     0       157     0.92    11      97      2       1       True    True    10      10      0.91    True    196     71      10      0       0       1       0.0     0       0.0     0       0.91    0.87    1.0     1683    10      0.91    0       157     97      13.78   0       97      157     1       0       0.13    254     3       1       0
mikado.Chr5G2.2     mikado.Chr5G2   19.04   1140.95 1140.95 1.0     2197    0       0.0     0.88    1.0     1938    11      0.92    0.12    259     0       0       0.92    12      259     2       1       True    True    11      11      0.92    True    577     74      11      0       0       1       0.0     0       0.0     0       0.92    0.88    1.0     1938    11      0.92    0       0       259     16.66   0       259     0       0       0       0.12    259     2       1       0
mikado.Chr5G2.1     mikado.Chr5G2   20.06   1140.95 1140.95 1.0     2204    0       0.0     0.88    1.0     1938    11      0.92    0.12    266     0       0       0.92    12      266     2       1       True    True    11      11      0.92    True    570     74      11      0       0       1       0.0     0       0.0     0       0.92    0.88    1.0     1938    11      0.92    0       0       266     16.66   0       266     0       0       0       0.12    266     2       1       0
mikado.Chr5G3.2     mikado.Chr5G3   8.59    1193.72 1193.72 1.0     1887    0       0.0     0.96    0.8     1816    9       1.0     0.04    71      0       0       0.75    9       71      1       0       True    False   9       9       0.8     False   152     74      8       0       0       1       0.0     0       0.0     0       1.0     0.96    0.8     1816    9       1.0     0       0       71      14.16   0       71      0       0       0       0.04    71      1       0       0
mikado.Chr5G3.1     mikado.Chr5G3   19.0    1353.19 1353.19 1.0     2109    0       0.0     1.0     0.9     2109    10      1.0     0.0     0       0       0       0.83    10      0       0       0       True    True    10      10      0.9     True    152     74      9       0       0       1       0.0     0       0.0     0       1.0     1.0     0.9     2109    10      1.0     0       0       0       16.66   0       0       0       0       0       0.0     0       0       0       0
mikado.Chr5G4.1     mikado.Chr5G4   20.0    1258.43 1258.43 1.0     2025    0       0.0     0.98    0       1986    1       1.0     0.02    39      0       39      1.0     1       0       0       0       True    True    1       1       0       True    0       0       0       0       0       1       0       0       0.0     0       1.0     0.98    0       1986    1       1.0     0       39      0       16.66   0       0       39      1       0       0.02    39      1       0       0
mikado.Chr5G5.1     mikado.Chr5G5   20.0    565.46  565.46  1.0     1184    0       0.0     0.79    1.0     936     7       1.0     0.21    248     0       52      1.0     7       196     1       0       True    True    7       7       1.0     True    112     69      6       0       0       1       0.0     0       0.0     0       1.0     0.79    1.0     936     7       1.0     0       52      196     13.67   0       196     52      1       0       0.21    248     2       0       0
mikado.Chr5G6.2     mikado.Chr5G6   17.1    0       0       1.0     1735    0       0       0.0     0       0       0       0.0     1.0     0       0       0       0.56    9       0       0       0       False   False   0       0       0.62    False   406     84      0       0       0       0       1.0     0.89    0.0     0       0.0     0.0     0       0       0       0.0     0       0       0       0       0       0       0       0       0       1.0     0       0       0       8
mikado.Chr5G6.1     mikado.Chr5G6   17.9    0       0       1.0     2197    0       0       0.0     0       0       0       0.0     1.0     0       0       0       0.69    11      0       0       0       False   False   0       0       0.77    False   406     87      3       0       0       0       0.9     1.0     0.0     0       0.0     0.0     0       0       0       0.0     0       0       0       0       0       0       0       0       0       1.0     0       0       0       9
mikado.Chr5G6.3     mikado.Chr5G6   13.0    0       0       1.0     286     0       0       0.0     0       0       0       0.0     1.0     0       0       0       0.12    2       0       0       0       False   False   0       0       0.08    False   782     782     1       0       0       0       0.0     0.0     0.0     0       0.0     0.0     0       0       0       0.0     0       0       0       0       0       0       0       0       0       1.0     0       0       0       0

As can be seen, metrics can assume values across a very wide range. We direct you to the metrics section of the documentation for further details.

Scoring files

This file contains the scores assigned to each metric for each transcript. Only metrics which have been used for the scoring will be present. This is the section of the scoring file corresponding to the above GFF3 file:

tid parent  score   blast_score     cdna_length     cds_not_maximal cds_not_maximal_fraction        combined_cds_fraction   combined_cds_intron_fraction    combined_cds_length     combined_cds_num        end_distance_from_junction      exon_fraction   exon_num        five_utr_length five_utr_num    highest_cds_exon_number intron_fraction number_internal_orfs    proportion_verified_introns     retained_fraction       retained_intron_num     selected_cds_fraction   selected_cds_intron_fraction    selected_cds_length     selected_cds_num        source_score    three_utr_length        three_utr_num
mikado.Chr5G1.2     mikado.Chr5G1   19.0    0.0     0.0     1       1       0.0     1       1       1       1       1       1       0.0     1.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     1.0
mikado.Chr5G1.1     mikado.Chr5G1   21.89   1.0     1.0     1       1       0.06    1       1       1       1       1       1       0.77    1.0     1       1       1.0     1       1       1       0.06    1       1       1       0       0.0     1.0
mikado.Chr5G2.2     mikado.Chr5G2   19.04   1       0.0     1       1       0.0     1       1       1       1       1       1       0.04    1.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     0.0
mikado.Chr5G2.1     mikado.Chr5G2   20.06   1       1.0     1       1       0.03    1       1       1       1       1       1       0.0     1.0     1       1       1.0     1       1       1       0.03    1       1       1       0       0.0     0.0
mikado.Chr5G3.1     mikado.Chr5G3   19.0    1.0     1.0     1       1       0.0     1.0     1.0     1.0     1       1.0     1.0     0.0     0.0     1.0     1.0     1.0     1       1       1       0.0     1.0     1.0     1.0     0       0.0     0.0
mikado.Chr5G3.2     mikado.Chr5G3   8.59    0.0     0.0     1       1       0.19    0.0     0.0     0.0     1       0.0     0.0     0.71    0.5     0.0     0.0     1.0     1       1       1       0.19    0.0     0.0     0.0     0       0.0     0.0
mikado.Chr5G4.1     mikado.Chr5G4   20.0    1       1       1       1       0.0     1       1       1       1       1       1       0.0     0.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     1.0
mikado.Chr5G5.1     mikado.Chr5G5   20.0    1       1       1       1       0.0     1       1       1       1       1       1       0.0     0.0     1       1       1.0     1       1       1       0.0     1       1       1       0       0.0     1.0
mikado.Chr5G6.3     mikado.Chr5G6   13.0    1       0.0     1       1       0.0     1       1       1       1       0.0     0.0     0.0     0.0     1       0.0     0.0     0.0     1       1       0.0     1       1       1       0       0.0     0.0
mikado.Chr5G6.2     mikado.Chr5G6   17.1    1       0.76    1       1       0.0     1       1       1       1       0.78    0.78    0.0     0.0     1       0.78    0.0     1.0     1       1       0.0     1       1       1       0       0.0     0.0
mikado.Chr5G6.1     mikado.Chr5G6   17.9    1       1.0     1       1       0.0     1       1       1       1       1.0     1.0     0.0     0.0     1       1.0     0.0     0.9     1       1       0.0     1       1       1       0       0.0     0.0

The final score value is obtained by summing all the individual metric scores.
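As an illustration, the reported score of mikado.Chr5G1.1 (21.89) can be recomputed by summing the metric score columns of its row in the table above; a minimal sketch:

```python
# Sketch: verify that a transcript's final score equals the sum of its
# per-metric scores, as reported in the Mikado scoring file.
# The row below is the mikado.Chr5G1.1 line from the example table above.

row = ("mikado.Chr5G1.1\tmikado.Chr5G1\t21.89\t1.0\t1.0\t1\t1\t0.06\t1\t1\t1\t1"
       "\t1\t1\t0.77\t1.0\t1\t1\t1.0\t1\t1\t1\t0.06\t1\t1\t1\t0\t0.0\t1.0")

fields = row.split("\t")
tid, score = fields[0], float(fields[2])
metric_scores = [float(value) for value in fields[3:]]

total = sum(metric_scores)
print(f"{tid}: reported {score}, recomputed {round(total, 2)}")
```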

Important

If you compare the scores assigned to transcripts at the loci level with those assigned at the subloci level, you will notice that the scores differ, and that even some of the raw metric values do. The former phenomenon is due to the fact that the Mikado scoring system is not absolute but relative; the latter, to the fact that some metrics are locus-dependent, i.e. their values change with the presence or absence of other transcripts. A typical example is given by the “retained_intron” metric: retained introns are identified by looking for non-coding regions of a transcript which fall inside the intron of another transcript. Changing the transcripts in the locus will change the value associated with this metric, as non-coding sections will or will not be classified as “retained introns”, and therefore also the score associated with both the metric and the transcript.
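The retained-intron logic described above can be sketched with simple interval arithmetic (an illustration only; Mikado's actual check also takes coding status and locus context into account):

```python
# Simplified sketch of the locus-dependent "retained intron" idea: a
# transcript segment counts as a putative retained intron if it falls
# entirely within an intron of another transcript in the same locus.
# Illustration only; Mikado's real check is more sophisticated.

def introns(exons):
    """Derive intron intervals from a sorted list of exon (start, end) pairs."""
    return [(exons[i][1] + 1, exons[i + 1][0] - 1) for i in range(len(exons) - 1)]

def has_retained_intron(segment, other_exons):
    """True if `segment` lies completely inside an intron of the other model."""
    return any(start <= segment[0] and segment[1] <= end
               for start, end in introns(other_exons))

# Two-exon companion transcript with an intron spanning 201-399.
companion = [(100, 200), (400, 500)]
print(has_retained_intron((250, 300), companion))  # segment inside the intron
print(has_retained_intron((150, 300), companion))  # segment crosses an exon
```

Removing the companion transcript from the locus would flip the first result, which is exactly why the metric value changes with the locus composition.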

Usage

mikado pick allows the user to modify some of the parameters of the run at runtime. However, some sections - such as most of the settings regarding alternative splicing - are left untouched by the utility, and are best modified by editing the configuration file itself. The available parameters are as follows:

  • json-conf: required. This is the configuration file created in the first step of the pipeline.
  • gff: optionally, it is possible to point Mikado pick to the GTF it should use here on the command line. This file should be the output of the preparation step. Please note that this file should be in GTF format, sorted by chromosome and position; if that is not the case, Mikado will fail.
  • db: Optionally, it is possible to specify the database to Mikado on the command line, rather than on the configuration file. Currently, this option supports SQLite databases only.
  • Options related to how Mikado will treat the data:
    • intron_range: this option expects a pair of positive integers, in ascending order, indicating the 98% CI within which most intron lengths should fall. Gene models with introns whose lengths fall outside of this range might be penalised, depending on the scoring system used. If uncertain, it is possible to use the included stats utility on the gene annotation of a closely related species.
    • purge: flag. If set, Mikado will not just identify putative fragments - it will completely exclude them from the output.
    • flank: for the purpose of identifying fragments, it is useful to consider together loci which are not necessarily overlapping but which lie relatively near each other on the genome sequence. This parameter (a positive integer) specifies the maximum distance Mikado will use when gathering loci together for this purpose.
    • mode: how Mikado will treat BLAST and ORF data in the presence of putative chimeras. See the relevant section in the configuration page for details.
  • Options regarding the output files:
    • output-dir: Output directory. By default, Mikado will write all files and the log in the current directory.
    • loci_out: required. This is the main output file, in GFF format.
    • prefix: this ID will be prefixed to all gene and transcript models. In general, IDs will be of the form “<prefix>.<chromosome><progressive ID>”. Default: Mikado.
    • source: source field prefix for the output files. Useful e.g. for loading Mikado runs into WebApollo [Apollo].
    • no_cds: if present, this flag will indicate to Mikado not to print out the CDS of selected models but only their transcript structures.
    • subloci_out: If requested, Mikado can output the data regarding the first intermediate step, ie the subloci. See the introduction for details.
    • monoloci_out: If requested, Mikado can output the data regarding the second intermediate step, ie the monosubloci. See the introduction for details.
  • Options regarding the resources to be used:
    • procs: number of processors to use.
    • start-method: multiprocessing start method. See the explanation on Python multiprocessing
    • single: flag. If present, multiprocessing will be disabled.
    • shared-memory: flag, available on Unix systems only. If set, Mikado will try to copy the SQLite database in RAM. It might provide a small boost if disk access is a limitation. Ineffective with databases other than SQLite.
    • shared-memory-db: if the database has already been copied in memory, its new location can be given with this argument. Useful to have multiple Mikado picks share the same DB.
    • preload: flag. If present, Mikado will load the database in memory before execution. Discouraged unless the database size is quite small.
  • Options regarding logging:
    • log: name of the log file. By default, “pick.log”
    • verbose: sets the log level to DEBUG. Please be advised that the debug mode is extremely verbose and is best invoked only for real, targeted debugging sessions.
    • noverbose: sets the log level to ERROR. If set, in most cases, the log file will be practically empty.
    • log-level: this flag directly sets the log level. Available values: DEBUG, INFO, WARNING, ERROR.

Usage:

$ mikado pick --help
usage: Mikado pick [-h] [--start-method {fork,spawn,forkserver}] [-p PROCS]
                   --json-conf JSON_CONF [--scoring-file SCORING_FILE]
                   [-i INTRON_RANGE INTRON_RANGE] [--pad]
                   [--subloci_out SUBLOCI_OUT] [--monoloci_out MONOLOCI_OUT]
                   [--loci_out LOCI_OUT] [--prefix PREFIX] [--no_cds]
                   [--source SOURCE] [--flank FLANK] [--purge]
                   [--subloci-from-cds-only] [--monoloci-from-simple-overlap]
                   [-db SQLITE_DB] [-od OUTPUT_DIR] [--single] [-l LOG]
                   [-v | -nv] [-lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                   [--mode {nosplit,stringent,lenient,permissive,split}]
                   [gff]

positional arguments:
  gff

optional arguments:
  -h, --help            show this help message and exit
  --start-method {fork,spawn,forkserver}
                        Multiprocessing start method. (default: None)
  -p PROCS, --procs PROCS
                        Number of processors to use. Default: look in the
                        configuration file (1 if undefined) (default: None)
  --json-conf JSON_CONF
                        JSON/YAML configuration file for Mikado. (default:
                        None)
  --scoring-file SCORING_FILE
                        Optional scoring file for the run. It will override
                        (default: None)
  -i INTRON_RANGE INTRON_RANGE, --intron-range INTRON_RANGE INTRON_RANGE
                        Range into which intron lengths should fall, as a
                        couple of integers. Transcripts with intron lengths
                        outside of this range will be penalised. Default: (60,
                        900) (default: None)
  --pad                 Whether to pad transcripts in loci. (default: False)
  --subloci_out SUBLOCI_OUT
  --monoloci_out MONOLOCI_OUT
  --loci_out LOCI_OUT   This output file is mandatory. If it is not specified
                        in the configuration file, it must be provided here.
                        (default: None)
  --prefix PREFIX       Prefix for the genes. Default: Mikado (default: None)
  --no_cds              Flag. If set, no CDS information will be printed out
                        in the GFF output files. (default: None)
  --source SOURCE       Source field to use for the output files. (default:
                        None)
  --flank FLANK         Flanking distance (in bps) to group non-overlapping
                        transcripts into a single superlocus. Default:
                        determined by the configuration file. (default: None)
  --purge               Flag. If set, the pipeline will suppress any loci
                        whose transcripts do not pass the requirements set in
                        the JSON file. (default: False)
  --subloci-from-cds-only
                        Flag. If set, Mikado will only look for overlap in
                        the coding features when clustering transcripts
                        (unless one transcript is non-coding, in which case
                        the whole transcript will be considered). Default:
                        False, Mikado will consider transcripts in their
                        entirety. (default: False)
  --monoloci-from-simple-overlap
                        Flag. If set, in the final stage Mikado will cluster
                        transcripts by simple overlap, not by looking at the
                        presence of shared introns. Default: False. (default:
                        False)
  -db SQLITE_DB, --sqlite-db SQLITE_DB
                        Location of an SQLite database to overwrite what is
                        specified in the configuration file. (default: None)
  -od OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory. Default: current working directory
                        (default: None)
  --single              Flag. If set, Creator will be launched with a single
                        process. Useful for debugging purposes only. (default:
                        False)
  --mode {nosplit,stringent,lenient,permissive,split}
                        Mode in which Mikado will treat transcripts with
                        multiple ORFs. - nosplit: keep the transcripts whole.
                        - stringent: split multi-orf transcripts if two
                        consecutive ORFs have both BLAST hits and none of
                        those hits is against the same target. - lenient:
                        split multi-orf transcripts as in stringent, and
                        additionally, also when either of the ORFs lacks a
                        BLAST hit (but not both). - permissive: like lenient,
                        but also split when both ORFs lack BLAST hits - split:
                        split multi-orf transcripts regardless of what BLAST
                        data is available. (default: None)

Log options:
  -l LOG, --log LOG     File to write the log to. Default: decided by the
                        configuration file. (default: None)
  -v, --verbose         Flag. If set, the debug mode will be activated.
                        (default: False)
  -nv, --noverbose      Flag. If set, the log will report only errors and
                        critical events. (default: False)
  -lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level. Default: retrieved by the configuration
                        file. (default: None)

Technical details

mikado pick uses a divide-and-conquer algorithm to find and analyse loci separately. As the data to be integrated with the transcripts is stored in the database rather than calculated on the fly, rerunning pick with different options takes little time and resources. To keep the data sorted, Mikado will write out temporary files during the operation and merge them at the end of the run (see the function merge_loci_gff in the picking module).

The Mikado pipeline is composed of four different stages, that have to be executed serially:

  1. Mikado configure, for creating the configuration file that will be used throughout the run.
  2. Mikado prepare, for collapsing the input assemblies into a single file. After this step, it is possible to perform additional analyses on the data such as TransDecoder (highly recommended), Portcullis, or BLAST.
  3. Mikado serialise, to gather all external data into a single database.
  4. Mikado pick, to perform the actual selection of the best transcripts in each locus.

Compare

Overview

This Mikado utility allows the user to compare the transcripts from any two annotations. Its output allows the user:
  • To understand which reference transcript each prediction is most similar to
  • To understand which prediction transcript best represent each reference model
  • To obtain summary information about the similarity between the two annotations.

Mikado compare has been directly inspired by the popular Cuffcompare [Cufflinks] utility and by ParsEval [ParsEval]. Please note that while superficially similar to Cuffcompare in the style of the output files, Mikado compare is more philosophically similar to ParsEval, as it will not try to aggregate transcripts in loci but will perform a pure comparison between the two annotation files. Both GTF and GFF files are accepted, in any combination.

Usage

Mikado compare is invoked by specifying the reference annotation and the desired mode of analysis. There are three possible options:

  1. In its default mode, compare will ask for a prediction annotation to compare the reference against.
  2. In the “self” mode, compare will perform a comparison of the reference against itself, excluding from the possible results the matches between a transcript and itself. This can be useful for gleaning the relationships between transcripts and genes in an annotation.
  3. In the “internal” mode of operations, compare will again perform a self-comparison, focussed on multi-isoform genes. For those, compare will perform and report all possible comparisons. It is useful to understand the relationships between the transcripts in a single locus.

Mikado stores the information of the reference in a specialised index, with a “.midx” suffix, which will be created by the program upon its first execution with a new reference. If the index file is already present, Mikado will try to use it rather than reading the annotation again.
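This index-reuse behaviour can be pictured as a build-once, load-afterwards cache. The sketch below is purely illustrative: the pickle serialisation and the build callback are assumptions made for the example, and the real “.midx” format is internal to Mikado:

```python
# Illustrative sketch of the ".midx" reuse pattern described above: build an
# index of the reference on first use, then load the saved copy on later runs.
# The pickle format here is an assumption for the example only; Mikado's
# actual .midx format is internal to the program.
import os
import pickle

def load_or_build_index(reference, build):
    """Load `reference + '.midx'` if present, otherwise build and save it."""
    index_file = reference + ".midx"
    if os.path.exists(index_file):
        with open(index_file, "rb") as handle:
            return pickle.load(handle)
    index = build(reference)
    with open(index_file, "wb") as handle:
        pickle.dump(index, handle)
    return index
```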

Command line
$ mikado compare --help
usage: Mikado compare [-h] -r REFERENCE
                      (-p PREDICTION | --self | --internal | --index)
                      [--distance DISTANCE] [-pc] [-o OUT] [--lenient] [-eu]
                      [-n] [-l LOG] [-v]

optional arguments:
  -h, --help            show this help message and exit
  --distance DISTANCE   Maximum distance for a transcript to be considered a
                        polymerase run-on. Default: 2000
  -pc, --protein-coding
                        Flag. If set, only transcripts with a CDS (both in
                        reference and prediction) will be considered.
  -o OUT, --out OUT     Prefix for the output files. Default: mikado_compare
  --lenient             If set, exonic statistics will be calculated leniently
                        in the TMAP as well - ie they will consider an exon as
                        match even if only the internal junction has been
                        recovered.
  -eu, --exclude-utr    Flag. If set, reference and prediction transcripts
                        will be stripped of their UTRs (if they are coding).
  -n, --no-index, --no-save-index
                        Unless this flag is set, compare will save an index of
                        the reference to quicken multiple calls.
  -l LOG, --log LOG
  -v, --verbose

Prediction and annotation files:
  -r REFERENCE, --reference REFERENCE
                        Reference annotation file. By default, an index will
                        be created and saved with the suffix ".midx".
  -p PREDICTION, --prediction PREDICTION
                        Prediction annotation file.
  --self                Flag. If set, the reference will be compared with
                        itself. Useful for understanding how the reference
                        transcripts interact with each other.
  --internal            Flag. If set, for each gene with more than one
                        transcript isoform each will be compared to the
                        others. Useful for understanding the structural
                        relationships between the transcripts in each gene.
  --index               Flag. If set, compare will stop after having generated
                        the GFF index for the reference.
Output files

Mikado compare produces two tabular files, tmap and refmap, and one statistics file.

TMAP files

TMAP files are tabular files that store the information regarding the best match for each prediction in the reference. The columns are as follows:

  1. ref_id: Transcript ID of the matched reference model(s).
  2. ref_gene: Gene ID of the matched reference model(s).
  3. ccode: class code of the match. See the relevant section on Class codes.
  4. tid: Transcript ID of the prediction model.
  5. gid: Gene ID of the prediction model.
  6. tid_num_exons: Number of exons of the prediction model.
  7. ref_num_exons: Number of exons of the reference model.
  8. n_prec: Nucleotide precision of the prediction ( TP / (length of the prediction))
  9. n_recall: Nucleotide recall of the reference (TP / (length of the reference))
  10. n_f1: F1 of recall and precision at the nucleotide level.
  11. j_prec: Splice junction precision of the prediction model ( TP / (number of splice sites in the prediction))
  12. j_recall: Splice junction recall of the reference model ( TP / (number of splice sites in the reference))
  13. j_f1: F1 of recall and precision at the splice junction level.
  14. e_prec: Exon precision of the prediction model ( TP / (number of exons in the prediction)). NB: this value is calculated “leniently”, ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
  15. e_recall: Exon recall of the reference model ( TP / (number of exons in the reference))
  16. e_f1: F1 of recall and precision at the exon level.
  17. distance: Distance of the model from its putative match.
  18. location: location of the match, with the format <chromosome>:<start>..<end>
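The three F1 columns (n_f1, j_f1 and e_f1) are simply the harmonic mean of the corresponding recall and precision values:

```python
# The n_f1, j_f1 and e_f1 columns of the TMAP file are the harmonic mean
# (F1) of the corresponding precision and recall, expressed as percentages.

def f1(recall, precision):
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# n_recall 81.31 and n_prec 91.30 yield an n_f1 of 86.02.
print(round(f1(81.31, 91.30), 2))
```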

An example of a TMAP file is as follows:

ref_id      ref_gene        ccode   tid     gid     tid_num_exons   ref_num_exons   n_prec  n_recall        n_f1    j_prec  j_recall        j_f1    e_prec  e_recall        e_f1    distance        location
AT5G66600.2 AT5G66600       =       cuff_cufflinks_star_at.23553.1  cuff_cufflinks_star_at.23553.1.gene     9       9       91.30   81.31   86.02   100.00  100.00  100.00  77.78   77.78   77.78   0       Chr5:26575000..26578163
AT5G66600.2 AT5G66600       C       cl_Chr5.6272    cl_Chr5.6272.gene       7       9       94.95   72.43   82.18   100.00  75.00   85.71   85.71   66.67   75.00   0       Chr5:26575000..26578087
AT5G66620.1,AT5G66630.1,AT5G66631.1 AT5G66620,AT5G66630,AT5G66631   f,j,j,G st_Stringtie_STAR.21710.15      st_Stringtie_STAR.21710.15.gene 8       11,10,1 19.13,19.95,35.98       54.57,45.65,100.00      28.33,27.76,52.92       28.57,64.29,0.00        20.00,50.00,0.00        23.53,56.25,0.00        12.50,37.50,0.00        9.09,30.00,0.00 10.53,33.33,0.00        0       Chr5:26588402..26598231

Notice that the third example is a special case: the prediction transcript matches not one but multiple reference transcripts. This is a fusion event.
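When parsing TMAP files programmatically, the comma-separated fields of a fusion row have to be split in parallel, with the leading “f” qualifier stripped from the class codes; a minimal sketch:

```python
# Sketch: split the comma-separated fields of a fusion TMAP row so that each
# matched reference transcript gets its own record. Column semantics follow
# the description above.

def split_fusion(ref_id, ref_gene, ccode):
    refs = ref_id.split(",")
    genes = ref_gene.split(",")
    codes = ccode.split(",")
    # For fusions, the ccode is "f" followed by one code per reference match.
    if codes[0] == "f":
        codes = codes[1:]
    return list(zip(refs, genes, codes))

# Third TMAP example above: one prediction hitting three reference models.
matches = split_fusion("AT5G66620.1,AT5G66630.1,AT5G66631.1",
                       "AT5G66620,AT5G66630,AT5G66631",
                       "f,j,j,G")
for ref, gene, code in matches:
    print(ref, gene, code)
```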

RefMap files

RefMap files are tabular files which store the information regarding the best match for each reference transcript, among all possible prediction models. The columns of the file are as follows:

  1. ref_id: Transcript ID of the reference model.
  2. ccode: class code of the match. See the relevant section on Class codes.
  3. tid: Transcript ID of the prediction model.
  4. gid: Gene ID of the prediction model.
  5. nF1: F1 of recall and precision at the nucleotide level.
  6. jF1: F1 of recall and precision at the splice junction level.
  7. eF1: F1 of recall and precision at the exon level. NB: this value is calculated “leniently”, ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
  8. ref_gene: Gene ID of the reference model.
  9. best_ccode: Best possible class code found for any of the transcripts of the gene.
  10. best_tid: Transcript ID of the prediction model which best fits one of the transcript models of the reference gene.
  11. best_gid: Gene ID of the prediction model which best fits one of the transcript models of the reference gene.
  12. best_nF1: F1 of recall and precision at the nucleotide level, for the best possible comparison.
  13. best_jF1: F1 of recall and precision at the splice junction level, for the best possible comparison.
  14. best_eF1: F1 of recall and precision at the exon level, for the best possible comparison.
  15. location: location of the match, with the format <chromosome>:<start>..<end>

An example of a RefMap file is as follows:

ref_id      ccode   tid     gid     nF1     jF1     eF1     ref_gene        best_ccode      best_tid        best_gid        best_nF1        best_jF1        best_eF1    location
AT5G66610.1 =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   AT5G66610       =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   Chr5:26584780..26587912
AT5G66610.2 J       mikado.Chr5G4.2 mikado.Chr5G4   93.91   94.74   76.19   AT5G66610       =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   Chr5:26584774..26587912
AT5G66620.1 j       mikado.Chr5G6.1 mikado.Chr5G6   85.51   95.0    72.73   AT5G66620       j       mikado.Chr5G6.1 mikado.Chr5G6   85.51   95.0    72.73   Chr5:26588402..26592423
AT5G66630.1 n       mikado.Chr5G8.2 mikado.Chr5G8   93.27   94.74   76.19   AT5G66630       n       mikado.Chr5G8.2 mikado.Chr5G8   93.27   94.74   76.19   Chr5:26591981..26595922

Please note that the third example (AT5G66630.1) has as best possible match a fusion event.

Stats files

These files provide a summary of the comparison between the reference and the prediction. An example is as follows:

Command line:
/usr/users/ga002/venturil/py351/bin/mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log
7 reference RNAs in 5 genes
15 predicted RNAs in  8 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 85.74  64.73  73.77
            Exon level (stringent): 63.83  42.86  51.28
              Exon level (lenient): 80.00  52.94  63.72
                      Intron level: 89.47  59.65  71.58
                Intron chain level: 33.33  14.29  20.00
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 28.57  13.33  18.18
  Transcript level (>=80% base F1): 42.86  20.00  27.27
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 40.00  25.00  30.77
        Gene level (>=80% base F1): 60.00  37.50  46.15

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 2
             Matched intron chains: 2
   Matching monoexonic transcripts: 1
    Matched monoexonic transcripts: 1
        Total matching transcripts: 3
         Total matched transcripts: 3

          Missed exons (stringent): 17/47  (36.17%)
           Novel exons (stringent): 40/70  (57.14%)
            Missed exons (lenient): 9/45  (20.00%)
             Novel exons (lenient): 32/68  (47.06%)
                    Missed introns: 4/38  (10.53%)
                     Novel introns: 23/57  (40.35%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 6/15  (40.00%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 2/8  (25.00%)

The first section of the file describes:

  1. Concordance of the two annotations at the base level (recall, precision, and F1)
  2. Concordance of the two annotations at the exonic level (recall, precision, and F1), in two ways:
    • “stringent”: only perfect exonic matches are considered.
    • “lenient”: in this mode, terminal exons are counted as a match if the internal border is matched. See the RGASP paper [RGASP] for details on the rationale.
  3. Concordance of the two annotations at the intron level.
  4. Concordance of the two annotations at the intron chain level - how many intron chains of the reference are found identical in the prediction. Only multiexonic models are considered for this level.
  5. Concordance of the two annotations at the transcript level, in three different modes:
    • “stringent”: in this mode, only perfect matches are considered.
    • “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
    • “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
  6. Concordance of the two annotations at the gene level, in three different modes:
    • “stringent”: in this mode, we consider reference genes for which it was possible to find at least one perfect match for one of its transcripts.
    • “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
    • “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.

In the second section, the file reports how many of the intron chains, monoexonic transcripts and total transcripts in the reference were matched by at least one prediction transcript. Finally, in the third section the file reports the number of missed (present in the reference but not in the prediction) or novel (vice versa: present in the prediction but not in the reference) features.
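The Sn/Pr/F1 triplets of the first section and the missed/novel counts of the third are two views of the same numbers: sensitivity is computed over the reference features, precision over the prediction features. For example, at the intron level of the run above:

```python
# Sketch: how the Sn / Pr / F1 triplets in the stats file relate to the
# missed/novel counts. Intron-level numbers are taken from the example run
# above: 4 missed introns out of 38 in the reference, 23 novel introns out
# of 57 in the prediction, hence 34 true positives.

def sn_pr_f1(tp, fn, fp):
    sn = 100 * tp / (tp + fn)   # sensitivity (recall) over the reference
    pr = 100 * tp / (tp + fp)   # precision over the prediction
    f1 = 2 * sn * pr / (sn + pr) if (sn + pr) else 0.0
    return round(sn, 2), round(pr, 2), round(f1, 2)

# Reproduces the "Intron level: 89.47  59.65  71.58" line of the stats file.
print(sn_pr_f1(tp=34, fn=4, fp=23))
```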

Note

Please note that a gene might be considered as “found” even if its best match is intronic, on the opposite strand, or not directly overlapping it (see next section, in particular the Intronic, Fragment and No overlap categories).

Class codes

In addition to recall, precision and F1 values, Mikado assigns each comparison between two transcripts a class code, which summarises the relationship between the two transcripts. The idea is lifted from the popular tool Cuffcompare, although Mikado greatly extends the catalogue of possible class codes. All class codes fall within one of the following categories:

  • Match: class codes of this type indicate concordance between the two transcript models.
  • Extension: class codes of this type indicate that one of the two models extends the intron chain of the other, without internal interruptions. The extension can be from either perspective - either the prediction extends the reference, or it is instead contained within the reference (so that switching perspectives, the reference would “extend” the prediction).
  • Alternative splicing: the two exon chains overlap but differ in significant ways.
  • Intronic: either the prediction is completely contained within the introns of the reference, or vice versa.
  • Overlap: the two transcript models generically overlap on their exonic sequence.
  • Fragment: the prediction is a fragment of the reference, in most cases because they are on opposite strands.
  • No overlap: the prediction and the reference are near but do not directly overlap.
  • Fusion: this special class code is a qualifier and it never appears on its own. When a transcript is defined as a fusion, its class code in the tmap file will be an “f” followed by the class codes of the individual transcript matches, separated by commas. So a prediction which matches two reference models, one with a “j” and another with an “o”, will have a class code of “f,j,o”. In the refmap file, if the fusion is the best match, the class code will be “f” followed by the class code for the individual reference transcript; e.g., “f,j”

Available class codes

Class code | Definition | Reference multiexonic? | Prediction multiexonic? | Nucleotide: RC, PC, F1 | Junction: RC, PC, F1 | Reverse | Category
= | Complete intron chain match. | True | True | NA | 100%, 100%, 100% | = | Match
_ | Complete match between two monoexonic transcripts. | False | False | NA, NA, >=80% | NA | _ | Match
n | Intron chain extension, i.e. both transcripts are multiexonic and the prediction has novel splice sites outside of the reference transcript boundaries. | True | True | 100%, < 100%, < 100% | 100%, < 100%, < 100% | c | Extension
J | Intron chain extension, i.e. both transcripts are multiexonic and the prediction has novel splice sites inside of the reference transcript boundaries. | True | True | 100%, <= 100%, < 100% | 100%, < 100%, < 100% | C | Extension
c | The prediction is either multiexonic and with its intron chain completely contained within that of the reference, or monoexonic and contained within one of the reference exons. | NA | NA | < 100%, 100%, NA | < 100%, 100%, NA | n | Extension
C | The prediction intron chain is completely contained within that of the reference transcript, but it partially extends into the reference introns or beyond the reference boundaries. | True | True | <= 100%, < 100%, < 100% | < 100%, 100%, < 100% | J or j | Extension
j | Alternative splicing event. | True | True | NA | <= 100%, 100%, < 100% | j or C | Alternative splicing
h | Structural match between two models where no splice site is conserved but at least one intron of the reference and one intron of the prediction partially overlap. | True | True | > 0%, > 0%, > 0% | 0%, 0%, 0% | h | Alternative splicing
g | The monoexonic prediction overlaps one or more exons of the reference transcript; the borders of the prediction cannot fall inside the introns of the reference. The prediction transcript can bridge multiple exons of the reference model. | True | False | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | G | Alternative splicing
G | Generic match of a multiexonic prediction transcript versus a monoexonic reference. | False | True | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | g | Alternative splicing
o | Generic overlap between two multiexonic transcripts, which do not share any overlap among their introns. | True | True | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | o | Overlap
e | Single exon transcript overlapping one reference exon and at least 10 bps of a reference intron, indicating a possible pre-mRNA fragment. | True | False | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | G | Overlap
m | Generic match between two monoexonic transcripts. | False | False | NA, NA, < 80% | NA | m | Overlap
i | Monoexonic prediction completely contained within one intron of the reference transcript. | True | False | 0%, 0%, 0% | 0%, 0%, 0% | ri | Intronic
I | Prediction completely contained within the introns of the reference transcript. | True | True | 0%, 0%, 0% | 0%, 0%, 0% | rI | Intronic
ri | Reverse intron transcript - the monoexonic reference is completely contained within one intron of the prediction transcript. | False | True | 0%, 0%, 0% | 0%, 0%, 0% | i | Intronic
rI | Multiexonic reference completely contained within the introns of the prediction transcript. | True | True | 0%, 0%, 0% | 0%, 0%, 0% | I | Intronic
f | Fusion - this special code is applied when a prediction intersects more than one reference transcript. To be considered for fusions, candidate references must either share at least one splice junction with the prediction, or have at least 10% of their bases recalled. If two or more reference transcripts fit these constraints, then the prediction model is classified as a fusion. | NA | NA | > 10%, 0%, 0% | > 0%, 0%, 0% | NA | Fusion
x | Monoexonic match on the opposite strand. | NA | False | > 0%, > 0%, > 0% | 0%, 0%, 0% | x or X | Fragment
X | Multiexonic match on the opposite strand. | NA | True | > 0%, > 0%, > 0% | NA | x or X | Fragment
p | The prediction is on the same strand of a neighbouring but non-overlapping transcript. Probable polymerase run-on. | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | p | Fragment
P | The prediction is on the opposite strand of a neighbouring but non-overlapping transcript. Probable polymerase run-on. | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | P | Fragment
u | Unknown - no suitable model has been found near enough the prediction to perform a comparison. | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | NA | Unknown
Technical details

Mikado compare conceptualizes the reference annotation as a collection of interval trees, one per chromosome or scaffold, where each node corresponds to an array of genes at that location. The gene and transcript objects are stored separately. The location of each transcript model in the prediction is queried against the tree, with a padding (default 2 kbps) to allow for neighbouring but non-overlapping genes, and the transcript itself is subsequently compared with each reference transcript contained in the hits. Each comparison yields precision, recall and F1 values at the nucleotide, splice junction and exonic levels, together with an associated class code. The best match for the prediction is selected by choosing the comparison yielding the best splice junction F1 and the best nucleotide F1, in this order. If the prediction transcript overlaps two or more genes on the same strand, and for at least two of them it has one match each with either 10% nucleotide recall or junction recall over 0%, it is deemed a fusion event, and its line in the tmap file will report the best match against each of the fused genes, separated by commas.
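The two-level selection criterion described above (best junction F1 first, nucleotide F1 as a tie-breaker) can be sketched as follows; ComparisonResult is an illustrative stand-in for Mikado's internal result objects:

```python
from collections import namedtuple

# Illustrative stand-in for a single prediction-vs-reference comparison.
ComparisonResult = namedtuple("ComparisonResult", ["ref_id", "j_f1", "n_f1"])

def best_match(comparisons):
    """Pick the comparison with the highest junction F1,
    breaking ties with the nucleotide F1."""
    return max(comparisons, key=lambda c: (c.j_f1, c.n_f1))

hits = [
    ComparisonResult("ref1", j_f1=80.0, n_f1=95.0),
    ComparisonResult("ref2", j_f1=90.0, n_f1=70.0),
    ComparisonResult("ref3", j_f1=90.0, n_f1=85.0),
]
# ref2 and ref3 tie on junction F1; ref3 wins on nucleotide F1.
assert best_match(hits).ref_id == "ref3"
```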

Each calculated match against a reference transcript is stored as a potential best match for the reference transcript. At the end of the run, the hits for each reference transcript will be ordered using the following function:

@staticmethod
def result_sorter(result):

    """
    Method to sort the results for the refmap. Order:
    - CCode does not contain "x", "X", "P" or "p" (i.e. matches on the
    opposite strand or polymerase run-on fragments)
    - Junction F1 (j_f1)
    - Exonic F1 (e_f1)
    - Nucleotide F1 (n_f1)
    - "f" in ccode (i.e. transcript is a fusion)

    :param result: a ResultStorer object
    :type result: ResultStorer
    :return: (bool, float, float, float, bool)
    """

    bad_ccodes = ["x", "X", "P", "p"]
    bad_ccodes = set(bad_ccodes)

    orderer = (len(set.intersection(bad_ccodes, set(result.ccode))) == 0,
               result.j_f1, result.e_f1,
               result.n_f1,
               "f" in result.ccode)

    return orderer
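To illustrate, the sorter can be exercised as a free function with a minimal stand-in for ResultStorer (hypothetical; the real class carries many more fields):

```python
from collections import namedtuple

# Minimal, hypothetical stand-in for Mikado's ResultStorer.
Result = namedtuple("Result", ["ccode", "j_f1", "e_f1", "n_f1"])

bad_ccodes = {"x", "X", "P", "p"}

def result_sorter(result):
    # Same ordering as above: penalise opposite-strand/run-on codes first,
    # then compare junction, exonic and nucleotide F1.
    return (len(bad_ccodes & set(result.ccode)) == 0,
            result.j_f1, result.e_f1,
            result.n_f1,
            "f" in result.ccode)

results = [Result(("x",), 0.0, 50.0, 60.0),
           Result(("j",), 80.0, 85.0, 90.0),
           Result(("=",), 100.0, 100.0, 100.0)]

# Sort best-first: the complete match wins, the opposite-strand hit comes last.
ranked = sorted(results, key=result_sorter, reverse=True)
assert [r.ccode for r in ranked] == [("=",), ("j",), ("x",)]
```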

This function is used to select both for the best match for the transcript, as well as to select among these matches for the best match for the gene.

The interval tree data structure is created using Cython code originally part of the bx-python, kindly provided by Dr. Taylor for modification and inclusion in Mikado. The code has been slightly modified for making it Python3 compliant.

The .midx files storing the annotation for repeated compare runs are gzip-compressed JSON files. In them, Mikado stores for each gene its coordinates, its transcripts, and the location of exons and CDS features. MIDX files make repeated runs considerably faster, as the program does not have to re-parse the GFF file.
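Since .midx files are plain gzip-compressed JSON, they can be inspected with the standard library alone. Note that the key layout in this round-trip demonstration is an assumption for illustration, not Mikado's documented schema:

```python
import gzip
import json

def load_midx(path):
    """Load a gzip-compressed JSON .midx index into a dictionary."""
    with gzip.open(path, "rt") as handle:
        return json.load(handle)

# Round-trip demonstration with a made-up, minimal gene record
# (the real key names used by Mikado may differ).
record = {"gene1": {"chrom": "Chr1", "start": 1000, "end": 5000,
                    "transcripts": {"gene1.1": {"exons": [[1000, 2000],
                                                          [3000, 5000]]}}}}
with gzip.open("example.midx", "wt") as handle:
    json.dump(record, handle)

assert load_midx("example.midx")["gene1"]["chrom"] == "Chr1"
```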

The comparison code is written in Cython and is crucial during the picking phase of Mikado, not just for the functioning of the comparison utility.


Daijin
The Daijin pipeline for driving Mikado

snake_badge

No emperor or empress can lead their nation without a trusty chancellor to help them organise the bureaucracy. Daijin, the Japanese minister, has the same role in Mikado - it smooths the path from a collection of read inputs (both RNA-Seq and long reads) to a polished transcriptome assembly. The pipeline is based on Snakemake [Snake]; while Snakemake can support any scheduling system, our pipeline manager currently supports only three (SLURM, PBS and LSF), plus any DRMAA-compliant batch submission system. Other schedulers can be added upon request.

This page contains a detailed explanation of how to use Daijin. We also provide a tutorial that will guide you through using the manager to analyse an RNA-Seq sample of Sc. pombe.

Hint

It is possible to launch the two steps of the pipeline directly with Snakemake, using the snakefiles located in Mikado.daijin: tr.snakefile for the first step, and mikado.snakefile for the second.

Configuring Daijin

daijin configure creates the configuration file that will drive Daijin, in YAML format. Most options can be specified by command line. Available parameters for the command line are:

  • out: Output file, i.e. the configuration file for Daijin.
  • od: Directory that Daijin will use as master for its jobs.
  • genome: Genome FASTA file. Required.
  • transcriptome: Optional reference transcriptome, to be used for alignment and assembly. Please note that Mikado will not use the reference annotation, it will only be used during the alignment and assembly steps.
  • name: Name to be applied to the genome, e.g. the species name.

Warning

The name must not contain spaces or any non-standard characters. This string will be used, among other things, to determine the name of the index used by the aligners; invalid characters will cause the aligners - and in cascade Daijin - to fail.

  • prot-db: this parameter specifies a protein database to be used for BLAST. If none is specified, this step will be omitted.
  • aligners: aligner(s) to be used during the run. Currently, Daijin supports the following aligners:
    • gsnap
    • star
    • tophat2
    • hisat2
  • assemblers: assembler(s) to be used during the run. Currently, Daijin supports the following RNA-Seq assemblers:
    • cufflinks
    • class2
    • stringtie
    • trinity
  • threads: Number of threads to be requested for parallelisable steps.
  • modes: Daijin can run Mikado in multiple modes regarding the handling of putative chimeras. Specify those you desire to execute here.
  • flank: Amount of flanking that Mikado will allow while looking for fragments around good gene loci. Default 1000 bps. It is advisable to reduce it in species with compact genomes.
  • scheduler: if Daijin has to execute the pipeline on a cluster (potentially using DRMAA), it is necessary to specify the scheduler here. At the moment we support the following widely used schedulers: PBS, LSF, SLURM.
  • r1, r2, samples: RNA-Seq reads 1, RNA-Seq reads 2, and sample name. At least one of each is required.
  • strandedness: if not specified, all samples will be assumed to be unstranded. Specify it as you would with HISAT or TopHat2.
  • cluster_config: if specified, a sample cluster configuration file will be copied in the specified location. A cluster configuration file has the following structure:
__default__:
    threads: 4
    memory: 10000
    queue: tgac-medium
asm_trinitygg:
    memory: 20000

As can be seen, it is a YAML file with two fields: __default__ and asm_trinitygg. The former specifies common parameters for all the steps: the queue to be used, the number of threads to request, and the memory per job. The “asm_trinitygg” field specifies particular parameters to be applied to the asm_trinitygg rule - in this case, increasing the requested memory (Trinity is a de novo assembler and uses much more memory than most reference-guided assemblers). Please note that the number of threads specified on the command line, or in the configuration file proper, takes precedence over the equivalent parameter in the cluster configuration file.
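The per-rule resolution behaves like a simple dictionary merge, with rule-specific values overriding __default__; a sketch of that lookup:

```python
# Sketch of how per-rule resources resolve against the __default__ section.
cluster_config = {
    "__default__": {"threads": 4, "memory": 10000, "queue": "tgac-medium"},
    "asm_trinitygg": {"memory": 20000},
}

def resources_for(rule):
    """Start from __default__ and overlay any rule-specific settings."""
    resolved = dict(cluster_config["__default__"])
    resolved.update(cluster_config.get(rule, {}))
    return resolved

# Trinity inherits the queue and thread count, but gets more memory.
assert resources_for("asm_trinitygg") == {"threads": 4, "memory": 20000,
                                          "queue": "tgac-medium"}
# Rules without an entry fall back entirely to the defaults.
assert resources_for("hisat_align")["memory"] == 10000
```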

$ daijin configure --help
usage: daijin configure [-h] [-c CLUSTER_CONFIG] [--threads N] [-od OUT_DIR]
                        [-o OUT] [--scheduler {,SLURM,LSF,PBS}] [--name NAME]
                        --genome GENOME [--transcriptome TRANSCRIPTOME]
                        [-r1 R1 [R1 ...]] [-r2 R2 [R2 ...]]
                        [-s SAMPLES [SAMPLES ...]]
                        [-st {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...]]
                        -al
                        [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]]
                        -as
                        [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]]
                        [--scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml,spombe.yaml}]
                        [--copy-scoring COPY_SCORING]
                        [-m {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...]]
                        [--flank FLANK] [--prot-db PROT_DB [PROT_DB ...]]


optional arguments:
  -h, --help            show this help message and exit
  -al [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]], --aligners [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]]
                        Aligner(s) to use for the analysis. Choices: gsnap,
                        star, hisat, tophat2
  -as [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]], --assemblers [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]]
                        Assembler(s) to use for the analysis. Choices: class,
                        cufflinks, stringtie, trinity

Options related to how to run Daijin - threads, cluster configuration, etc.:
  -c CLUSTER_CONFIG, --cluster_config CLUSTER_CONFIG
                        Cluster configuration file to write to.
  --threads N, -t N     Maximum number of threads per job. Default: 4
  -od OUT_DIR, --out-dir OUT_DIR
                        Output directory. Default if unspecified: chosen name.
  -o OUT, --out OUT     Output file. If the file name ends in "json", the file
                        will be in JSON format; otherwise, Daijin will print
                        out a YAML file. Default: STDOUT.
  --scheduler {,SLURM,LSF,PBS}
                        Scheduler to use. Default: None - ie, either execute
                        everything on the local machine or use DRMAA to submit
                        and control jobs (recommended).

Arguments related to the reference species.:
  --name NAME           Name of the species under analysis.
  --genome GENOME, -g GENOME
                        Reference genome for the analysis, in FASTA format.
                        Required.
  --transcriptome TRANSCRIPTOME
                        Reference annotation, in GFF3 or GTF format.

Arguments related to the input paired reads.:
  -r1 R1 [R1 ...], --left_reads R1 [R1 ...]
                        Left reads for the analysis. Required.
  -r2 R2 [R2 ...], --right_reads R2 [R2 ...]
                        Right reads for the analysis. Required.
  -s SAMPLES [SAMPLES ...], --samples SAMPLES [SAMPLES ...]
                        Sample names for the analysis. Required.
  -st {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...], --strandedness {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...]
                        Strandedness of the reads. Specify it 0, 1, or number
                        of samples times. Choices: fr-unstranded, fr-
                        secondstrand, fr-firststrand.

Options related to the Mikado phase of the pipeline.:
  --scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml}
                        Available scoring files.
  --copy-scoring COPY_SCORING
                        File into which to copy the selected scoring file, for
                        modification.
  -m {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...], --modes {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...]
                        Mikado pick modes to run. Choices: nosplit, split,
                        permissive, stringent, lenient
  --flank FLANK         Amount of flanking for grouping transcripts in
                        superloci during the pick phase of Mikado.
  --prot-db PROT_DB [PROT_DB ...]
                        Protein database to compare against, for Mikado.

Warning

If DRMAA is requested but no scheduler is specified, Daijin will fail. For this reason, a scheduler name should always be provided when running on a cluster; if none is given, Daijin will fall back to local execution.

Tweaking the configuration file

Daijin can be configured to run each assembler and/or aligner multiple times, with a different set of parameters each time. This is achieved by specifying the additional command line arguments in the array for each program. For example, in this section:

align_methods:
  hisat:
  - ''
  - "-k 10"
asm_methods:
  class:
  - ''
  - "--set-cover"
  stringtie:
  - ''
  - "-T 0.1"

we are asking Daijin to execute each program twice:

  • HISAT2: once with default parameters, once reporting the 10 best alignments for each pair
  • CLASS2: once with default parameters, once using the “set-cover” algorithm
  • Stringtie: once with default parameters, once lowering the minimum expression to 0.1 TPM.
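Conceptually, each (tool, parameter string) pair above becomes an independent run, and every alignment is then fed to every assembler run. The expansion below is an illustrative sketch; the "tool-index" labels are hypothetical, not Daijin's actual naming scheme:

```python
# Illustrative expansion of the configuration above into independent runs;
# the "tool-index" labels are hypothetical, not Daijin's actual naming.
align_methods = {"hisat": ["", "-k 10"]}
asm_methods = {"class": ["", "--set-cover"], "stringtie": ["", "-T 0.1"]}

def expand_runs(methods):
    """Yield one (label, extra_args) pair per tool/parameter combination."""
    return [(f"{tool}-{index}", params)
            for tool, param_list in methods.items()
            for index, params in enumerate(param_list)]

alignments = expand_runs(align_methods)
assemblies = expand_runs(asm_methods)
assert len(alignments) == 2 and len(assemblies) == 4
# Every alignment is assembled by every assembler run: 2 x 4 = 8 assemblies.
assert len(alignments) * len(assemblies) == 8
```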

If you are running Daijin on a cluster and the software tools have to be sourced or loaded, you can specify the versions and command to load in the configuration file itself. Edit the file at this section:

load:
  #  Commands to use to load/select the versions of the programs to use. Leave an empty
  #  string if no loading is necessary.
  blast: 'source blast-2.3.0'
  class: 'source class-2.12'
  cufflinks: ''
  gmap: ''
  hisat: 'source HISAT-2.0.4'
  mikado: 'source mikado-devel'
  portcullis: 'source portcullis-0.17.2'
  samtools: 'source samtools-1.2'
  star: ''
  stringtie: 'source stringtie-1.2.4'
  tophat: ''
  transdecoder: 'source transdecoder-3.0.0'
  trinity: ''

In this case, the cluster requires software to be loaded on demand using the source command, and we specify the versions of the programs we wish to use.

Regarding read alignment, it is also possible to specify the minimum and maximum permissible intron lengths:

short_reads:
 #  Parameters related to the reads to use for the assemblies. Fields:
 #  - r1: array of left read files.
 #  - r2: array of right read files. It must be of the same length as r1; if
 #    one or more of the samples are single-end reads, add an empty string.
 #  - samples: array of the sample names. It must be of the same length as r1.
 #  - strandedness: array of strand-specificity of the samples. It must be of the
 #    same length as r1. Valid values: fr-firststrand, fr-secondstrand, fr-unstranded.
 max_intron: 10000
 min_intron: 20
 r1:
 - SRR1617247_1.fastq.gz
 r2:
 - SRR1617247_2.fastq.gz
 samples:
 - SRR1617247
 strandedness:
 - fr-secondstrand

Regarding the Mikado stage of Daijin, the configuration file contains all the fields that can be found in a normal Mikado configuration file. All Mikado-specific parameters are stored under the “mikado” field. It is also possible to modify the following:

  • blastx: these are parameters regarding the running of BLASTX. This field contains the following variables:
    • prot-db: this is an array of FASTA files to use as protein databases for the BLASTX step.
    • chunks: number of chunks to divide the BLASTX input into. When dealing with big input FASTA files and with a cluster at one's disposal, it is more efficient to chunk the input FASTA into multiple smaller files and execute BLASTX on them independently. The default number of chunks is 10. Increase it as you see fit - it often makes sense, especially on clusters, to launch a multitude of small jobs rather than a small number of big jobs.
    • evalue: maximum e-value for the searches.
    • max_target_seqs: maximum number of hits to report.
  • transdecoder: parameters related to transdecoder. At the moment only one is available:
    • min_protein_len: minimum protein length that TransDecoder should report. The default value set by Mikado, 30, is much lower than TransDecoder's own default (100); this is intentional, as the chimera-splitting algorithm relies on TransDecoder's ability to find incomplete short ORFs at different ends of a transcript.
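The chunking described for the blastx step amounts to splitting the input sequences evenly across jobs; a minimal round-robin sketch (not Daijin's actual implementation, which splits the FASTA files on disk):

```python
def chunk_records(record_ids, chunks=10):
    """Distribute sequence identifiers round-robin over the requested
    number of chunks, so each BLASTX job receives a similar share."""
    buckets = [[] for _ in range(chunks)]
    for index, record in enumerate(record_ids):
        buckets[index % chunks].append(record)
    return buckets

records = [f"transcript_{i}" for i in range(25)]
buckets = chunk_records(records, chunks=10)
assert len(buckets) == 10
# 25 records over 10 chunks: five chunks get 3 records, five get 2.
assert sorted(len(b) for b in buckets) == [2] * 5 + [3] * 5
```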
Structure of the output directory

Daijin will organise the output directory in 5 major sections, plus the configuration file for the Mikado step:

  1. 1-reads: this folder contains links to the original read files.
  2. 2-alignments: this folder stores the indices built for each aligner, and the results. The structure is as follows:
    • output: this folder contains the final BAM files, sorted and indexed.
    • alignments.stats: this file contains a table reporting the most salient parameters derived from samtools stats calls onto the multiple BAM files.
    • One folder per aligner, inside which are present:
      • index folder, containing the genome indices for the tool
      • One folder per sample, containing the output BAM file.
  3. 3-assemblies: this folder contains the RNA-Seq assemblies which will be used as input by Mikado. The structure is as follows:
    • output: this folder contains the final GTF/GFF3 assemblies. For each combination of aligner/assembler that has been requested, Daijin will here provide:
    • assembly.stats: a tabular file collecting the most salient data from the statistics files generated for each assembly
    • One folder per assembly, containing tool-specific files and the final assembly.
  4. 4-portcullis: this folder contains the results of Portcullis, if its execution has been requested. The folder will contain the following:
    • output: folder containing a merged BED file of reliable junctions, created by merging all junctions from all alignments.
    • One folder per alignment analysed. We redirect you to the documentation of the tool for more details.
  5. mikado.yaml: final output file of the assemble part of the pipeline. This file will act as the configuration both for Daijin and for Mikado; for a description of the Mikado-specific fields, we refer you to the section on the configuration of the tool.
  6. 5-mikado: this folder contains the results for mikado. It is organised as follows:
    1. a link to the genome FASTA, and corresponding FAI file (generated with samtools)
    2. Files created by the prepare step:
      • mikado_prepared.fasta
      • mikado_prepared.gtf
      • prepare.log
    3. transdecoder: this folder contains the results of the TransDecoder run against the mikado_prepared.fasta file. Mikado will use the file transcripts.fasta.transdecoder.bed as source for the ORF location.
    4. blast: this folder contains the BLAST data. In particular:
      • index: this folder contains the FASTA file of the protein database, and the BLAST database files.
      • fastas: Daijin will split mikado_prepared.fasta into multiple files, for easier runs on a cluster. This folder contains the split FASTA files.
      • xmls: this folder contains the XML files corresponding to the BLASTs of the files present in fastas
      • logs: this folder contains the log files corresponding to the BLASTs of the files present in fastas
    5. pick: this folder contains the results of Mikado pick. It is organised as follows:
      • One folder per requested Mikado chimera-splitting mode. Inside each folder, it is possible to find:
        • mikado-{mode}.loci.gff3: Final GFF3 output file.
        • mikado-{mode}.metrics.gff3: Final metrics output file, containing the metrics of the transcripts that have been selected.
        • mikado-{mode}.scores.gff3: Final scores output file, containing the scores associated with the evaluated metrics, for each of the selected transcripts.
        • mikado-{mode}.loci.stats: statistics file derived using mikado util stats (see the section on this utility for details)
      • comparison.stats: this tabular file collects the most salient data from the statistics files generated for each pick mode.
Running the pipeline

Daijin executes the pipeline in two distinct phases, assemble and mikado. Both commands have the same command line interface, namely:

 $ daijin assemble --help
usage: daijin assemble [-h] [-c HPC_CONF] [-d] [--jobs N] [--cores [N]]
                       [--threads N] [--no_drmaa] [--rerun-incomplete]
                       [--forcerun TARGET [TARGET ...]] [--detailed-summary]
                       [--list] [--dag]
                       config

positional arguments:
  config                Configuration file to use for running the transcript
                        assembly pipeline.

optional arguments:
  -h, --help            show this help message and exit
  -c HPC_CONF, --hpc_conf HPC_CONF
                        Configuration file that allows the user to override
                        resource requests for each rule when running under a
                        scheduler in a HPC environment.
  -d, --dryrun          Do a dry run for testing.
  --jobs N, -J N        Maximum number of cluster jobs to execute
                        concurrently.
  --cores [N], -C [N]   Use at most N cores in parallel (default: 1000).
  --threads N, -t N     Maximum number of threads per job. Default: None (set
                        in the configuration file)
  --no_drmaa, -nd       Use this flag if you wish to run without DRMAA, for
                        example, if running on a HPC and DRMAA is not
                        available, or if running locally on your own machine
                        or server.
  --rerun-incomplete, --ri
                        Re-run all jobs the output of which is recognized as
                        incomplete.
  --forcerun TARGET [TARGET ...], -R TARGET [TARGET ...]
                        Force the re-execution or creation of the given rules
                        or files. Use this option if you changed a rule and
                        want to have all its output in your workflow updated.
  --detailed-summary, -D
                        Print detailed summary of all input and output files
  --list, -l            List resources used in the workflow
  --dag                 Do not execute anything and print the directed
                        acyclic graph of jobs in the dot language.

The available command parameters are:

  • config: the configuration file.
  • hpc_conf: cluster configuration file.
  • jobs: Maximum number of jobs that can be executed (if Daijin is in local mode) or be present in the submission queue (if Daijin is in DRMAA/cluster mode) at any one time.
  • dryrun: do not execute, just list all the commands that will be executed. Useful also for listing the rules that have to be executed.
  • cores: Maximum number of cores that Daijin can claim at any one time.
  • threads: Maximum number of cores/threads that can be assigned to any step of the pipeline.
  • rerun-incomplete: Ask Snakemake to check which steps have produced empty or incomplete output files, and re-execute them and all the downstream commands.
  • forcerun: force the re-execution of the given rules or files, together with all downstream steps that depend on their output.
Assemble

In the first step of the pipeline, Daijin will perform the following operations for each of the read datasets provided:

  1. Create the necessary indices for each of the aligner programs requested.
  2. Align the read dataset using all the different tools requested, in all the possible combinations of parameters requested.
    • For example, it is possible to ask each dataset to be aligned twice with TopHat2 - once with the “micro-exon” mode activated, the second time without. Both alignments will be run independently.
    • It is possible to specify which datasets are strand-specific and which are not, and moreover, it is possible to specify the kind of strand-specificity (fr-secondstrand, fr-firststrand).
  3. Call all the reliable junctions across the alignments using Portcullis.
  4. Create the statistics for the alignments using samtools stats, and merge them together in a single file.
  5. Assemble each alignment with all the tools requested, in all the parameter combinations desired.
  6. Call the statistics on each assembly using mikado util stats, and merge them together in a single file.
  7. Create the configuration file for Mikado.

So during this first step Daijin will go from raw read files to multiple assemblies, and configure Mikado for the second step.

Assembly pipeline, as driven by Daijin

_images/daijin_assemble.svg

Example of a pipeline to assemble a single paired-end read dataset using one aligner (Hisat [Hisat]) and two different RNA-Seq assemblers (StringTie [StringTie] and CLASS2 [Class2] ). Reliable junctions from the three alignments are called and merged together using Portcullis.

Mikado

In this step, the Daijin manager will execute all the steps necessary to run Mikado on the desired inputs. The manager will execute the following steps:

  1. Merge all the input assemblies together using Mikado prepare
  2. Execute TransDecoder [Trinity] on the transcript sequences, to retrieve their ORFs.
  3. Split the FASTA file in as many chunks as specified during configuration, and analyse them separately
  4. Execute BLASTX+ [Blastplus] on the split FASTA files, creating BLAST XML outputs.
  5. Run Mikado serialise to load the BLAST results, TransDecoder ORFs, and portcullis junctions into a single database.
  6. Run Mikado pick on the data, in the selected modes.
  7. Collate and collapse the statistics for each of the filtered assemblies.

By default, daijin mikado should be given as config the configuration file created by daijin assemble, which will be located in <output directory>/mikado.yaml.

Mikado pipeline, as driven by Daijin

_images/daijin_mikado.svg

Example of a typical Mikado pipeline. In this case the number of chunks for BLAST is limited - 10 - but we advise increasing this number for big datasets.

Hint

If you have already created some assemblies and wish to analyse them with Daijin, it is also possible to configure Mikado externally and use the resulting configuration file to guide Daijin. At the time of this writing, this is also the recommended protocol for including e.g. PacBio or EST alignments.


Miscellaneous utilities
Mikado miscellaneous scripts

All these utilities can be accessed with the mikado util CLI. They perform relatively minor tasks.

awk_gtf

This utility is used to retrieve specific regions from a GTF file, without breaking any transcript feature. Any transcript falling even partly within the specified coordinates will be retained in its entirety.
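The retention rule (keep any transcript that even partially overlaps the requested region) reduces to a standard interval-overlap test, sketched here:

```python
def overlaps(t_start, t_end, r_start, r_end):
    """True if the transcript [t_start, t_end] intersects the requested
    region [r_start, r_end], even partially."""
    return t_start <= r_end and t_end >= r_start

# Region Chr1:2000..3000
region = (2000, 3000)
# A transcript straddling the left border is kept in its entirety...
assert overlaps(1500, 2500, *region)
# ...while one ending just before the region is discarded.
assert not overlaps(1000, 1999, *region)
```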

Usage:

$ mikado util awk_gtf --help
usage: mikado.py util awk_gtf [-h] (-r REGION | --chrom CHROM) [-as]
                              [--start START] [--end END]
                              gtf [out]


positional arguments:
  gtf
  out

optional arguments:
  -h, --help            show this help message and exit
  -r REGION, --region REGION
                        Region defined as a string like <chrom>:<start>..<end>
  --chrom CHROM
  -as, --assume-sorted
  --start START
  --end END
class_codes

This utility is used to obtain information about any class code or category thereof.

Usage:

$ mikado util class_codes --help
usage: mikado util class_codes [-h]
                               [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                               [-c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]]
                               [-o OUT]
                               [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} ...]]

Script to print out the class codes.

positional arguments:
  {[],=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u}
                        Codes to query.

optional arguments:
  -h, --help            show this help message and exit
  -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
  -c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...], --category {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]
  -o OUT, --out OUT
convert

This utility is used to convert between GTF and GFF3 files, with the possibility of giving as output BED12 files as well. It is limited to converting transcript features, and will therefore ignore any other feature present (transposons, loci, etc.). The output of the conversion to GFF3 is completely GFF3 compliant.

Usage:

$ mikado util convert --help
usage: mikado.py util convert [-h] [-of {bed12,gtf,gff3}] gf [out]

positional arguments:
  gf
  out

optional arguments:
  -h, --help            show this help message and exit
  -of {bed12,gtf,gff3}, --out-format {bed12,gtf,gff3}
grep

This utility extracts specific transcripts and genes from an input GTF/GFF3 file. As input, it requires a text file in either the format “<transcript id><tab><gene id>”, or with one gene per line (in which case the “--genes” switch has to be invoked). If only some of the transcripts of a gene are included in the text file, the gene feature will be shrunk accordingly. The name is an obvious homage to the invaluable UNIX command that we all love.

Usage:

$ mikado util grep --help
usage: mikado.py util grep [-h] [-v] [--genes] ids gff [out]

Script to extract specific models from GFF/GTF files.

positional arguments:
  ids         ID file (format: mrna_id, gene_id - tab separated)
  gff         The GFF file to parse.
  out         Optional output file

optional arguments:
  -h, --help  show this help message and exit
  -v          Exclude from the gff all the records in the id file.
  --genes     Flag. If set, the program expects as ids only a list of genes,
              and will exclude/include all the transcripts children of the
              selected genes.
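As an illustration, an ids file in the two-column format described above might look like this (the identifiers are invented):

```
mikado.mRNA.1	mikado.gene.1
mikado.mRNA.2	mikado.gene.1
mikado.mRNA.5	mikado.gene.3
```

With the --genes switch, the same file would instead contain only one gene identifier per line.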
merge_blast

This script merges together various XML BLAST+ files into a single entity. It might be of use when the input data has been chunked into different FASTA files for submission to a cluster queue. It is also capable of converting from ASN files and of dealing with GZipped files.

Usage:

$ mikado util merge_blast --help
usage: mikado.py util merge_blast [-h] [-v] [-l LOG] [--out [OUT]]
                                  xml [xml ...]

positional arguments:
  xml

optional arguments:
  -h, --help         show this help message and exit
  -v, --verbose
  -l LOG, --log LOG
  --out [OUT]
metrics

This command generates the documentation regarding the available transcript metrics. It is generated dynamically by inspecting the code. The documentation in the introduction is generated using this utility.

Usage:

$ mikado util metrics --help
usage: mikado util metrics [-h]
                           [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                           [-o OUT]
                           [-c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]]
                           [metric [metric ...]]

Simple script to obtain the documentation on the transcript metrics.

positional arguments:
  metric

optional arguments:
  -h, --help            show this help message and exit
  -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
                        Format of the table to be printed out.
  -o OUT, --out OUT     Optional output file
  -c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...], --category {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]
                        Available categories to select from.
stats

This command generates a statistics file for GFF3/GTF files. The output is a table including Average, Mode, and various quantiles for different features present in a typical GFF file (genes, introns, exons, cDNAs, etc.). The operation can be quite time consuming for large files, in which case it is advisable to ask for multiple processors.

Usage:

$ mikado util stats --help
usage: mikado.py util stats [-h] [--only-coding] [-p PROCS] gff [out]

GFF/GTF statistics script. It will compute median/average length of RNAs,
exons, CDS features, etc.

positional arguments:
  gff                   GFF file to parse.
  out

optional arguments:
  -h, --help            show this help message and exit
  --only-coding
  -p PROCS, --processors PROCS

A typical example statistics file can be found here, for the TAIR10 annotation.

trim

This utility trims down the terminal exons of multiexonic transcripts, until either shrinking them to the desired maximum length or meeting the beginning/end of the CDS. It has been used for generating the “trimmed” annotations for the analysis of the original Mikado paper.

Usage:

$ mikado util trim --help
usage: mikado.py util trim [-h] [-ml MAX_LENGTH] [--as-gtf] ann [out]

positional arguments:
  ann                   Reference GTF/GFF output file.
  out

optional arguments:
  -h, --help            show this help message and exit
  -ml MAX_LENGTH, --max_length MAX_LENGTH
                        Maximal length of trimmed terminal exons
  --as-gtf              Flag. If set, the output will be in GTF rather than
                        GFF3 format.
Included scripts

All the following scripts are included in the “util” folder in the source code, and will be included on the PATH after installation. Some of these scripts are used by the Daijin pipeline to produce statistics or perform other intermediate steps.

add_transcript_feature_to_gtf.py

This script is needed to add a top-level transcript feature to GTFs that lack it, eg. those produced by CuffMerge [CuffMerge].

Usage:

$ add_transcript_feature_to_gtf.py --help
usage: Script to add a transcript feature to e.g. Cufflinks GTFs
       [-h] gtf [out]

positional arguments:
  gtf         Input GTF
  out         Output file. Default: stdout.

optional arguments:
  -h, --help  show this help message and exit
align_collect.py

This script is used to collect statistics from samtools stat. Usage:

$ align_collect.py  --help
usage: Script to collect info from multiple samtools stats files
       [-h] input [input ...]

positional arguments:
  input       The list of samtools stats file to process

optional arguments:
  -h, --help  show this help message and exit
asm_collect.py

This script is used to collect statistics obtained from the mikado util stats utility. Output is printed directly to the screen. Usage:

$ asm_collect.py -h
usage: Script to collect info from multiple mikado util stats files
       [-h] input [input ...]

positional arguments:
  input       The list of mikado util stats file to process

optional arguments:
  -h, --help  show this help message and exit
bam2gtf.py

This script will use PySam to convert read alignments into a GTF file. Mostly useful to convert BAM alignments of long reads (eg. PacBio) into a format which Mikado can interpret and use.

Usage:

$ bam2gtf.py --help
usage: Script to convert from BAM to GTF, for PB alignments [-h] bam [out]

positional arguments:
  bam         Input BAM file
  out         Optional output file

optional arguments:
  -h, --help  show this help message and exit
class_run.py

Python3 wrapper for the CLASS [Class2] assembler. It will perform the necessary operations for the assembler (depth and call of the splicing junctions), and launch the program itself. Usage:

$ class_run.py --help
usage: Quick utility to rewrite the wrapper for CLASS. [-h] [--clean]
                                                       [--force]
                                                       [-c CLASS_OPTIONS]
                                                       [-p PROCESSORS]
                                                       [--class_help] [-v]
                                                       [bam] [out]

positional arguments:
  bam                   Input BAM file.
  out                   Optional output file.

optional arguments:
  -h, --help            show this help message and exit
  --clean               Flag. If set, remove tepmorary files.
  --force               Flag. If set, it forces recalculation of all
                        intermediate files.
  -c CLASS_OPTIONS, --class_options CLASS_OPTIONS
                        Additional options to be passed to CLASS. Default: no
                        additional options.
  -p PROCESSORS, --processors PROCESSORS
                        Number of processors to use with class.
  --class_help          If called, the wrapper will ask class to display its
                        help and exit.
  -v, --verbose
getFastaFromIds.py

Script to extract a list of sequences from a FASTA file, using the pyfaidx [PyFaidx] module. Usage:

$ getFastaFromIds.py -h
usage: getFastaFromIds.py [-h] [-v] list fasta [out]

A simple script that retrieves the FASTA sequences from a file given a list of
ids.

positional arguments:
  list           File with the list of the ids to recover, one by line.
                 Alternatively, names separated by commas.
  fasta          FASTA file.
  out            Optional output file.

optional arguments:
  -h, --help     show this help message and exit
  -v, --reverse  Retrieve entries which are not in the list, as in grep -v (a
                 homage).
gffjunc_to_bed12.py

Script to convert a GFF junction file to a BED12 file. Useful to format the input for Mikado serialise.

Usage:

$ gffjunc_to_bed12.py --help
usage: GFF=>BED12 converter [-h] gff [out]

positional arguments:
  gff
  out

optional arguments:
  -h, --help  show this help message and exit
grep.py

A script to extract data from column files, using a list of targets. More efficient than a standard “grep -f” for this niche case.

Usage:

$ util/grep.py -h
usage: grep.py [-h] [-v] [-s SEPARATOR] [-f FIELD] [-q] ids target [out]

This script is basically an efficient version of the GNU "grep -f" utility for
table-like files, and functions with a similar sintax.

positional arguments:
  ids                   The file of patterns to extract
  target                The file to filter
  out                   The output file

optional arguments:
  -h, --help            show this help message and exit
  -v, --reverse         Equivalent to the "-v" grep option
  -s SEPARATOR, --separator SEPARATOR
                        The field separator. Default: consecutive
                        whitespace(s)
  -f FIELD, --field FIELD
                        The field to look in the target file.
  -q, --quiet           No logging.
merge_junction_bed12.py

This script will merge [Portcullis]-like junctions into a single BED12, using the thick start/ends as unique keys.

Usage:

$ merge_junction_bed12.py --help
usage: Script to merge BED12 files *based on the thickStart/End features*.
    Necessary for merging junction files such as those produced by TopHat
       [-h] [--delim DELIM] [-t THREADS] [--tophat] [-o OUTPUT] bed [bed ...]

positional arguments:
  bed                   Input BED files. Use "-" for stdin.

optional arguments:
  -h, --help            show this help message and exit
  --delim DELIM         Delimiter for merged names. Default: ;
  -t THREADS, --threads THREADS
                        Number of threads to use for multiprocessing. Default:
                        1
  --tophat              Flag. If set, tophat-like junction style is assumed.
                        This means that junctions are defined using the
                        blockSizes rather than thickStart/End. The script will
                        convert the lines to this latter format. By default,
                        the script assumes that the intron start/end are
                        defined using thickStart/End like in portcullis.
                        Mixed-type input files are not supported.
  -o OUTPUT, --output OUTPUT
                        Output file. Default: stdout
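The merging logic above can be pictured as keying each junction on its thickStart/thickEnd coordinates. The following Python sketch is illustrative only (the field layout and the score-combination rule are assumptions, not the script's exact behaviour):

```python
from collections import OrderedDict

def merge_junctions(records, delim=";"):
    """Merge BED12-like junction records that share chromosome, strand and
    thickStart/thickEnd (ie the intron coordinates). Names are joined with
    the delimiter; summing the scores is an illustrative choice."""
    merged = OrderedDict()
    for chrom, thick_start, thick_end, strand, name, score in records:
        key = (chrom, thick_start, thick_end, strand)
        if key in merged:
            prev_name, prev_score = merged[key]
            merged[key] = (prev_name + delim + name, prev_score + score)
        else:
            merged[key] = (name, score)
    return merged

junctions = [
    ("Chr1", 1500, 2000, "+", "junc_a", 10),
    ("Chr1", 1500, 2000, "+", "junc_b", 5),   # same intron: merged
    ("Chr1", 3000, 3500, "+", "junc_c", 7),
]
merged = merge_junctions(junctions)
print(merged[("Chr1", 1500, 2000, "+")])  # → ('junc_a;junc_b', 15)
```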
remove_from_embl.py

Quick script to remove sequences from a given organism from SwissProt files, and print them out in FASTA format. Used to produce the BLAST datasets for the Mikado paper. Usage:

$ remove_from_embl.py -h
usage: Script to remove sequences specific of a given organism from a SwissProt file.
       [-h] -o ORGANISM [--format {fasta}] input [out]

positional arguments:
  input
  out

optional arguments:
  -h, --help            show this help message and exit
  -o ORGANISM, --organism ORGANISM
                        Organism to be excluded
  --format {fasta}      Output format. Choices: fasta. Default: fasta.
sanitize_blast_db.py

Simple script to clean the headers of FASTA files, so as to avoid runtime errors and incongruities with BLAST and other tools that might be sensitive to long descriptions or the presence of special characters.

Usage:

$ sanitize_blast_db.py --help
usage: sanitize_blast_db.py [-h] [-o OUT] fasta [fasta ...]

positional arguments:
  fasta

optional arguments:
  -h, --help         show this help message and exit
  -o OUT, --out OUT
split_fasta.py

This script is used to split a FASTA file into a fixed number of files, with an approximately equal number of sequences in each. If the number of sequences in the input file is lower than the number of requested splits, the script will create the necessary number of empty files. Used in the Daijin pipeline to prepare the input data for the BLAST analysis. Usage:

$ split_fasta.py --help
usage: Script to split FASTA sequences in a fixed number of multiple files.
       [-h] [-m NUM_FILES] fasta [out]

positional arguments:
  fasta                 Input FASTA file.
  out                   Output prefix. Default: filename+split

optional arguments:
  -h, --help            show this help message and exit
  -m NUM_FILES, --num-files NUM_FILES
                        Number of files to create. Default: 1000
trim_long_introns.py

This script parses an annotation file and truncates any transcript which has UTR introns over the provided threshold. In such cases, the UTR section after the long intron is simply removed. Usage:

$ trim_long_introns.py --help
usage: This script truncates transcript with UTR exons separated by long introns.
       [-h] [-mi MAX_INTRON] gff [out]

positional arguments:
  gff
  out

optional arguments:
  -h, --help            show this help message and exit
  -mi MAX_INTRON, --max-intron MAX_INTRON
                        Maximum intron length for UTR introns.

Finally, Mikado provides some dedicated utilities to perform common tasks.

  • Some of the utilities are integral to the Mikado suite and can be accessed as subcommands of Mikado. These utilities comprise programs to calculate annotation statistics, retrieve or exclude specific loci from a file, etc.
  • Other utilities are provided as stand-alone scripts; while some of them directly depend on the Mikado library, this is not necessarily the case for them all.

Mikado core algorithms

Picking transcripts: how to define loci and their members

The Mikado pick algorithm

_images/Mikado_algorithm.jpeg

Schematic representation of how Mikado unbundles a complex locus into two separate genes.

Transcripts are scored and selected according to user-defined rules, based on many different features of the transcripts themselves (cDNA length, CDS length, UTR fraction, number of reliable junctions, etc.; please see the dedicated section on scoring for details on the scoring algorithm).

The detection and analysis of a locus proceeds as follows:

  1. When the first transcript is detected, Mikado will create a superlocus - a container of transcripts sharing the same genomic location - and assign the transcript to it.

  2. While traversing the genome, as long as any new transcript is within the maximum allowed flanking distance, it will be added to the superlocus.

  3. When the last transcript is added, Mikado performs the following preliminary operations:
    1. Integrate all the data from the database (including ORFs, reliable junctions in the region, and BLAST homology).
    2. If a transcript is monoexonic, assign or reverse its strand if the ORF data supports the decision
    3. If requested and the ORF data supports the operation, split chimeric transcripts - ie those that contain two or more non-overlapping ORFs on the same strand.
    4. Split the superlocus into groups of transcripts that:
      • share the same strand
      • have at least 1bp overlap
    5. Analyse each of these novel “stranded” superloci separately.
  4. Create subloci, ie group transcripts so to minimize the probability of mistakenly merging multiple gene loci due to chimeras. These groups are defined as follows:
    • if the transcripts are multiexonic, they must share at least one intron, inclusive of the borders
    • if the transcripts are monoexonic, they must overlap by at least 1bp.
    • Monoexonic and multiexonic transcripts cannot be part of the same sublocus.
  5. Select the best transcript inside each sublocus:
    1. Score the transcripts (see the section on scoring)
    2. Select as winner the transcript with the highest score and assign it to a monosublocus
    3. Discard any transcript which is overlapping with it, according to the definitions in the point above
    4. Repeat the procedure from point 2 until no transcript remains in the sublocus
  6. Monosubloci are gathered together into monosubloci holders, ie the seeds for the gene loci. Monosubloci holders have more lenient parameters to group transcripts, as the first phase should have already discarded most chimeras. Once a holder is created by a single monosublocus, any subsequent candidate monosublocus will be integrated only if the following conditions are satisfied:
    • if the candidate is monoexonic, its exon must overlap at least one exon of a transcript already present in the holder
    • if the candidate is multiexonic and the holder contains only monoexonic transcripts, apply the same criterion, ie check whether its exons overlap the exons of at least one of the transcripts already present
    • if the candidate is multiexonic and the holder contains multiexonic transcripts, check whether one of the following conditions is satisfied:
      • at least one intron of the candidate overlaps with an intron of a transcript in the holder
      • at least one intron of the candidate is completely contained within an exon of a transcript in the holder
      • at least one intron of a transcript in the holder is completely contained within an exon of the candidate.
      • the cDNA overlap and CDS overlap between the candidate and the transcript in the holder are over a specified threshold.

    Optionally, it is possible to tell Mikado to use a simpler algorithm, and integrate together all transcripts that share exon space. Such a simpler algorithm risks, however, chaining together multiple loci - especially in small, compact genomes.

  7. Once the holders are created, apply the same scoring and selection procedure of the sublocus selection step. The winning transcripts are assigned to the final loci. These are called the primary transcripts of the loci.

  8. Once the loci are created, track back to the original transcripts of the superlocus:
    1. discard any transcript overlapping more than one locus, as these are probably chimeras.
    2. For those transcripts that are overlapping to a single locus, verify that they are valid alternative splicing events using the class code of the comparison against the primary transcript. Transcripts are re-scored dynamically when they are re-added in this fashion, to ensure their quality when compared with the primary transcript.
  9. Finally detect and either tag or discard fragments inside the initial superlocus (irrespective of strand):
    1. Check whether the primary transcript of any locus meets the criteria to be defined as a fragment (by default, maximum ORF of 30AA and maximum 2 exons - any transcript exceeding either criterion will be considered as non-fragment by default)
    2. If so, verify whether they are near enough any valid locus to be considered as a fragment (in general, class codes which constitute the “Intronic”, “Fragmentary” and “No overlap” categories).
    3. If these conditions are met, tag the locus as a fragment. If requested, Mikado will just discard these transcripts (advised).

These steps help Mikado identify and solve fusions, detect correctly the gene loci, and define valid alternative splicing events.
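The interval tests used when integrating monosubloci into holders (step 6 above) can be sketched in a few lines of Python. This is an illustrative sketch under assumed (start, end) tuples; Mikado's actual implementation differs:

```python
def overlaps(a, b):
    """1bp-overlap test between two (start, end) intervals, inclusive."""
    return a[0] <= b[1] and b[0] <= a[1]

def contained(inner, outer):
    """True if interval `inner` lies completely within interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def intron_compatible(candidate_introns, holder_introns, holder_exons):
    """Sketch of two of the multiexonic-vs-multiexonic holder checks:
    an intron/intron overlap, or a candidate intron completely contained
    within a holder exon."""
    for intron in candidate_introns:
        if any(overlaps(intron, other) for other in holder_introns):
            return True
        if any(contained(intron, exon) for exon in holder_exons):
            return True
    return False

# A candidate intron overlapping a holder intron:
print(intron_compatible([(100, 200)], [(150, 300)], []))  # → True
```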

Transcript measurements and scoring

In order to determine the best transcript for each locus, Mikado measures each available candidate according to various metrics and assigns a specific score for each of them. Similarly to RAMPART [Rampart], Mikado will assign a score to each transcript for each metric by assessing it relative to the other transcripts in the locus. The particular rescaling equation used for a given metric depends on the type of feature it represents:

  • metrics where higher values represent better transcript assemblies (“maximum”).
  • metrics where lower values represent better transcript assemblies (“minimum”)
  • metrics where values closer to a defined value represent better assemblies (“target”)

To allow for this tripartite scoring system with disparate input values, we have to employ rescaling equations so that each metric for each transcript will be assigned a score between 0 and 1. Optionally, each metric might be assigned a greater weight so that its maximum possible value will be greater or smaller than 1. Formally, let metric \(m\) be one of the available metrics \(M\), \(t\) a transcript in locus \(L\), \(w_{m}\) the weight assigned to metric \(m\), and \(r_{mt}\) the raw value of metric \(m\) for \(t\). Then, the score to metric \(m\) for transcript \(t\), \(s_{mt}\), will be derived using one of the following three different rescaling equations:

  • If higher values are best:
    \(s_{mt} = w_{m} * (\frac{r_{mt} - min(r_m)}{max(r_m)-min(r_m)})\)
  • If lower values are best:
    \(s_{mt} = w_{m} * (1 - \frac{r_{mt} - min(r_m)}{max(r_m)-min(r_m)})\)
  • If values closer to a target \(v_{m}\) are best:
    \(s_{mt} = w_{m} * (1 - \frac{|r_{mt} - v_{m}|}{max(|r_{m} - v_{m}|)})\)
Finally, the scores for each metric will be summed up to produce a final score for the transcript:
\(s_{t} = \sum_{m \in M} s_{mt}\).
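The three rescaling equations can be written out directly. The following sketch (not Mikado's code) scores one metric across the transcripts of a locus:

```python
def rescale(raw_values, rescaling="max", weight=1.0, target=None):
    """Score a list of raw metric values between 0 and `weight`, following
    the three rescaling equations. Illustrative sketch, not Mikado's code."""
    lo, hi = min(raw_values), max(raw_values)
    scores = []
    for r in raw_values:
        if rescaling == "max":
            # Higher values are best
            s = (r - lo) / (hi - lo) if hi != lo else 1.0
        elif rescaling == "min":
            # Lower values are best
            s = 1 - (r - lo) / (hi - lo) if hi != lo else 1.0
        elif rescaling == "target":
            # Values closest to `target` are best
            denom = max(abs(v - target) for v in raw_values)
            s = 1 - abs(r - target) / denom if denom != 0 else 1.0
        else:
            raise ValueError("Unknown rescaling: " + rescaling)
        scores.append(weight * s)
    return scores

# cDNA lengths of three transcripts in a locus, higher is better:
print(rescale([500, 1500, 2500], rescaling="max"))  # → [0.0, 0.5, 1.0]
```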

Not all the available metrics will be necessarily used for scoring; the choice of which to employ and how to score and weight each of them is left to the experimenter, although Mikado provides some pre-configured scoring files.

Important

The scoring algorithm is dependent on the other transcripts in the locus, so each score should not be taken as an absolute measure of the reliability of a transcript, but rather as a measure of its relative goodness compared with the alternatives. Shifting a transcript from one locus to another can have dramatic effects on the scoring of a transcript, even while the underlying metric values remain unchanged. This is why the score assigned to each transcript changes throughout the Mikado run, as transcripts are moved to subloci, monoloci and finally loci.

Note

Starting from beta8, Mikado allows for metrics whose value range is between 0 and 1 to be used directly as scores.

Scoring files

Mikado employs user-defined configuration files to define the desirable features in genes. These files are in either YAML or JSON format (default YAML) and are composed of four sections:

  1. a requirements section, specifying the minimum requirements that a transcript must satisfy to be considered as valid. Any transcript failing these requirements will be scored at 0 and purged.
  2. a not_fragmentary section, specifying the minimum requirements that the primary transcript of a locus has to satisfy in order for the locus not to be considered as a putative fragment.
  3. an as_requirements section, which specifies the minimum requirements for transcripts for them to be considered as possible valid alternative splicing events.
  4. a scoring section, specifying which features Mikado should look for in transcripts, and how each of them will be weighted.

Conditions are specified using a strict set of available operators and the values they have to consider.

Important

Although at the moment Mikado does not offer any method to establish machine-learning based scoring configurations, it is a topic we plan to investigate in the future. Mikado already supports Random Forest Regressors as scorers through Scikit-learn, but we have yet to devise a proper way to create such regressors.

Operators

Mikado allows the following operators to express a relationship inside the scoring files:

  • eq: equal to (\(=\)). Valid for comparisons with numbers, boolean values, and strings.
  • ne: different from (\(\neq\)). Valid for comparisons with numbers, boolean values, and strings.
  • lt: less than (\(<\)). Valid for comparisons with numbers.
  • gt: greater than (\(>\)). Valid for comparisons with numbers.
  • le: less or equal than (\(\le\)). Valid for comparisons with numbers.
  • ge: greater or equal than (\(\ge\)). Valid for comparisons with numbers.
  • in: member of (\(\in\)). Valid for comparisons with arrays or sets.
  • not in: not member of (\(\notin\)). Valid for comparisons with arrays or sets.
  • within: value comprised in the range of the two values, inclusive.
  • not within: value not comprised in the range of the two values, inclusive.

Mikado will fail if an operator not present on this list is specified, or if the operator is assigned to compare against the wrong data type (eg. eq with an array).
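These operators map naturally onto Python's comparison semantics. A minimal sketch of how such an operator table might look (an illustration, not Mikado's actual implementation):

```python
import operator

# Map of scoring-file operator names to illustrative Python implementations.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "gt": operator.gt,
    "le": operator.le,
    "ge": operator.ge,
    "in": lambda value, container: value in container,
    "not in": lambda value, container: value not in container,
    # "within" is inclusive of both range endpoints:
    "within": lambda value, rng: min(rng) <= value <= max(rng),
    "not within": lambda value, rng: not (min(rng) <= value <= max(rng)),
}

print(OPERATORS["within"](150, (100, 200)))  # → True
print(OPERATORS["ge"](5, 5))                 # → True
```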

The “requirements”, “as_requirements” and “not_fragmentary” sections

These sections specify the minimum requirements for a transcript at various stages.

  • A transcript failing the requirements check will be discarded outright (if “purge” is selected) or given a score of 0 otherwise.
  • If a transcript has not been selected as the primary transcript of a locus, it has to pass the as_requirements check to be considered as a valid alternative splicing event.
  • Finally, after loci have been defined, the primary transcripts of loci that do not pass the not_fragmentary check mark their loci to be compared against neighbouring loci which have passed this same check.

It is strongly advised to use lenient parameters in the requirements section, as failing to do so might result in discarding whole loci. Typically, transcripts filtered at this step should be obvious fragments, eg monoexonic transcripts produced by RNA-Seq with a total length lower than the library fragment length. This section is composed of two parts:

  • parameters: a list of the metrics to be considered. Each metric can be considered multiple times, by suffixing it with a “.<id>” construct (eg cdna_length.mono vs. cdna_length.multi to distinguish two uses of the cdna_length metric - once for monoexonic and once for multiexonic transcripts). Any parameter which is not a valid metric name, after removal of the suffix, will cause an error. Parameters have to specify the following:
    • a value that the metric has to be compared against
    • an operator that specifies the target operation. See the operators section.
  • expression: a string array that will be compiled into a valid boolean expression. All the metrics present in the expression string must be present in the parameters section. If an unrecognized metric is present, Mikado will crash immediately, complaining that the scoring file is invalid. Apart from brackets, Mikado accepts only the following boolean operators to chain the metrics:
    • and
    • or
    • not
    • xor

Hint

if no expression is specified, Mikado will construct one by chaining all the provided parameters with the and operator. Most of the time, this would result in unexpected behaviour - ie Mikado assigning a score of 0 to most transcripts. It is strongly advised to explicitly provide a valid expression.

As an example, the following snippet replicates a typical requirements section found in a scoring file:

requirements:
  expression: [((exon_num.multi and cdna_length.multi and max_intron_length and min_intron_length), or,
    (exon_num.mono and cdna_length.mono))]
  parameters:
    cdna_length.mono: {operator: gt, value: 50}
    cdna_length.multi: {operator: ge, value: 100}
    exon_num.mono: {operator: eq, value: 1}
    exon_num.multi: {operator: gt, value: 1}
    max_intron_length: {operator: le, value: 200000}
    min_intron_length: {operator: ge, value: 5}
In order:
  • In the parameters section, we ask for the following:
    • exon_num.mono: monoexonic transcripts must have one exon (“eq”)
    • exon_num.multi: multiexonic transcripts must have more than one exon (“gt”)
    • cdna_length.mono: monoexonic transcripts must have a length greater than 50 bps (the “.mono” suffix is arbitrary, as long as it is unique for all calls of cdna_length)
    • cdna_length.multi: multiexonic transcripts must have a length greater than or equal to 100 bps (the “.multi” suffix is arbitrary, as long as it is unique for all calls of cdna_length)
    • max_intron_length: multiexonic transcripts should not have any intron longer than 200,000 bps.
    • min_intron_length: multiexonic transcripts should not have any intron smaller than 5 bps.
  • the expression field will be compiled into the following expression:

    (exon_num > 1 and cdna_length >= 100 and max_intron_length <= 200000 and min_intron_length >= 5) or (exon_num == 1 and cdna_length > 50)
    

Any transcript for which the expression evaluates to \(False\) will be assigned a score of 0 outright and discarded, unless the user has chosen to disable the purging of such transcripts.
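The evaluation of the compiled expression can be illustrated with plain Python booleans; the transcript values below are invented, and the function is a sketch of what Mikado compiles from the scoring file:

```python
def passes_requirements(exon_num, cdna_length, max_intron_length, min_intron_length):
    """Evaluate the compiled expression from the snippet above.
    Sketch only - Mikado builds this dynamically from the scoring file."""
    multi_ok = (exon_num > 1 and cdna_length >= 100
                and max_intron_length <= 200000 and min_intron_length >= 5)
    mono_ok = exon_num == 1 and cdna_length > 50
    return multi_ok or mono_ok

# A 3-exon transcript with sane intron lengths passes:
print(passes_requirements(3, 800, 5000, 60))   # → True
# A 40 bp monoexonic fragment fails:
print(passes_requirements(1, 40, 0, 0))        # → False
```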

The scoring section

This section specifies which metrics will be used by Mikado to score the transcripts. Each metric to be used is specified as a subsection of the configuration, and will have the following attributes:

  • rescaling: the only compulsory attribute. It specifies the kind of scoring that will be applied to the metric, and it has to be one of “max”, “min”, or “target”. See the explanation on the scoring algorithm for details.
  • value: compulsory if the chosen rescaling algorithm is “target”. This should be either a number or a boolean value.
  • multiplier: the weight assigned to the metric in terms of scoring. This parameter is optional; if absent, as it is in the majority of cases, Mikado will consider the multiplier to be equal to 1. This is the \(w_{m}\) element in the equations above.
  • filter: It is possible to specify a filter which the metric has to fulfill to be considered for scoring, eg, “cdna_length >= 200”. If the transcript fails to pass this filter, the score for this metric only will be set to 0. A “filter” subsection has to specify the following:
    • operator: the operator to apply for the boolean expression. See the relative section.
    • value: value that will be used to assess the metric.

Hint

the purpose of the filter section is to allow fine-tuning of the scoring mechanism; ie it makes it possible to penalise transcripts with undesirable qualities (eg a possible retained intron) without discarding them outright. As such, it is a less harsh version of the requirements section and it is the preferred way of specifying which transcript features Mikado should be wary of.

For example, this is a snippet of a scoring section:

scoring:
    blast_score: {rescaling: max}
    cds_not_maximal: {rescaling: min}
    combined_cds_fraction: {rescaling: target, value: 0.8, multiplier: 2}
    five_utr_length:
        filter: {operator: le, value: 2500}
        rescaling: target
        value: 100
    end_distance_from_junction:
        filter: {operator: lt, value: 55}
        rescaling: min

Using this snippet as a guide, Mikado will score transcripts in each locus as follows:

  • Assign a full score (one point, as no multiplier is specified) to transcripts which have the greatest blast_score
  • Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest amount of CDS bases in secondary ORFs (cds_not_maximal)
  • Assign a full score (two points, as a multiplier of 2 is specified) to transcripts that have a total amount of CDS bps approximating 80% of the transcript cDNA length (combined_cds_fraction)
  • Assign a full score (one point, as no multiplier is specified) to transcripts that have a 5’ UTR whose length is nearest to 100 bps (five_utr_length); if the 5’ UTR is longer than 2,500 bps, this score will be 0 (see the filter section)
  • Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest distance between the CDS end and the most downstream exon-exon junction (end_distance_from_junction). If such a distance is greater than 55 bps, assign a score of 0, as it is a probable target for NMD (see the filter section).
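
The rescaling behind "max", "min" and "target" can be sketched as follows. The min-max and distance formulas here are inferred from the description above and simplified relative to Mikado's implementation; the multiplier \(w_{m}\) would then scale the resulting 0-1 score.

```python
# Sketch of the per-locus rescaling step (assumed formulas, for illustration)

def rescale(values, rescaling, target=None):
    """Map the metric values of all transcripts in a locus onto [0, 1]."""
    if rescaling == "target":
        distances = [abs(v - target) for v in values]
        worst = max(distances)
        if worst == 0:  # every transcript hits the target exactly
            return [1.0] * len(values)
        return [1 - d / worst for d in distances]
    vmin, vmax = min(values), max(values)
    if vmin == vmax:  # all transcripts equal: everyone gets the full score
        return [1.0] * len(values)
    if rescaling == "max":
        return [(v - vmin) / (vmax - vmin) for v in values]
    return [(vmax - v) / (vmax - vmin) for v in values]  # "min"

# cdna_length of three transcripts, rescaling "max": the longest scores 1
print(rescale([300, 800, 1800], "max"))                 # ~[0.0, 0.33, 1.0]
# combined_cds_fraction with target 0.8: the closest to 0.8 scores 1
print(rescale([0.5, 0.8, 0.95], "target", target=0.8))  # ~[0.0, 1.0, 0.5]
```
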
Metrics

These are all the metrics available to quantify transcripts. The documentation for this section has been generated using the metrics utility.

Metrics belong to one of the following categories:

  • Descriptive: these metrics merely provide a description of the transcript (eg its ID) and are not used for scoring.
  • cDNA: these metrics refer to basic features of any transcript such as its number of exons, its cDNA length, etc.
  • Intron: these metrics refer to features related to the number of introns and their lengths.
  • CDS: these metrics refer to features related to the CDS assigned to the transcript.
  • UTR: these metrics refer to features related to the UTR of the transcript. In the case in which a transcript has been assigned multiple ORFs, unless otherwise stated the UTR metrics will be derived only considering the selected ORF, not the combination of all of them.
  • Locus: these metrics refer to features of the transcript in relationship to all other transcripts in its locus, eg how many of the introns present in the locus are present in the transcript. These metrics are calculated by Mikado during the picking phase, and as such their value can vary during the different stages as the transcripts are shifted to different groups.
  • External: these metrics are derived from accessory data recovered for the transcript at run time. Examples include the number of introns confirmed by external programs such as Portcullis, or the BLAST score of the best hits.

Hint

Starting from version 1 beta8, Mikado allows the use of externally defined metrics for the transcripts. These can be accessed using the keyword “external.<name of the metric>” within the configuration file. See the relevant section for details.

Important

Starting from Mikado 1 beta 8, it is possible to use metrics with values between 0 and 1 directly as scores, without rescaling. This feature is available only for metrics whose values naturally lie between 0 and 1, or that are boolean in nature.
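
As a sketch, a metric flagged as “Usable raw” in the table below could be configured along these lines. The flag name use_raw is an assumption of this sketch; check the configuration documentation of your version:

```yaml
scoring:
    # proportion_verified_introns naturally lies in [0, 1], so its value
    # can be used directly as the score, without min-max rescaling
    proportion_verified_introns: {rescaling: max, use_raw: true}
```
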

Metric name Description Category Data type Usable raw
         
tid ID of the transcript - cannot be an undefined value. Alias of id. Descriptive str False
parent Name of the parent feature of the transcript. Descriptive str False
score Numerical value which summarizes the reliability of the transcript. Descriptive str False
external_scores SPECIAL this Namespace contains all the information regarding external scores for the transcript. If a requested property is not defined in the Namespace, Mikado will set a default value of 0 in the Namespace and return it. External Namespace True
best_bits Metric that returns the best bit score (BitS) associated with the transcript. External float False
blast_identity This metric will return the alignment identity for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. External float True
blast_query_coverage This metric will return the query coverage for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. External float True
blast_score Interchangeable alias for testing different blast-related scores. Current: best bit score. External float False
blast_target_coverage This metric will return the target coverage for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. External float True
canonical_intron_proportion This metric returns the proportion of canonical introns of the transcript on its total number of introns. Intron float True
cdna_length This property returns the length of the transcript. cDNA int False
cds_not_maximal This property returns the length of the CDS excluded from the selected ORF. CDS int False
cds_not_maximal_fraction This property returns the fraction of bases not in the selected ORF compared to the total number of CDS bases in the cDNA. CDS float True
combined_cds_fraction This property returns the percentage of the CDS part of the transcript vs. the cDNA length. CDS float True
combined_cds_intron_fraction This property returns the fraction of CDS introns of the transcript vs. the total number of CDS introns in the Locus. If the transcript is by itself, it returns 1. Locus float True
combined_cds_length This property returns the length of the CDS part of the transcript. CDS int False
combined_cds_locus_fraction This metric returns the fraction of CDS bases of the transcript vs. the total of CDS bases in the locus. Locus float True
combined_cds_num This property returns the number of non-overlapping CDS segments in the transcript. CDS int False
combined_cds_num_fraction This property returns the fraction of non-overlapping CDS segments in the transcript vs. the total number of exons CDS float True
combined_utr_fraction This property returns the fraction of the cDNA which is not coding according to any ORF. Complement of combined_cds_fraction UTR float True
combined_utr_length This property returns the length of the UTR part of the transcript. UTR int False
end_distance_from_junction This metric returns the cDNA distance between the stop codon and the last junction in the transcript. In many eukaryotes, this distance cannot exceed 50-55 bps otherwise the transcript becomes a target of NMD. If the transcript is not coding or there is no junction downstream of the stop codon, the metric returns 0. This metric considers the combined CDS end. CDS int False
end_distance_from_tes This property returns the distance of the end of the combined CDS from the transcript end site. If no CDS is defined, it defaults to 0. CDS int False
exon_fraction This property returns the fraction of exons of the transcript which are contained in the sublocus. If the transcript is by itself, it returns 1. Set from outside. Locus float True
exon_num This property returns the number of exons of the transcript. cDNA int False
five_utr_length Returns the length of the 5’ UTR of the selected ORF. UTR int False
five_utr_num This property returns the number of 5’ UTR segments for the selected ORF. UTR int False
five_utr_num_complete This property returns the number of 5’ UTR segments for the selected ORF, considering only those which are complete exons. UTR int False
has_start_codon Boolean. True if the selected ORF has a start codon. CDS bool False
has_stop_codon Boolean. True if the selected ORF has a stop codon. CDS bool False
highest_cds_exon_number This property returns the maximum number of CDS segments among the ORFs; this number can refer to an ORF DIFFERENT from the maximal ORF. CDS int False
highest_cds_exons_num Returns the number of CDS segments in the selected ORF (irrespective of the number of exons involved) CDS int False
intron_fraction This property returns the fraction of introns of the transcript vs. the total number of introns in the Locus. If the transcript is by itself, it returns 1. Set from outside. Locus float True
is_complete Boolean. True if the selected ORF has both start and end. CDS bool False
max_exon_length This metric will return the length of the biggest exon in the transcript. cDNA int False
max_intron_length This property returns the greatest intron length for the transcript. Intron int False
min_exon_length This metric will return the length of the smallest exon in the transcript. cDNA int False
min_intron_length This property returns the smallest intron length for the transcript. Intron int False
non_verified_introns_num This metric returns the number of introns of the transcript which are not validated by external data. External int False
num_introns_greater_than_max This metric returns the number of introns greater than the maximum acceptable intron size indicated in the constructor. Intron int False
num_introns_smaller_than_min This metric returns the number of introns smaller than the minimum acceptable intron size indicated in the constructor. Intron int False
number_internal_orfs This property returns the number of ORFs inside a transcript. CDS int False
only_non_canonical_splicing This metric will return True if the canonical_number is 0 Intron bool False
proportion_verified_introns This metric returns, as a fraction, how many of the transcript introns are validated by external data. Monoexonic transcripts are set to 1. External float True
proportion_verified_introns_inlocus This metric returns, as a fraction, how many of the verified introns inside the Locus are contained inside the transcript. Locus float True
retained_fraction This property returns the fraction of the cDNA which is contained in retained introns. Locus float True
retained_intron_num This property records the number of introns in the transcripts which are marked as being retained. See the corresponding method in the sublocus class. Locus int False
selected_cds_exons_fraction Returns the fraction of CDS segments in the selected ORF (irrespective of the number of exons involved) CDS float True
selected_cds_fraction This property calculates the fraction of the selected CDS vs. the cDNA length. CDS float True
selected_cds_intron_fraction This property returns the fraction of CDS introns of the selected ORF of the transcript vs. the total number of CDS introns in the Locus (considering only the selected ORF). If the transcript is by itself, it should return 1. CDS float True
selected_cds_length This property calculates the length of the CDS selected as best inside the cDNA. CDS int False
selected_cds_locus_fraction This metric returns the fraction of CDS bases of the transcript vs. the total of CDS bases in the locus. Locus float True
selected_cds_num This property calculates the number of CDS exons for the selected ORF CDS int False
selected_cds_number_fraction This property returns the proportion of best possible CDS segments vs. the number of exons. See selected_cds_number. CDS float False
selected_end_distance_from_junction This metric returns the distance between the stop codon and the last junction of the transcript. In many eukaryotes, this distance cannot exceed 50-55 bps, otherwise the transcript becomes a target of NMD. If the transcript is not coding or there is no junction downstream of the stop codon, the metric returns 0. CDS int False
selected_end_distance_from_tes This property returns the distance of the end of the best CDS from the transcript end site. If no CDS is defined, it defaults to 0. CDS int False
selected_start_distance_from_tss This property returns the distance of the start of the best CDS from the transcript start site. If no CDS is defined, it defaults to 0. CDS int False
snowy_blast_score Metric that indicates how good a hit is compared to the competition, in terms of BLAST similarities. As in SnowyOwl, the score for each hit is calculated by taking the coverage of the target and dividing it by (2 * len(self.blast_hits)). IMPORTANT: when splitting transcripts by ORF, a blast hit is added to the new transcript only if it is contained within the new transcript. This will slightly skew the homology score. External float False
source_score This metric returns a score that is assigned to the transcript in virtue of its origin. External int False
start_distance_from_tss This property returns the distance of the start of the combined CDS from the transcript start site. If no CDS is defined, it defaults to 0. CDS int False
suspicious_splicing This metric will return True if the transcript either has canonical introns on both strands (probably a chimeric artifact between two neighbouring loci), or if it has no canonical splicing event but it would if it were assigned to the opposite strand (probably a strand misassignment on the part of the assembler/predictor). Intron bool False
three_utr_length Returns the length of the 3’ UTR of the selected ORF. UTR int False
three_utr_num This property returns the number of 3’ UTR segments (referred to the selected ORF). UTR int False
three_utr_num_complete This property returns the number of 3’ UTR segments for the selected ORF, considering only those which are complete exons. UTR int False
utr_fraction This property calculates the length of the UTR of the selected ORF vs. the cDNA length. UTR float True
utr_length Returns the sum of the 5’+3’ UTR lengths UTR int False
utr_num Returns the number of UTR segments (referred to the selected ORF). UTR int False
utr_num_complete Returns the number of UTR segments which are complete exons (referred to the selected ORF). UTR int False
verified_introns_num This metric returns the number of introns of the transcript which are validated by external data. External int False
External metrics

Starting from version 1 beta 8, Mikado allows external metrics to be loaded into the database and used for evaluating transcripts. Metrics of this kind must have values between 0 and 1.
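
For instance, once loaded, an external metric could be scored like any built-in metric. Both the metric name fraction_covered and its prior serialisation into the Mikado database are assumptions of this sketch:

```yaml
scoring:
    # "fraction_covered" is a hypothetical external metric, previously
    # serialised into the Mikado database; its values must lie in [0, 1]
    external.fraction_covered: {rescaling: max}
```
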

Technical details

Most of the selection (ie “pick”) stage of the pipeline relies on the implementation of the objects in the loci submodule. In particular, the library defines an abstract class, “Abstractlocus”, which requires all its children to implement a version of the “is_intersecting” method. Each implementation of the method is specific to its stage. So the superlocus class requires, in its “is_intersecting” method, only overlap between the transcripts, optionally extended by a flanking region and optionally restricting the groups to transcripts that share the same strand. The sublocus class implements a different algorithm, and so on. Scoring is performed by first recalculating the metrics (.calculate_metrics) and subsequently calculating the scores (.calculate_scores). Mikado caches metrics and scores and avoids recalculating them wherever possible, to make the program faster.
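
The hierarchy described above can be sketched as follows. The class and method names mirror the text, but the overlap logic and the stand-in transcript class are simplified illustrations, not Mikado's code:

```python
# Minimal sketch of the abstract-locus pattern described above
from abc import ABCMeta, abstractmethod

class AbstractLocus(metaclass=ABCMeta):
    """Container of transcripts; subclasses define their own grouping rule."""

    @classmethod
    @abstractmethod
    def is_intersecting(cls, t1, t2, flank=0):
        """Decide whether two transcripts belong to the same group."""

class Superlocus(AbstractLocus):
    @classmethod
    def is_intersecting(cls, t1, t2, flank=0):
        # Plain coordinate overlap, optionally bridged by a flanking region
        return t1.start - flank <= t2.end and t2.start - flank <= t1.end

class T:
    """Stand-in transcript with just the coordinates the sketch needs."""
    def __init__(self, start, end):
        self.start, self.end = start, end

print(Superlocus.is_intersecting(T(100, 500), T(450, 900)))             # True
print(Superlocus.is_intersecting(T(100, 500), T(600, 900)))             # False
print(Superlocus.is_intersecting(T(100, 500), T(600, 900), flank=150))  # True
```
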

Metrics are an extension of the property construct in Python3. Compared to normal properties, they are distinguished only by three optional descriptive attributes: category, usable_raw, and rtype. The main reason to subclass property is to allow Mikado to be self-aware of which properties will be used for scoring transcripts, and which will not. So, for example, in the following snippet from the Transcript class definition:

@property
def combined_cds(self):
    """This is a list which contains all the non-overlapping CDS
    segments inside the cDNA. The list comprises the segments
    as duples (start,end)."""
    return self.__combined_cds

@combined_cds.setter
def combined_cds(self, combined):
    """
    Setter for combined_cds. It performs some basic checks,
    e.g. that all the members of the list are integer duples.

    :param combined: list
    :type combined: list[(int,int)]
    """

    if ((not isinstance(combined, list)) or
            any(self.__wrong_combined_entry(comb) for comb in combined)):
        raise TypeError("Invalid value for combined CDS: {0}".format(combined))
    self.__combined_cds = combined

@Metric
def combined_cds_length(self):
    """This property return the length of the CDS part of the transcript."""
    c_length = sum([c[1] - c[0] + 1 for c in self.combined_cds])
    if len(self.combined_cds) > 0:
        assert c_length > 0
    return c_length

combined_cds_length.category = "CDS"

@Metric
def combined_cds_num(self):
    """This property returns the number of non-overlapping CDS segments
    in the transcript."""
    return len(self.combined_cds)

combined_cds_num.category = "CDS"

@Metric
def has_start_codon(self):
    """Boolean. True if the selected ORF has a start codon.
    :rtype: bool"""
    return self.__has_start_codon

@has_start_codon.setter
def has_start_codon(self, value):
    """Setter. Checks that the argument is boolean.
    :param value: boolean flag
    :type value: bool
    """

    if value not in (None, False, True):
        raise TypeError(
            "Invalid value for has_start_codon: {0}".format(type(value)))
    self.__has_start_codon = value

has_start_codon.category = "CDS"

Mikado will recognize that “combined_cds” is a normal property, while “combined_cds_length”, “combined_cds_num” and “has_start_codon” are Metrics (and, as such, can be assigned a “category”; by default, that attribute is None). Please note that Metrics behave and are coded like normal properties in every other regard - including docstrings, setters and deleters.
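
A minimal sketch of this construct, simplified relative to Mikado's actual Metric: a property subclass carrying extra descriptive attributes, which makes it possible to scan a class for scorable metrics while ignoring plain properties.

```python
# Sketch of a property subclass used to tag scorable metrics (illustrative)

class Metric(property):
    # Default descriptive attributes; instances may override them
    category = None
    usable_raw = False
    rtype = None

class Transcript:
    def __init__(self, exons):
        self.exons = exons

    @property
    def id(self):  # a normal property: ignored by the metric scan below
        return "t1"

    @Metric
    def exon_num(self):
        """Number of exons of the transcript."""
        return len(self.exons)

    exon_num.category = "cDNA"

# Discover which attributes are Metrics rather than plain properties
metrics = [name for name, attr in vars(Transcript).items()
           if isinstance(attr, Metric)]
print(metrics)  # ['exon_num']
print(Transcript(["e1", "e2"]).exon_num)  # 2
```
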

The requirements expression is evaluated using eval.

Warning

While we took pains to ensure that the expression is properly sanitised and inspected before eval, Mikado might still be vulnerable to clever code injection attacks. Do not execute Mikado with superuser privileges if you want to guard against such attacks, and always inspect third-party YAML scoring files before execution!

References

[ParsEval]ParsEval: parallel comparison and analysis of gene structure annotations Daniel S Standage and Volker P Brendel. BMC Bioinformatics, 2012, doi:10.1186/1471-2105-13-187
[RGASP]Assessment of transcript reconstruction methods for RNA-seq Tamara Steijger, Josep F Abril, Pär G Engström, Felix Kokocinski, The RGASP Consortium, Tim J Hubbard, Roderic Guigó, Jennifer Harrow & Paul Bertone. Nature Methods, 2013, doi:10.1038/nmeth.2714
[SnowyOwl]SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models Ian Reid, Nicholas O’Toole, Omar Zabaneh, Reza Nourzadeh, Mahmoud Dahdouli, Mostafa Abdellateef, Paul MK Gordon, Jung Soh, Gregory Butler, Christoph W Sensen and Adrian Tsang. BMC Bioinformatics, 2014, doi:10.1186/1471-2105-15-229
[CuffMerge]Identification of novel transcripts in annotated genomes using RNA-Seq Adam Roberts, Harold Pimentel, Cole Trapnell and Lior Pachter. Bioinformatics, 2011, doi:10.1093/bioinformatics/btr355
[Class2]CLASS2: accurate and efficient splice variant annotation from RNA-seq reads Li Song, Sarven Sabunciyan and Liliana Florea. Bioinformatics, 2016, doi:10.1093/nar/gkw158
[PyFaidx]Efficient “pythonic” access to FASTA files using pyfaidx Matthew D Shirley​, Zhaorong Ma, Brent S Pedersen and Sarah J Wheelan. PeerJ PrePrints 3:e1196, 2015. doi:10.7287/peerj.preprints.970v1
[Snake]Snakemake—a scalable bioinformatics workflow engine Johannes Köster and Sven Rahmann. Bioinformatics, 2012, doi:10.1093/bioinformatics/bts480
[Trinity]De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity Brian J Haas, et al. Nature Protocols, 2013. doi:10.1038/nprot.2013.084
[Blastplus]BLAST+: architecture and applications. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BMC Bioinformatics, 2009. doi:10.1186/1471-2105-10-421
[STAR]STAR: ultrafast universal RNA-seq aligner Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson and Thomas R. Gingeras. Bioinformatics, 2012. doi:10.1093/bioinformatics/bts635
[Hisat]HISAT: a fast spliced aligner with low memory requirements Daehwan Kim, Ben Langmead and Stevan L Salzberg. Nature Methods, 2015. doi:10.1038/nmeth.3317
[TopHat2]TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley and Steven L Salzberg. Genome Biology, 2013. doi:10.1186/gb-2013-14-4-r36
[StringTie]StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Mihaela Pertea, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell and Steven L Salzberg. Nature Biotechnology, 2015. doi:10.1038/nbt.3122
[GMAP]GMAP: a genomic mapping and alignment program for mRNA and EST sequences Thomas D. Wu and Colin K. Watanabe. Bioinformatics 2005. doi:10.1093/bioinformatics/bti310
[uORFs]uAUG and uORFs in human and rodent 5′untranslated mRNAs. Michele Iacono, Flavio Mignone and Graziano Pesole. Gene, 2005. doi:10.1016/j.gene.2004.11.041
[PyYaml]Pyyaml library K Simonov. http://pyyaml.org/, 2006.
[Cython]Cython: the best of both worlds. Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn and Kurt Smith. AIP - Computing in Science & Engineering, 2011. doi:10.1109/MCSE.2010.118
[Numpy]The NumPy array: A structure for efficient numerical computation. Stefan van der Walt, S. Chris Colbert and Gael Varoquaux. Computing in Science & Engineering, 2011. doi:10.1109/MCSE.2011.37
[Scipy]SciPy: Open Source Scientific Tools for Python. Eric Jones and Travis Oliphant and Pearu Peterson et al. http://www.scipy.org/*, 2001.
[NetworkX]Exploring network structure, dynamics, and function using NetworkX Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart. Proceedings of the 7th Python in Science Conference (SciPy2008), 2008. doi:
[BioPython]Biopython: freely available Python tools for computational molecular biology and bioinformatics. PJA Cock, T Antao, JT Chang, BA Chapman, CJ Cox, A Dalke, I Friedberg, T Hamelryck, F Kauff, B Wilczynski and MJL de Hoon. Bioinformatics, 2009. doi:10.1093/bioinformatics/btp163
[SciKit]Scikit-learn: Machine Learning in Python. Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot and Édouard Duchesnay. The Journal of Machine Learning Research, 2011. doi:
[DRMAA]Distributed resource management application API Version 2 (DRMAA). P Tröger, R Brobst, D Gruber, M Mamonski and D Templeton. Open Grid Forum, 2012. doi:
[Apollo]Web Apollo: a web-based genomic annotation editing platform Eduardo Lee, Gregg A Helt, Justin T Reese, Monica C Munoz-Torres, Chris P Childers, Robert M Buels, Lincoln Stein, Ian H Holmes, Christine G Elsik and Suzanna E Lewis. Genome Biology, 2013, doi:10.1186/gb-2013-14-8-r93
[Rampart]RAMPART: a workflow management system for de novo genome assembly, Daniel Mapleson, Nizar Drou and David Swarbreck. Bioinformatics, 2015. doi:10.1093/bioinformatics/btv056
[Uniprot]UniProt: a hub for protein information The UniProt Consortium. Nucleic Acid Research, 2014. doi:10.1093/nar/gku989
[Portcullis]http://www.github.com/maplesond/portcullis/
[Oases]Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels Marcel H Schulz, Daniel R Zerbino, Martin Vingron and Ewan Birney. Bioinformatics, 2012. doi: 10.1093/bioinformatics/bts094
[Bridger]Bridger: a new framework for de novo transcriptome assembly using RNA-seq data Zheng Chang, Guojun Li, Juntao Liu, Cody Ashby, Deli Liu, Carole L Cramer and Xiuzhen Huang. Genome Biology, 2015. doi:10.1186/s13059-015-0596-2
[EviGeneTobacco]Combining Transcriptome Assemblies from Multiple De Novo Assemblers in the Allo-Tetraploid Plant Nicotiana benthamiana Kenlee Nakasugi, Ross Crowhurst, Julia Bally, Peter Waterhouse. PLoS ONE, 2014. doi:10.1371/journal.pone.0091776
[TransAbyss]De novo assembly and analysis of RNA-seq data Gordon Robertson et al., Nature Methods, 2010. doi:10.1038/nmeth.1517
[Transrate]TransRate: reference-free quality assessment of de novo transcriptome assemblies Richard Smith-Unna, Chris Boursnell, Rob Patro, Julian M. Hibberd and Steven Kelly. Genome Research, 2016. doi:10.1101/gr.196469.115
[RSEMeval]Evaluation of de novo transcriptome assemblies from RNA-Seq data Bo Li, Nathanael Fillmore, Yongsheng Bai, Mike Collins, James A Thomson, Ron Stewart and Colin N Dewey. Genome Biology, 2014. doi:10.1186/s13059-014-0553-5
[Maker]MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects Carson Holt and Mark Yandell. BMC Bioinformatics, 2011. doi:10.1186/1471-2105-12-491
[EVM]Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments Brian J Haas, Steven L Salzberg, Wei Zhu, Mihaela Pertea, Jonathan E Allen, Joshua Orvis, Owen White, C Robin Buell and Jennifer R Wortman. Genome Biology, 2008. doi:10.1186/gb-2008-9-1-r7
[Augustus]WebAUGUSTUS—a web service for training AUGUSTUS and predicting genes in eukaryotes Katharina J. Hoff and Mario Stanke. Nucleic Acid Research, 2013. doi:10.1093/nar/gkt418
[Maker2]MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects Carson Holt and Mark Yandell. BMC Bioinformatics, 2011. doi:10.1186/1471-2105-12-491
[PYinterval]https://github.com/chaimleib/intervaltree
[BXPython]https://bitbucket.org/james_taylor/bx-python/overview
[Snakeviz]https://jiffyclub.github.io/snakeviz/
