
Mikado: pick your transcript


Authors: Venturini Luca, Yanes Luis, Caim Shabhonam, Mapleson Daniel, Kaithakottil Gemy George, Swarbreck David
Version: Mikado v2.3.2 (11 January 2022)

Mikado is a lightweight Python3 pipeline to identify the most useful or “best” set of transcripts from multiple transcript assemblies. Our approach leverages transcript assemblies generated by multiple methods to define expressed loci, assign a representative transcript to each, and return a set of gene models that selects against transcripts that are chimeric, fragmented, or have a short or disrupted CDS. Loci are first defined based on overlap criteria, and each transcript therein is scored based on up to 50 available metrics relating to ORF and cDNA size, relative position of the ORF within the transcript, UTR length and presence of multiple ORFs. Mikado can also use BLAST data to score transcripts based on protein similarity and to identify and split chimeric transcripts. Optionally, junction confidence data as provided by Portcullis [Portcullis] can be used to improve the assessment. The best-scoring transcripts are selected as the primary transcripts of their respective gene loci; additionally, Mikado can bring back other valid splice variants that are compatible with the primary isoform.
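The idea of defining loci by overlap and then choosing the best-scoring transcript in each can be illustrated with a minimal sketch. This is not Mikado's implementation: the `Transcript` class, the `toy_score` function and its weights are hypothetical, and only three of the metrics mentioned above (cDNA size, CDS size, number of ORFs) are used. Loci are assumed here to be groups of transcripts on the same chromosome and strand whose coordinates overlap.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Transcript:
    """Hypothetical, simplified transcript record (not a Mikado class)."""
    tid: str
    chrom: str
    strand: str
    start: int      # 1-based, inclusive
    end: int        # inclusive
    cdna_len: int
    cds_len: int
    num_orfs: int


def toy_score(t):
    """Illustrative score: reward a high CDS fraction, penalise extra ORFs."""
    cds_fraction = t.cds_len / t.cdna_len if t.cdna_len else 0.0
    penalty = 0.5 * max(t.num_orfs - 1, 0)
    return cds_fraction - penalty


def group_into_loci(transcripts):
    """Group transcripts that overlap on the same chromosome and strand."""
    ordered = sorted(transcripts, key=lambda t: (t.chrom, t.strand, t.start, t.end))
    loci = []
    for _, group in groupby(ordered, key=lambda t: (t.chrom, t.strand)):
        current, current_end = [], None
        for t in group:
            # A transcript starting past the current locus end opens a new locus.
            if current and t.start > current_end:
                loci.append(current)
                current, current_end = [], None
            current.append(t)
            current_end = t.end if current_end is None else max(current_end, t.end)
        if current:
            loci.append(current)
    return loci


def pick_primary(transcripts):
    """Return the highest-scoring transcript of each locus."""
    return [max(locus, key=toy_score) for locus in group_into_loci(transcripts)]
```

In this sketch a transcript carrying two ORFs (a possible chimera) loses to a shorter single-ORF transcript in the same locus, which mirrors the selection against chimeric models described above in a very simplified way.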

Mikado uses GTF or GFF files as mandatory input. Non-mandatory but highly recommended input data can be generated by obtaining a set of reliable splicing junctions with Portcullis, by locating coding ORFs on the transcripts using Transdecoder, and by obtaining homology information through BLASTX [Blastplus].

Our approach can incorporate sequences generated by de novo Illumina assemblers as well as reads generated by long-read technologies such as PacBio.

Our tool was presented at Genome Science 2016, both as a poster and in a talk during the Bioinformatics showcase session.

Mikado was published in GigaScience in August 2018 [Mikado]. We provide a PDF copy of the open access paper on this website, for reference.

Development is currently active, and Mikado is tightly integrated in an upcoming pipeline for genome annotation refinement, Minos.

Mikado version 2: integrating multiple gene predictions

During the summer of 2019, we finished work on the new version of Mikado. The focus of the work was to make Mikado a software product capable of integrating the results of multiple gene annotations, similarly to [PASA] or [Maker2]. Unlike Maker2, Mikado is not in itself a full annotation pipeline; we are currently working on one such pipeline, which will use Mikado first to clean the transcript assemblies, and then to create a final gene annotation by comparing multiple ab initio annotations together with protein alignments and transcript assemblies or cDNA alignments.

Starting from this version, Mikado is therefore capable of considering arbitrary measures of transcript quality (e.g. transcript quantification or similarity of the predicted ORF against a known protein database); moreover, it is capable of reconciling the structures of the transcripts present in a single locus. This makes it possible, for example, to add an inferred UTR to ab initio predictions using complementary RNAseq data. Through this mechanism, Mikado is also capable of reconstructing the correct ORF of transcripts present only in fragmentary form, as long as there is at least one other transcript in the locus that can provide the missing data. This mechanism is similar to the one implemented in [PASA]. Please see the relevant section in Algorithms for details.
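As a rough illustration of what adding an inferred UTR can mean, the sketch below extends the terminal exons of a CDS-only model to the ends of an RNAseq-derived transcript in the same locus. It is a toy under stated assumptions, not the algorithm Mikado uses: the function names are hypothetical, exons are plain `(start, end)` tuples, and extension is only allowed when both transcripts are multi-exonic and share exactly the same intron chain.

```python
def introns(exons):
    """Return the intron chain implied by a sorted list of (start, end) exons."""
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def pad_with_template(model_exons, template_exons):
    """Extend the terminal exons of a model to the ends of a template.

    Assumption of this sketch (not Mikado's actual rule): the model is
    returned unchanged unless both transcripts are multi-exonic and have
    identical intron chains.
    """
    model = sorted(model_exons)
    template = sorted(template_exons)
    model_introns = introns(model)
    if not model_introns or model_introns != introns(template):
        return model
    padded = list(model)
    # Only ever extend outwards; never shorten the original model.
    padded[0] = (min(model[0][0], template[0][0]), padded[0][1])
    padded[-1] = (padded[-1][0], max(model[-1][1], template[-1][1]))
    return padded
```

With a model `[(200, 300), (400, 500), (600, 650)]` and a template `[(100, 300), (400, 500), (600, 800)]`, the internal structure is identical, so only the first and last exons are extended. The real procedure has to handle many more cases; the Algorithms section is the authoritative description.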

Citing

If you use Mikado in your work, please consider citing:

Venturini L., Caim S., Kaithakottil G., Mapleson D.L., Swarbreck D. Leveraging multiple transcriptome assembly methods for improved gene structure annotation. GigaScience, Volume 7, Issue 8, 1 August 2018, giy093, doi:10.1093/gigascience/giy093

If you also use Portcullis to provide reliable junctions to Mikado, either independently or as part of the Daijin pipeline, please consider citing:

Mapleson D.L., Venturini L., Kaithakottil G., Swarbreck D. Efficient and accurate detection of splice junctions from RNAseq with Portcullis. GigaScience, Volume 7, Issue 12, 12 December 2018, giy131, doi:10.1093/gigascience/giy131

Availability and License

The open source code is available on GitHub: https://github.com/EI-CoreBioinformatics/mikado

For Linux and OSX (the latter only since v2.2.3) we also provide installation through Conda: https://anaconda.org/bioconda/mikado

Please report any issue you might encounter to the EI-CoreBioinformatics issue tracker.

This documentation is hosted publicly on Read the Docs: https://mikado.readthedocs.org/en/latest/

Mikado is available under the GNU LGPL v3.

Acknowledgements

Mikado has greatly benefited from public libraries, in particular [Cython], the [NetworkX] library, Scipy, Numpy and Pandas ([Scipy], [Numpy], [Pandas]), BioPython [BioPython], Intervaltree [PYinterval], and the BX library for a Cython implementation of interval trees [BXPython]. Moreover, Mikado makes liberal use of the PySAM [PySAM] library for analysing SAM/BAM files as well as for working with FASTA files. Mikado has also been constantly optimised using Snakeviz [Snakeviz], a tool which proved invaluable during the development process.

Credits

  • Luca Venturini (The software architect and developer)
  • Shabhonam Caim (Primary tester and analytic consultancy)
  • Daniel Mapleson (Developer of Portcullis and of the Daijin pipeline)
  • Luis Yanes (Software developer)
  • Gemy Kaithakottil (Tester and analytic consultancy)
  • David Swarbreck (Annotation guru and ideator of the pipeline)

Contents