
Mikado: pick your transcript¶
Authors: Luca Venturini, Shabhonam Caim, Daniel Mapleson, Gemy George Kaithakottil, David Swarbreck
Version: 1.0.1 (April 2017)
Mikado is a lightweight Python3 pipeline to identify the most useful or “best” set of transcripts from multiple transcript assemblies. Our approach leverages transcript assemblies generated by multiple methods to define expressed loci, assign a representative transcript and return a set of gene models that selects against transcripts that are chimeric, fragmented or with short or disrupted CDS. Loci are first defined based on overlap criteria and each transcript therein is scored based on up to 50 available metrics relating to ORF and cDNA size, relative position of the ORF within the transcript, UTR length and presence of multiple ORFs. Mikado can also utilise BLAST data to score transcripts based on protein similarity and to identify and split chimeric transcripts. Optionally, junction confidence data as provided by Portcullis [Portcullis] can be used to improve the assessment. The best-scoring transcripts are selected as the primary transcripts of their respective gene loci; additionally, Mikado can bring back other valid splice variants that are compatible with the primary isoform.
Mikado uses GTF or GFF files as mandatory input. Non-mandatory but highly recommended input data can be generated by obtaining a set of reliable splicing junctions with Portcullis, by locating coding ORFs on the transcripts using Transdecoder, and by obtaining homology information through BLASTX [Blastplus].
Our approach can also incorporate sequences generated by de novo Illumina assemblers, or reads generated with long-read technologies such as PacBio.
Our tool was presented at Genome Science 2016, both as a poster and in a talk during the Bioinformatics showcase session.
Citing¶
We are currently working on our paper, and we will be releasing a pre-print shortly. In the meantime, if you use Mikado please reference our github page: https://github.com/lucventurini/mikado
Availability and License¶
Open source code available on github: https://github.com/lucventurini/mikado
This documentation is hosted publicly on Read the Docs: https://mikado.readthedocs.org/en/latest/
Mikado is available under the GNU LGPL v3.
Acknowledgements¶
Mikado has greatly benefitted from several public libraries, in particular [NetworkX], Scipy and Numpy ([Scipy], [Numpy]), Intervaltree [PYinterval], and the BX library for a Cython implementation of interval trees. Mikado has also been constantly optimised using Snakeviz, a tool which proved invaluable during the development process.
Credits¶
- Luca Venturini (The software architect and developer)
- Shabhonam Caim (Primary tester and analytic consultancy)
- Daniel Mapleson (Developer of PortCullis and of the Daijin pipeline)
- Gemy Kaithakottil (Tester and analytic consultancy)
- David Swarbreck (Annotation guru and ideator of the pipeline)
Contents¶
Introduction¶
Numerous algorithms have been proposed to analyse RNA-Seq data, both in terms of aligning the reads to a reference genome ([TopHat2], [STAR], [Hisat]) and of assembling them to infer the sequence and structure of the original molecules present in the sample. The latter phase can be performed either by using alignment data against the reference genome ([Cufflinks], [StringTie], [Class2], [Trinity]) or in the absence of such information ([Trinity], [Oases], [Bridger]). Each of these methods has to contend with numerous sources of variability in the RNA-Seq data:
- alternative splicing events at the same locus
- extremely variable expression and therefore sequencing depth at different loci
- presence of orthologous genes with very similar cDNAs.
- tandem duplicated genes which can be artefactually reconstructed as a single, fused entity
Multiple assessments of these methods have been performed on real and simulated data. In particular, the RGASP consortium promoted a competition among some of the most popular methods in 2012 [RGASP]; the results showed that the accuracy of the methods was dependent on the input data, the expression levels of the genes, and on the species under analysis. A different concern regards the completeness of each of the assembly methods, i.e. whether methods that are not necessarily the best across all gene loci might nonetheless outperform other competitors for specific contigs. Recently, approaches have been proposed to integrate multiple RNA-Seq assemblies by inferring their protein-coding content and collapsing them accordingly [EviGeneTobacco], or to determine the best proposed assembly using multiple measures related to how the transcript compares to the input RNA-Seq reads ([Transrate], [RSEMeval]).
A different but related line of research has focused on how to integrate data coming from multiple samples. While some researchers advocate merging together the input data and assembling it [Trinity], others have developed methods to integrate multiple assemblies into a single coherent annotation ([CuffMerge], Stringtie-merge). Another common approach relies on a third-party tool to integrate multiple assemblies together with other data - typically protein alignments and ab initio predictions - to select the best model. The latter approach has been chosen by some of the most popular genome annotation pipelines of recent years ([EVM], [Maker]).
Our tool, Mikado, contributes to this area of research by proposing a novel algorithm to integrate multiple transcript assemblies, leveraging additional data regarding the position of ORFs (generally derived with Transdecoder [Trinity]), sequence similarity (derived with BLAST [Blastplus]) and reliable splicing junctions (generally derived using Portcullis [Portcullis]). Through the combination of these input data sources, we are capable of identifying artefacts - especially gene fusions and small fragments - and of retrieving the original transcripts in each locus with high accuracy and sensitivity.
An important caveat is that our approach does not look for the best candidate in terms of the input data, as Transrate [Transrate] or RSEM-Eval [RSEMeval] do. Rather, our approach is more similar to EvidentialGene [EviGeneTobacco] as Mikado will try to find the candidates most likely to represent the isoforms which would be annotated as canonical by a human annotator. Biological events such as intron retentions or trans-splicing that might be present in the sample are explicitly selected against, in order to provide a clean annotation of the genome.
Overview¶
Mikado analyses an ensemble of RNA-Seq assemblies by looking for overlapping transcripts based on their genomic position. Such clusters of genes, or superloci, are then analysed to locate the gene loci that will form the final annotation. In order, the steps of the pipeline are as follows:
- In the first step, mikado prepare will combine assemblies from multiple sources into a single coherent, filtered dataset.
- In the second step, mikado serialise will collect data from multiple sources (Portcullis, Transdecoder, BLASTX) and load it into a database for fast integration with the input data.
- In the final step, mikado pick will integrate the prepared transcripts with the serialised additional data to choose the best transcripts.
By far, the most important source of data for Mikado is Transdecoder, as the algorithms for finding and splitting chimeras depend on the detection of the ORFs on the transcript sequences. Please refer to the Algorithms section for more details on how Mikado operates.
Using these multiple data sources, and its distinctive iterative method to identify and disentangle gene loci, Mikado is capable of bringing together methods with very different results, and of interpreting the composition of a locus much as a manual annotator would. This leads to cleaned RNA-Seq annotations that can be used as the basis for downstream analyses, such as e.g. a hint-based ab initio annotation with Augustus [Augustus] or Maker2 [Maker].
Mikado in action: recovering fused gene loci

Using data from the ORFs, Mikado (in red) is capable of identifying and breaking artefactual gene fusions found by the assemblers (in green) in this locus. This allows Mikado to report the correct gene loci, even when no method was capable of retrieving all the loci in isolation, and some of the correct genes were not present at all. Coding sections of Mikado models are in dark red, UTR segments are in pale red. The reference annotation is coloured in blue, with the same colour scheme - dark for coding regions, pale for UTRs.
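The idea behind this splitting can be illustrated with a simplified sketch. This is not Mikado's actual implementation, and the helper below is hypothetical: a transcript carrying two or more non-overlapping ORFs, not joined by a common homology hit, is treated as a candidate chimera.

```python
def should_split(orfs, shared_hit=False):
    """Simplified chimera check (hypothetical helper, not the actual
    Mikado code): a transcript with two or more non-overlapping ORFs,
    not linked by a common BLAST hit, is a candidate for splitting."""
    if shared_hit or len(orfs) < 2:
        return False
    orfs = sorted(orfs)  # ORFs as (start, end) tuples on the cDNA
    # Require that consecutive ORFs do not overlap each other
    return all(prev_end < start
               for (_, prev_end), (start, _) in zip(orfs, orfs[1:]))

# A transcript carrying two separate ORFs is flagged for splitting...
print(should_split([(1, 300), (400, 900)]))   # True
# ...unless a single homology hit spans both ORFs
print(should_split([(1, 300), (400, 900)], shared_hit=True))  # False
```

In the real pipeline, the BLAST evidence is what prevents genuine multi-ORF transcripts from being broken apart needlessly; the sketch models that with the `shared_hit` flag.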
The command-line interface¶
The Mikado suite provides two commands: mikado and daijin. The former provides access to the main functionalities of the suite:
$ mikado --help
usage: Mikado [-h] {configure,prepare,serialise,pick,compare,util} ...
Mikado is a program to analyse RNA-Seq data and determine the best transcript
for each locus in accordance to user-specified criteria.
optional arguments:
-h, --help show this help message and exit
Components:
{configure,prepare,serialise,pick,compare,util}
These are the various components of Mikado:
configure This utility guides the user through the process of
creating a configuration file for Mikado.
prepare Mikado prepare analyses an input GTF file and prepares
it for the picking analysis by sorting its transcripts
and performing some simple consistency checks.
serialise Mikado serialise creates the database used by the pick
program. It handles Junction and ORF BED12 files as
well as BLAST XML results.
pick Mikado pick analyses a sorted GTF/GFF files in order
to identify its loci and choose the best transcripts
according to user-specified criteria. It is dependent
on files produced by the "prepare" and "serialise"
components.
compare Mikado compare produces a detailed comparison of
reference and prediction files. It has been directly
inspired by Cufflinks's cuffcompare and ParsEval.
util Miscellaneous utilities
Each of these subcommands is explained in detail in the Usage section.
daijin instead provides the interface to the Daijin pipeline manager, which handles the task of going from a dataset of multiple reads to the final Mikado picking. This is its interface:
$ daijin --help
usage: A Directed Acyclic pipeline for gene model reconstruction from RNA seq data.
Basically, a pipeline for driving Mikado. It will first align RNAseq reads against
a genome using multiple tools, then creates transcript assemblies using multiple tools,
and find junctions in the alignments using Portcullis.
This input is then passed into Mikado.
[-h] {configure,assemble,mikado} ...
optional arguments:
-h, --help show this help message and exit
Pipelines:
{configure,assemble,mikado}
These are the pipelines that can be executed via
daijin.
configure Creates the configuration files for Daijin execution.
assemble A pipeline that generates a variety of transcript
assemblies using various aligners and assemblers, as
well a producing a configuration file suitable for
driving Mikado.
mikado Using a supplied configuration file that describes all
input assemblies to use, it runs the Mikado pipeline,
including prepare, BLAST, transdecoder, serialise and
pick.
Installation¶
System requirements¶
Mikado requires CPython 3.4 or later to function (Python 2 is not supported). Additionally, it requires a functioning installation of one among SQLite, PostgreSQL and MySQL. Mikado also has the following Python dependencies:
wheel>=0.28.0
pyyaml
jsonschema
cython>=0.25
numpy
networkx>=1.10
sqlalchemy>=1
sqlalchemy_utils
biopython>=1.66
intervaltree
nose
pyfaidx
scikit-learn>=0.17.0
scipy>=0.15.0
frozendict
python-magic
drmaa
snakemake
docutils!=0.13.1
tabulate
ujson
Mikado has modest system requirements, and is capable of analysing human data with less than 4GB of RAM at all stages. It benefits from the availability of multiple processors, as many of the steps of the pipeline are parallelised.
Mikado is at its core a data integrator. The Daijin pipeline has been written to facilitate the creation of a functioning workflow. To function, it requires Snakemake [Snake] and the presence of at least one supported RNA-Seq aligner and one supported transcript assembler. If you are planning to execute it on a cluster, we do support job submission on SLURM, LSF and PBS clusters, either in the presence or absence of DRMAA.
Download¶
Mikado is available on PyPI, so it is possible to install it with
pip3 install mikado
The source for the latest release on PyPI can be obtained with
pip3 download mikado
As the package contains some core Cython components, it might be necessary to download and compile the source code instead of relying on the wheel.
Alternatively, Mikado can be installed from source by obtaining it from our GitHub repository. Either download the latest release, or clone the latest stable development snapshot with
git clone https://github.com/lucventurini/mikado.git; cd mikado
If you desire, the unstable development version can be obtained with the command
git checkout development
in the Git directory. Please note that at the time of this writing development proceeds quite rapidly.
Building and installing from source¶
If you desire to install Mikado from source, this can be achieved with
python3 setup.py bdist_wheel
Followed by
pip3 install dist/*whl
Note: if you want to update your installation of Mikado, the command to be executed is
pip3 install -U dist/*whl
Testing the installed module¶
It is possible to test whether Mikado has been built successfully by opening a Python3 interactive session and typing:
>>> import Mikado
>>> Mikado.test()
This will run all the tests included in the suite. Although code coverage is not perfect yet, it is at 70% for the whole package and over 80% for the core components.
Tutorial¶
This tutorial will guide you through a simple analysis of Mikado, using a small amount of data from an experiment on Arabidopsis thaliana. RNA-Seq data was obtained from study PRJEB7093 on ENA, aligned with STAR [STAR] against the TAIR10 reference genome, and assembled with four different programs. For this example, we are going to focus on a small genomic region of Chr5, from 26,575,364 to 26,614,205.
During this analysis, you will require the following files:
- chr5.fas.gz: a FASTA file containing Chr5 of A. thaliana
- reference.gff3: a GFF3 file with the annotation of the genomic slice we are interested in, for comparison purposes
- junctions.bed: a BED12 file of reliable splicing junctions in the region, identified using Portcullis [Portcullis]
- class.gtf: a GTF file of transcripts assembled using CLASS [Class2]
- cufflinks.gtf: a GTF file of transcripts assembled using Cufflinks [Cufflinks]
- stringtie.gtf: a GTF file of transcripts assembled using Stringtie [StringTie]
- trinity.gff3: a GFF3 file of transcripts assembled using Trinity [Trinity] and aligned using GMAP [GMAP]
- orfs.bed: a BED12 file containing the ORFs of the above transcripts, derived using TransDecoder [Trinity]
- uniprot_sprot_plants.fasta.gz: a FASTA file containing the plant proteins released with SwissProt [Uniprot]
All of this data can also be found in the sample_data directory of the Mikado source.
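Two of the inputs above (junctions.bed and orfs.bed) are in BED12 format. As a refresher, the sketch below parses a standard BED12 line into absolute block coordinates; the junction record itself is made up for illustration:

```python
def parse_bed12(line):
    """Parse a standard BED12 line (the format of junctions.bed and
    orfs.bed) into a small dictionary with absolute block coordinates."""
    f = line.rstrip("\n").split("\t")
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    return {
        "chrom": f[0], "start": int(f[1]), "end": int(f[2]),
        "name": f[3], "strand": f[5],
        # Blocks are stored relative to the feature start; convert
        # them to absolute (start, end) genomic coordinates.
        "blocks": [(int(f[1]) + s, int(f[1]) + s + l)
                   for s, l in zip(starts, sizes)],
    }

# A made-up junction record on the tutorial's region of Chr5
line = ("Chr5\t26575364\t26575864\tjunction_1\t20\t+"
        "\t26575364\t26575864\t0\t2\t100,100\t0,400")
feature = parse_bed12(line)
print(feature["blocks"])  # [(26575364, 26575464), (26575764, 26575864)]
```

For a junction, the gap between the two blocks corresponds to the intron.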
You will also require the following software:
- a functioning installation of SQLite.
- a functioning version of BLAST+ [Blastplus].
Creating the configuration file for Mikado¶
In the first step, we need to create a configuration file to drive Mikado. To do so, we will first create a tab-delimited file describing our assemblies (class.gtf, cufflinks.gtf, stringtie.gtf, trinity.gff3):
class.gtf cl True
cufflinks.gtf cuff True
stringtie.gtf st True 1
trinity.gff3 tr False -0.5
In this file, the fields define the following:
- The file location and name (if no folder is specified, Mikado will look for each file in the current working directory)
- An alias associated with the file, which has to be unique
- A binary flag (True/False) indicating whether the assembly is strand-specific or not
- An optional fourth field defining a score associated with that sample. All transcripts associated with the label will have their score adjusted by this value. For example, in this file all Stringtie models will receive an additional point, and all Trinity models will be penalised by half a point. Class and Cufflinks have no special bonus or malus associated with them.
Then, we will decompress the chromosome FASTA file:
gunzip chr5.fas.gz # This will create the file chr5.fas
Finally, we will create the configuration file itself using mikado configure:
mikado configure --list list.txt --reference chr5.fas --mode permissive --scoring plants.yaml --copy-scoring plants.yaml --junctions junctions.bed -bt uniprot_sprot_plants.fasta configuration.yaml
This will create a configuration.yaml file with the parameters that were specified on the command line. This is a simplified configuration file, containing all the necessary parameters for the Mikado run. It will also copy the plants.yaml file from the Mikado installation to your current working directory.
The parameters we used for the command line instruct Mikado in the following ways:
- --list list.txt: instructs Mikado to read the file we just created, to understand where the input files are and how to treat them.
- --reference chr5.fas: instructs Mikado on the location of the genome file.
- --mode permissive: the mode in which Mikado will treat cases of chimeras. See the documentation for details.
- --junctions junctions.bed: instructs Mikado to consider this file as the source of reliable splicing junctions.
- -bt uniprot_sprot_plants.fasta: instructs Mikado to consider this file as the BLAST database which will be used for deriving homology information.
Hint
The --copy-scoring argument is usually not necessary; however, it allows you to easily inspect the scoring file we are going to use during this run.
Mikado prepare¶
The subsequent step involves running mikado prepare to create a sorted, non-redundant GTF with all the input assemblies. As we have already created a configuration file with all the details regarding the input files, we only need to issue the command:
mikado prepare --json-conf configuration.yaml
This command will create three files:
mikado_prepared.gtf: one of the two main output files. This is a sorted, non-redundant GTF containing the transcripts from the four input assemblies
mikado_prepared.fasta: a FASTA file of the transcripts present in mikado_prepared.gtf.
prepare.log: the log of this step. This should look like the following, minus the timestamps:
2016-08-10 13:53:58,443 - prepare - prepare.py:67 - INFO - setup - MainProcess - Command line: /usr/users/ga002/venturil/py351/bin/mikado prepare --json-conf configuration.yaml
2016-08-10 13:53:58,967 - prepare - prepare.py:206 - INFO - perform_check - MainProcess - Finished to analyse 95 transcripts (93 retained)
2016-08-10 13:53:58,967 - prepare - prepare.py:405 - INFO - prepare - MainProcess - Finished
At the end of this phase, you should have 93 candidate transcripts, as 2 were redundant.
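The redundancy removal performed here can be pictured with a simplified sketch based on identical exon structures; the actual criteria used by mikado prepare are described in its documentation, and the transcripts below are invented for illustration:

```python
def remove_redundant(transcripts):
    """Keep one representative per identical structure, where a
    structure is (chromosome, strand, exon coordinates).
    A simplified illustration of prepare-style deduplication."""
    seen = {}
    for tid, chrom, strand, exons in transcripts:
        key = (chrom, strand, tuple(sorted(exons)))
        seen.setdefault(key, tid)   # the first transcript seen wins
    return sorted(seen.values())

transcripts = [
    ("cl_1", "Chr5", "+", [(100, 200), (300, 400)]),
    ("st_1", "Chr5", "+", [(300, 400), (100, 200)]),  # same model, kept once
    ("tr_1", "Chr5", "-", [(100, 200), (300, 400)]),  # different strand: kept
]
print(remove_redundant(transcripts))  # ['cl_1', 'tr_1']
```

In the tutorial run, this is how 95 input transcripts become 93 retained ones.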
BLAST of the candidate transcripts¶
Although it is not strictly necessary, Mikado benefits from integrating homology data from BLAST. Mikado requires this data to be provided either in XML or ASN format (in the latter case, blast_formatter will be used to convert it in-memory to XML).
To create this file, we will proceed as follows:
Uncompress the SwissProt database:
gzip -dc uniprot_sprot_plants.fasta.gz > uniprot_sprot_plants.fasta
Prepare the database for the BLAST:
makeblastdb -in uniprot_sprot_plants.fasta -dbtype prot -parse_seqids > blast_prepare.log
Execute the BLAST, asking for XML output, and compress it to limit space usage.
blastx -max_target_seqs 5 -num_threads 10 -query mikado_prepared.fasta -outfmt 5 -db uniprot_sprot_plants.fasta -evalue 0.000001 2> blast.log | sed '/^$/d' | gzip -c - > mikado.blast.xml.gz
This will produce the mikado.blast.xml.gz file, which contains the homology information for the run.
Mikado serialise¶
This step involves running mikado serialise to create a SQLite database with all the information that Mikado needs to perform its analysis. As most of the parameters are already specified inside the configuration file, the command line is quite simple:
mikado serialise --json-conf configuration.yaml --xml mikado.blast.xml.gz --orfs orfs.bed --blast_targets uniprot_sprot_plants.fasta
After mikado serialise has run, it will have created two files:
- mikado.db, the SQLite database that will be used by pick to perform its analysis
- serialise.log, the log of the run
If you inspect the SQLite database mikado.db, you will see it contains seven different tables:
$ sqlite3 mikado.db
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
chrom hit hsp junctions orf query target
These tables contain the information coming, in order, from the genome FAI, the BLAST XML, the junctions BED file, the ORFs BED file, and finally the input transcripts and the proteins. For more details on the database structure, please refer to the section on this step in this documentation.
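The same table listing can be obtained programmatically with Python's sqlite3 module. The sketch below uses an in-memory database with mock tables as an illustrative stand-in for mikado.db:

```python
import sqlite3

def list_tables(conn):
    """Return the table names of an open SQLite connection,
    like the sqlite3 shell's .tables command."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    return [name for (name,) in rows]

# Demonstration on an in-memory database with two mock tables;
# on the tutorial's mikado.db you would instead see chrom, hit,
# hsp, junctions, orf, query and target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE junctions (chrom TEXT, start INT, stop INT)")
conn.execute("CREATE TABLE orf (query_id INT, orf_name TEXT)")
print(list_tables(conn))  # ['junctions', 'orf']
```

Calling list_tables(sqlite3.connect("mikado.db")) would give the same view of the real database.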
Mikado pick¶
Finally, during this step mikado pick will integrate the data present in the database with the positional and structural data present in the GTF file to select the best transcript models. The command line to be issued is the following:
mikado pick --json-conf configuration.yaml --subloci_out mikado.subloci.gff3
At this step, we have to specify only some parameters for pick to function:
- --json-conf: the configuration file. This is the only compulsory option.
- --subloci_out: the partial results concerning the subloci step of the selection process will be written to mikado.subloci.gff3.
mikado pick will produce the following output files:
- mikado.loci.gff3, mikado.loci.metrics.tsv, mikado.loci.scores.tsv: the proper output files. These contain the location of the selected transcripts, their metrics, and their scores. Please see this section for details.
- mikado.subloci.gff3, mikado.subloci.metrics.tsv, mikado.subloci.scores.tsv: these files contain the same type of information as those above, but for the subloci stage. As such, all the transcripts in the input files are represented, not just those that are going to be selected as the best.
- mikado_pick.log: the log file for this operation.
Comparing files with the reference¶
Finally, we can compare our files to the original reference annotation and see how our results compare to it. To do so, we will use Mikado compare. The first step is to index the reference annotation to make the comparisons faster:
mikado compare -r reference.gff3 --index
This will create a new file, reference.gff3.midx. If you inspect it with e.g. zless, you will notice it is in essence a compressed JSON file, describing the locations and components of each gene in the annotation. Now that we have indexed the reference, we can perform the comparisons we are interested in:
- Reference vs. the input transcripts:
mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log;
- Reference vs. the subloci stage:
mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log;
- Reference vs the final output:
mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log;
Each of these comparisons will produce three files:
- a tmap file, detailing the best match in the reference for each of the query transcripts;
- a refmap file, detailing the best match among the query transcripts for each of the reference transcripts;
- a stats file, summarising the comparisons.
The stats file for the input GTF should look like this:
Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log
7 reference RNAs in 5 genes
93 predicted RNAs in 64 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 95.97 29.39 45.00
Exon level (stringent): 68.09 18.60 29.22
Exon level (lenient): 90.91 31.25 46.51
Intron level: 94.74 45.57 61.54
Intron chain level: 16.67 1.59 2.90
Transcript level (stringent): 0.00 0.00 0.00
Transcript level (>=95% base F1): 14.29 1.08 2.00
Transcript level (>=80% base F1): 14.29 1.08 2.00
Gene level (100% base F1): 0.00 0.00 0.00
Gene level (>=95% base F1): 20.00 1.56 2.90
Gene level (>=80% base F1): 20.00 1.56 2.90
# Matching: in prediction; matched: in reference.
Matching intron chains: 1
Matched intron chains: 1
Matching monoexonic transcripts: 0
Matched monoexonic transcripts: 0
Total matching transcripts: 1
Total matched transcripts: 1
Missed exons (stringent): 15/47 (31.91%)
Novel exons (stringent): 140/172 (81.40%)
Missed exons (lenient): 4/44 (9.09%)
Novel exons (lenient): 88/128 (68.75%)
Missed introns: 2/38 (5.26%)
Novel introns: 43/79 (54.43%)
Missed transcripts: 0/7 (0.00%)
Novel transcripts: 24/93 (25.81%)
Missed genes: 0/5 (0.00%)
Novel genes: 21/64 (32.81%)
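The Sn (sensitivity), Pr (precision) and F1 columns in these reports are related by the standard harmonic mean, which can be checked against the figures above. This is a generic helper, not part of Mikado:

```python
def f1(sensitivity, precision):
    """F1 score: the harmonic mean of sensitivity and precision."""
    if sensitivity + precision == 0:
        return 0.0
    return 2 * sensitivity * precision / (sensitivity + precision)

# Base level of the input comparison: Sn 95.97, Pr 29.39 -> F1 45.00
print(round(f1(95.97, 29.39), 2))  # 45.0
```

The same relation holds for every Sn/Pr/F1 row in the stats files, e.g. the intron level (94.74, 45.57 -> 61.54).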
For the subloci file, where we still have all the transcripts but we have split obvious chimeras, it should look like this:
Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log
7 reference RNAs in 5 genes
105 predicted RNAs in 26 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 95.96 29.24 44.83
Exon level (stringent): 70.21 19.08 30.00
Exon level (lenient): 88.89 32.00 47.06
Intron level: 94.74 46.75 62.61
Intron chain level: 33.33 3.17 5.80
Transcript level (stringent): 0.00 0.00 0.00
Transcript level (>=95% base F1): 28.57 9.52 14.29
Transcript level (>=80% base F1): 42.86 11.43 18.05
Gene level (100% base F1): 0.00 0.00 0.00
Gene level (>=95% base F1): 40.00 7.69 12.90
Gene level (>=80% base F1): 60.00 11.54 19.35
# Matching: in prediction; matched: in reference.
Matching intron chains: 3
Matched intron chains: 2
Matching monoexonic transcripts: 9
Matched monoexonic transcripts: 1
Total matching transcripts: 12
Total matched transcripts: 3
Missed exons (stringent): 14/47 (29.79%)
Novel exons (stringent): 140/173 (80.92%)
Missed exons (lenient): 5/45 (11.11%)
Novel exons (lenient): 85/125 (68.00%)
Missed introns: 2/38 (5.26%)
Novel introns: 41/77 (53.25%)
Missed transcripts: 0/7 (0.00%)
Novel transcripts: 24/105 (22.86%)
Missed genes: 0/5 (0.00%)
Novel genes: 13/26 (50.00%)
A marked improvement can already be seen - we now have 105 transcripts instead of 93, and the total number of matching transcripts has increased from 1 to 3. Precision is still poor, however, as we have not discarded any transcript yet. Moreover, we have redundancy - 9 transcripts match the same monoexonic gene, and 3 transcripts match 2 intron chains in the reference.
Finally, the comparison against the proper output (mikado.loci.gff3) should look like this:
Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log
7 reference RNAs in 5 genes
15 predicted RNAs in 8 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 85.74 64.73 73.77
Exon level (stringent): 63.83 42.86 51.28
Exon level (lenient): 80.00 52.94 63.72
Intron level: 89.47 59.65 71.58
Intron chain level: 33.33 14.29 20.00
Transcript level (stringent): 0.00 0.00 0.00
Transcript level (>=95% base F1): 28.57 13.33 18.18
Transcript level (>=80% base F1): 42.86 20.00 27.27
Gene level (100% base F1): 0.00 0.00 0.00
Gene level (>=95% base F1): 40.00 25.00 30.77
Gene level (>=80% base F1): 60.00 37.50 46.15
# Matching: in prediction; matched: in reference.
Matching intron chains: 2
Matched intron chains: 2
Matching monoexonic transcripts: 1
Matched monoexonic transcripts: 1
Total matching transcripts: 3
Total matched transcripts: 3
Missed exons (stringent): 17/47 (36.17%)
Novel exons (stringent): 40/70 (57.14%)
Missed exons (lenient): 9/45 (20.00%)
Novel exons (lenient): 32/68 (47.06%)
Missed introns: 4/38 (10.53%)
Novel introns: 23/57 (40.35%)
Missed transcripts: 0/7 (0.00%)
Novel transcripts: 6/15 (40.00%)
Missed genes: 0/5 (0.00%)
Novel genes: 2/8 (25.00%)
After selecting the best transcripts in each locus, Mikado has discarded most of the incorrect transcripts while retaining most of the correct information; this can be seen in the increase in precision at e.g. the base level (from approximately 29% to 65%). The number of genes has also decreased, as Mikado has discarded many loci whose transcripts are just UTR fragments of neighbouring correct genes.
Analysing the tutorial data with Snakemake¶
The workflow described in this tutorial can be executed automatically using Snakemake [Snake] with this Snakefile. Just execute:
snakemake
in the directory where you have downloaded all of the tutorial files. In graph representation, this is what the pipeline looks like:
Tutorial for Daijin¶
This tutorial will guide you through the task of configuring and running the whole Daijin pipeline on an A. thaliana dataset, using one aligner (HISAT2) and two assemblers (Stringtie and CLASS2) as the chosen methods. A modern desktop computer with a multicore processor and 4GB of RAM or more should suffice to execute the pipeline.
Overview¶
The tutorial will guide you through the following tasks:
- Configuring Daijin to analyse the chosen data
- Running the alignment and assemblies using daijin assemble
- Running the Mikado analysis using daijin mikado
- Comparing the results against the reference transcriptome
Required software¶
Mikado should be installed and configured properly (see our installation instructions). Additionally, you should have the following software tools at your disposal (the versions used at the time of writing are indicated in brackets):
- BLAST+ (v2.3.0)
- TransDecoder (v3.0.0)
- Portcullis (v0.17.2)
- HISAT2 (v2.0.4)
- Stringtie (v1.2.4)
- CLASS2 (v2.12)
- SAMtools (v1.2)
Input data¶
Throughout this tutorial, we will use data coming from release 32 of EnsEMBL and from the PRJEB7093 experiment on ENA. In particular, we will need:
- the genome FASTA file of Arabidopsis thaliana
- its relative genome annotation
- RNA-Seq from one sample, ERR588044, of the PRJEB7093 study: left (ERR588044_1.fastq.gz) and right (ERR588044_2.fastq.gz)
- the SwissProt plants database (downloaded from here, using Uniprot release 2016-07)
Uncompress the genome, the annotation GFF3 and the Uniprot DAT files using gunzip. The protein database must be converted into FASTA format, and we should remove the A. thaliana proteins to avoid biasing our results. To do so, use the script remove_from_embl.py (included in Mikado; please include the quotes in the command, or the script will fail!):
remove_from_embl.py -o "Arabidopsis thaliana" uniprot_sprot_plants.dat uniprot_sprot_plants.fasta
Hint
It is neither necessary nor advisable to remove proteins from the database in a real experiment. We are doing so here only to avoid biasing the Mikado results. You can convert the DAT files to FASTA using the same script, but using “NULL” in place of “Arabidopsis thaliana”.
It is possible to get a feel for the annotation of this species - the size of its genes and transcripts, the average number of exons per transcript, etc. - by using mikado util stats; just issue the following command:
mikado util stats Arabidopsis_thaliana.TAIR10.32.gff3 Arabidopsis_thaliana.TAIR10.32.gff3.stats
The file Arabidopsis_thaliana.TAIR10.32.gff3.stats contains the following information:
Stat | Total | Average | Mode | Min | 1% | 5% | 10% | 25% | Median | 75% | 90% | 95% | 99% | Max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of genes | 33602 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Number of genes (coding) | 27397 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Number of monoexonic genes | 11079 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Transcripts per gene | 41671 | 1.24 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 4 | 10 |
Coding transcripts per gene | 35359 | 1.05 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 4 | 10 |
CDNA lengths | 64867032 | 1,556.65 | 72 | 22 | 73 | 255 | 463 | 835 | 1,361 | 1,963 | 2,835 | 3,614 | 5,517 | 16,347 |
CDNA lengths (mRNAs) | 54248365 | 1,534.22 | 357 | 22 | 144 | 371 | 555 | 897 | 1,383 | 1,919 | 2,664 | 3,268 | 4,826 | 16,347 |
CDS lengths | 43498974 | 1,043.87 | 0 | 0 | 0 | 0 | 0 | 386 | 903 | 1,449 | 2,154 | 2,724 | 4,238 | 16,182 |
CDS lengths (mRNAs) | NA | 1,043.87 | 0 | 0 | 0 | 0 | 0 | 386 | 903 | 1,449 | 2,154 | 2,724 | 4,238 | 16,182 |
CDS/cDNA ratio | NA | 78.44 | 100.0 | 2 | 35 | 48 | 56 | 68 | 80 | 92 | 100 | 100 | 100 | 100 |
Monoexonic transcripts | 11494 | 1,377.97 | 72 | 22 | 71 | 77 | 136 | 465 | 972 | 1,841 | 3,110 | 4,335 | 5,949 | 15,195 |
MonoCDS transcripts | 7367 | 862.50 | 357 | 22 | 99 | 138 | 192 | 357 | 675 | 1,206 | 1,791 | 2,136 | 2,822 | 6,885 |
Exons per transcript | 217183 | 5.21 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 7 | 12 | 15 | 24 | 79 |
Exons per transcript (mRNAs) | 2699 | 5.86 | 1 | 1 | 1 | 1 | 1 | 2 | 4 | 8 | 13 | 16 | 25 | 79 |
Exon lengths | NA | 298.67 | 72 | 1 | 32 | 51 | 63 | 89 | 150 | 314 | 615 | 990 | 2,400 | 15,195 |
Exon lengths (mRNAs) | NA | 261.83 | 72 | 1 | 32 | 51 | 63 | 89 | 147 | 300 | 559 | 857 | 1,736 | 7,761 |
Intron lengths | NA | 165.84 | 87 | 8 | 67 | 74 | 78 | 86 | 100 | 168 | 334 | 461 | 849 | 57,631 |
Intron lengths (mRNAs) | NA | 164.69 | 87 | 8 | 67 | 75 | 78 | 86 | 100 | 168 | 333 | 458 | 838 | 11,602 |
CDS exons per transcript | 2308 | 4.73 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 7 | 11 | 14 | 24 | 78 |
CDS exons per transcript (mRNAs) | 2308 | 5.57 | 1 | 1 | 1 | 1 | 1 | 2 | 4 | 8 | 12 | 15 | 24 | 78 |
CDS exon lengths | 43498974 | 220.92 | 72 | 1 | 19 | 45 | 58 | 81 | 127 | 228 | 458 | 735 | 1,572 | 7,761 |
CDS Intron lengths | 25183403 | 155.90 | 84 | 7 | 66 | 73 | 77 | 84 | 98 | 155 | 307 | 427 | 789 | 10,233 |
5’UTR exon number | 35359 | 0.98 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 3 | 12 |
3’UTR exon number | 35359 | 0.87 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 15 |
5’UTR length | 4119344 | 116.50 | 0 | 0 | 0 | 0 | 0 | 11 | 81 | 163 | 280 | 378 | 609 | 3,214 |
3’UTR length | 6630047 | 187.51 | 0 | 0 | 0 | 0 | 0 | 77 | 181 | 257 | 348 | 434 | 755 | 3,164 |
Stop distance from junction | NA | 7.62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 213 | 2,286 |
Intergenic distances | NA | 1,429.33 | 254;260 | -57,916 | -1,251 | -42 | 68 | 279 | 822 | 1,899 | 3,619 | 4,996 | 8,734 | 72,645 |
Intergenic distances (coding) | NA | 2,163.73 | 187 | -10,136 | -571 | -34 | 70 | 284 | 866 | 2,143 | 4,500 | 6,878 | 20,134 | 490,690 |
From this summary it is quite apparent that the A. thaliana genome preferentially encodes multiexonic transcripts, with a median CDS/cDNA ratio of approximately 80%. Each transcript has on average 5 exons (median 3), and intron lengths are generally short - 165 bps on average and 100 as median, with a maximum value of ~11,600 bps for mRNAs. It is also a compact genome, with a median intergenic distance under 1 kbps and a modal distance of only ~250 bps.
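As a toy illustration of how the columns of such a table relate to the raw values, the summary statistics can be recomputed with a few lines of Python (this is not Mikado's own implementation, and the length values below are made up):

```python
from statistics import median, mode

def summary(values):
    """Recompute a few of the columns reported by `mikado util stats`
    (illustrative only, not the actual implementation)."""
    ordered = sorted(values)

    def percentile(p):
        # nearest-rank percentile over the sorted values
        k = round(p / 100 * (len(ordered) - 1))
        return ordered[k]

    return {
        "Average": sum(values) / len(values),
        "Mode": mode(values),
        "Median": median(values),
        "5%": percentile(5),
        "95%": percentile(95),
        "Max": max(values),
    }

# Made-up intron lengths: short introns dominate, with one long outlier
print(summary([87, 87, 90, 100, 150, 300, 900, 5000]))
```

As in the real table, the mode and median stay small while the maximum is dominated by the outlier.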
Step 1: configuring Daijin¶
The first task is to create a configuration file for Daijin using daijin configure
. On the command line, we have to configure the following:
- name of the configuration file (daijin.yaml)
- number of threads per process (5)
- reference genome and the name of the species (without spaces or non-ASCII/special characters)
- reads to use, with their strandedness and the name of the sample
- aligners (HISAT2) and assemblers (Stringtie, CLASS2) to use
- output directory (athaliana)
- the scoring file to use for Mikado pick; we will ask to copy it in-place to have a look at it (plants.yaml)
- the protein database for homology searches for Mikado (uniprot_sprot_plants.fasta)
- flank: as A. thaliana has a quite compact genome, we should decrease the maximum distance for grouping transcripts into loci. We will decrease it from 1 kbps (the default) to 250 bps.
- (optional) the scheduler to use for the cluster (we will presume SLURM, but we also support LSF and PBS)
- (optional) name of the cluster configuration file, which will have to be edited manually.
This can all be specified with the following command line:
daijin configure \
-o daijin.yaml \
--flank 250 \
--threads 5 \
--genome Arabidopsis_thaliana.TAIR10.dna.toplevel.fa --name "Athaliana" \
-od athaliana \
-r1 ERR588044_1.fastq.gz -r2 ERR588044_2.fastq.gz --strandedness fr-secondstrand -s ERR588044 \
-al hisat -as stringtie class \
--scoring plants.yaml --copy-scoring plants.yaml \
--prot-db uniprot_sprot_plants.fasta \
--scheduler SLURM \
-c slurm.yaml;
This will produce a file, daijin.yaml, which should specify the same parameters as this preformed example
.
If you are executing this program on a cluster, you will have to modify the “load” section, specifying for each software tool the operations needed to make it available to the submitted script (e.g. “source mikado-1.0.0” to make Mikado available). For example, this is how our load section looks after editing:
load:
# Commands to use to load/select the versions of the programs to use. Leave an empty
# string if no loading is necessary.
blast: 'source blast-2.3.0'
class: 'source class-2.12'
cufflinks: ''
gmap: ''
hisat: 'source HISAT-2.0.4'
mikado: 'source mikado-devel'
portcullis: 'source portcullis-0.17.2'
samtools: 'source samtools-1.2'
star: ''
stringtie: 'source stringtie-1.2.4'
tophat: ''
transdecoder: 'source transdecoder-3.0.0'
trinity: ''
Also, if you are operating on a cluster, you might want to edit the cluster configuration file “slurm.yaml”. This is how it appears:
__default__:
threads: 4
memory: 10000
queue: tgac-medium
asm_trinitygg:
memory: 20000
The most important parameter to modify is the queue to use - set it according to the configuration of your own cluster.
Step 2: running the assemble part¶
Now that we have created a proper configuration file, it is time to launch Daijin assemble and inspect the results. Issue the command:
daijin assemble -c slurm.yaml daijin.yaml
After checking that the configuration file is valid, Daijin will start the alignment and assembly of the dataset. On a normal desktop computer, this should take ~2 hours. Before launching the pipeline, you can obtain a graphical representation of the steps with:
daijin assemble -c slurm.yaml --dag daijin.yaml | dot -Tsvg > assemble.svg
You can also ask Daijin to display the steps to be executed, inclusive of their command lines, by issuing the following command:
daijin assemble -c slurm.yaml --dryrun daijin.yaml
When Daijin is finished, have a look inside the folder athaliana/3-assemblies/output/; you will find the following two GTF files:
- class-0-hisat-ERR588044-0.gtf
- stringtie-0-hisat-ERR588044-0.gtf
These are standard GTF files reporting the assembled transcripts for each method. We can get a feel for how they compare with our reference annotation by, again, using mikado util stats. Conveniently, Daijin has already performed this analysis for us, and the files will be present in the same folder:
- class-0-hisat-ERR588044-0.gtf.stats
- stringtie-0-hisat-ERR588044-0.gtf.stats
These are the statistics for our CLASS run:
Stat | Total | Average | Mode | Min | 1% | 5% | 10% | 25% | Median | 75% | 90% | 95% | 99% | Max |
Number of genes | 29029 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Number of genes (coding) | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Number of monoexonic genes | 3476 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Transcripts per gene | 35276 | 1.20 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 4 | 20 |
Coding transcripts per gene | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDNA lengths | 34375468 | 974.47 | 202 | 86 | 140 | 202 | 244 | 387 | 746 | 1,318 | 1,997 | 2,524 | 3,776 | 20,077 |
CDNA lengths (mRNAs) | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS lengths | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDS lengths (mRNAs) | NA | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDS/cDNA ratio | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Monoexonic transcripts | 3476 | 627.98 | 201 | 86 | 100 | 118 | 148 | 225 | 432 | 862 | 1,366 | 1,690 | 2,367 | 6,497 |
MonoCDS transcripts | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Exons per transcript | 166661 | 4.72 | 2 | 1 | 1 | 1 | 2 | 2 | 3 | 6 | 10 | 13 | 19 | 39 |
Exons per transcript (mRNAs) | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Exon lengths | NA | 206.26 | 101 | 2 | 30 | 50 | 62 | 87 | 123 | 215 | 412 | 652 | 1,356 | 16,199 |
Exon lengths (mRNAs) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Intron lengths | NA | 203.58 | 87 | 20 | 68 | 75 | 79 | 86 | 101 | 176 | 353 | 501 | 1,441 | 9,973 |
Intron lengths (mRNAs) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS exons per transcript | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDS exons per transcript (mRNAs) | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS exon lengths | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS Intron lengths | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
5’UTR exon number | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
3’UTR exon number | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
5’UTR length | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
3’UTR length | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Stop distance from junction | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Intergenic distances | NA | 2,489.09 | -200 | -21,156 | -4,688 | -1,088 | -505 | -122 | 600 | 2,526 | 6,343 | 10,431 | 26,002 | 787,832 |
Intergenic distances (coding) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
and these for our Stringtie run:
Stat | Total | Average | Mode | Min | 1% | 5% | 10% | 25% | Median | 75% | 90% | 95% | 99% | Max |
Number of genes | 40566 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Number of genes (coding) | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Number of monoexonic genes | 11898 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Transcripts per gene | 55055 | 1.33 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 5 | 15 |
Coding transcripts per gene | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDNA lengths | 71035478 | 1,290.26 | 200 | 200 | 210 | 248 | 308 | 535 | 1,053 | 1,753 | 2,556 | 3,180 | 4,709 | 23,560 |
CDNA lengths (mRNAs) | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS lengths | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDS lengths (mRNAs) | NA | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDS/cDNA ratio | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Monoexonic transcripts | 12251 | 635.23 | 200 | 200 | 203 | 215 | 229 | 292 | 452 | 794 | 1,295 | 1,660 | 2,502 | 15,487 |
MonoCDS transcripts | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Exons per transcript | 280409 | 5.09 | 1 | 1 | 1 | 1 | 1 | 2 | 4 | 7 | 11 | 15 | 22 | 57 |
Exons per transcript (mRNAs) | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Exon lengths | NA | 253.33 | 72 | 2 | 27 | 49 | 62 | 87 | 145 | 293 | 548 | 811 | 1,585 | 19,610 |
Exon lengths (mRNAs) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Intron lengths | NA | 181.30 | 87 | 20 | 69 | 75 | 79 | 86 | 100 | 167 | 335 | 467 | 967 | 9,958 |
Intron lengths (mRNAs) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS exons per transcript | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CDS exons per transcript (mRNAs) | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS exon lengths | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CDS Intron lengths | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
5’UTR exon number | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
3’UTR exon number | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
5’UTR length | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
3’UTR length | 0.0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Stop distance from junction | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Intergenic distances | NA | 1,153.56 | 54 | -21,384 | -7,051 | -3,616 | -2,547 | -1,094 | 272 | 1,729 | 4,661 | 7,504 | 19,125 | 574,775 |
Intergenic distances (coding) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
From this quick analysis, it looks like both assemblers have reconstructed reasonable transcripts, on average. We can try to gauge how similar either assembly is to known models by using Mikado compare. To do so, issue the following commands:
# First index the reference GFF3
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 --index;
# Compare the CLASS2 assembly against the reference
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/3-assemblies/output/class-0-hisat-ERR588044-0.gtf -o athaliana/3-assemblies/output/class-0-hisat-ERR588044-0.compare -l athaliana/3-assemblies/output/class-0-hisat-ERR588044-0.log
# Compare the Stringtie assembly against the reference
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.gtf -o athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare -l athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare.log
The analysis will produce TMAP, REFMAP and STATS files for each of the two assemblies. This is the report of the STATS file for the CLASS2 assembly:
41671 reference RNAs in 33602 genes
35276 predicted RNAs in 29029 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 40.73 80.36 54.06
Exon level (stringent): 43.80 56.22 49.24
Exon level (lenient): 65.81 74.31 69.80
Intron level: 68.00 86.79 76.25
Intron chain level: 28.37 26.88 27.61
Transcript level (stringent): 0.00 0.00 0.00
Transcript level (>=95% base F1): 2.30 2.70 2.48
Transcript level (>=80% base F1): 16.33 19.19 17.64
Gene level (100% base F1): 0.00 0.00 0.00
Gene level (>=95% base F1): 2.77 3.20 2.97
Gene level (>=80% base F1): 19.43 22.37 20.80
# Matching: in prediction; matched: in reference.
Matching intron chains: 8549
Matched intron chains: 8575
Matching monoexonic transcripts: 870
Matched monoexonic transcripts: 883
Total matching transcripts: 9419
Total matched transcripts: 9458
Missed exons (stringent): 95133/169282 (56.20%)
Novel exons (stringent): 57739/131888 (43.78%)
Missed exons (lenient): 52381/153221 (34.19%)
Novel exons (lenient): 34867/135707 (25.69%)
Missed introns: 40932/127896 (32.00%)
Novel introns: 13234/100198 (13.21%)
Missed transcripts: 15490/41671 (37.17%)
Novel transcripts: 99/35276 (0.28%)
Missed genes: 14849/33602 (44.19%)
Novel genes: 70/29029 (0.24%)
and this instead for Stringtie:
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.gtf -o athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare -l athaliana/3-assemblies/output/stringtie-0-hisat-ERR588044-0.compare.log
41671 reference RNAs in 33602 genes
55055 predicted RNAs in 40566 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 55.21 59.42 57.23
Exon level (stringent): 45.31 38.81 41.81
Exon level (lenient): 68.12 57.04 62.09
Intron level: 72.94 64.23 68.31
Intron chain level: 42.54 29.94 35.15
Transcript level (stringent): 0.00 0.00 0.00
Transcript level (>=95% base F1): 21.26 18.02 19.50
Transcript level (>=80% base F1): 34.77 32.56 33.63
Gene level (100% base F1): 0.00 0.00 0.00
Gene level (>=95% base F1): 24.76 22.49 23.57
Gene level (>=80% base F1): 40.20 40.28 40.24
# Matching: in prediction; matched: in reference.
Matching intron chains: 12817
Matched intron chains: 12854
Matching monoexonic transcripts: 1827
Matched monoexonic transcripts: 1836
Total matching transcripts: 14644
Total matched transcripts: 14690
Missed exons (stringent): 92576/169282 (54.69%)
Novel exons (stringent): 120959/197665 (61.19%)
Missed exons (lenient): 49435/155043 (31.88%)
Novel exons (lenient): 79549/185157 (42.96%)
Missed introns: 34606/127896 (27.06%)
Novel introns: 51958/145248 (35.77%)
Missed transcripts: 12822/41671 (30.77%)
Novel transcripts: 155/55055 (0.28%)
Missed genes: 12288/33602 (36.57%)
Novel genes: 138/40566 (0.34%)
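The Sn (sensitivity), Pr (precision) and F1 columns in these reports are linked by the usual harmonic-mean relationship, which can be checked directly against the values above:

```python
def f1(sn, pr):
    """Harmonic mean of sensitivity and precision, as in Mikado compare reports."""
    return 2 * sn * pr / (sn + pr)

# Base level of the CLASS2 report above: Sn 40.73, Pr 80.36
print(round(f1(40.73, 80.36), 2))  # prints 54.06
```

Note how CLASS2 trades sensitivity for precision at the base level, while Stringtie is more balanced.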
These comparisons suggest the following:
- Most of the transcripts reconstructed by both assemblers are near enough known genes so as not to be categorised as completely novel. Only 70 genes for CLASS2 and 138 genes for Stringtie fall in non-annotated genomic loci.
- CLASS2 has performed worse than Stringtie for this particular species and sample - the number of reconstructed transcripts is lower, and so is the F1 at the transcript and gene level.
- Although the number of novel loci is low for both CLASS2 and Stringtie, the high proportion of missed genes (44.19% and 36.57%, respectively) suggests that this single sample does not provide enough coverage to reconstruct the complete annotation.
Step 3: running the Mikado steps¶
Now that we have created the input assemblies, it is time to run Mikado. First of all, let us have a look at the plants.yaml scoring file, which we have conveniently copied to our current directory:
requirements:
#expression: [(combined_cds_fraction.ncrna or combined_cds_fraction.coding), and, ((exon_num.multi and (cdna_length.multi, or, combined_cds_length.multi) and max_intron_length, and, (min_intron_length and proportion_verified_introns_inlocus), or exon_num.mono and (combined_cds_length.mono or cdna_length.mono))]
expression: [(combined_cds_fraction.ncrna or combined_cds_fraction.coding), and, ((exon_num.multi and (cdna_length.multi, or, combined_cds_length.multi) and max_intron_length, and, min_intron_length and proportion_verified_introns_inlocus ) or (exon_num.mono and (combined_cds_length.mono or cdna_length.mono))) ]
parameters:
combined_cds_fraction.ncrna: {operator: eq, value: 0}
combined_cds_fraction.coding: {operator: gt, value: 0.35}
cdna_length.mono: {operator: gt, value: 200}
cdna_length.multi: {operator: ge, value: 100}
combined_cds_length.mono: {operator: gt, value: 100}
combined_cds_length.multi: {operator: gt, value: 50}
exon_num.mono: {operator: eq, value: 1}
exon_num.multi: {operator: gt, value: 1}
max_intron_length: {operator: le, value: 20000}
min_intron_length: {operator: ge, value: 5}
proportion_verified_introns_inlocus: {operator: gt, value: 0}
as_requirements:
expression: [cdna_length and three_utr_length and five_utr_length and utr_length and suspicious_splicing]
parameters:
cdna_length: {operator: ge, value: 200}
utr_length: {operator: le, value: 2500}
five_utr_length: {operator: le, value: 2500}
three_utr_length: {operator: le, value: 2500}
suspicious_splicing: {operator: ne, value: true}
not_fragmentary:
expression: [((exon_num.multi and (cdna_length.multi or combined_cds_length.multi)), or, (exon_num.mono and combined_cds_length.mono))]
parameters:
is_complete: {operator: eq, value: true}
exon_num.multi: {operator: gt, value: 1}
cdna_length.multi: {operator: ge, value: 200}
combined_cds_length.multi: {operator: gt, value: 150}
exon_num.mono: {operator: eq, value: 1}
combined_cds_length.mono: {operator: gt, value: 200}
# expression: [combined_cds_length]
# parameters:
# combined_cds_length: {operator: gt, value: 300}
scoring:
blast_score: {rescaling: max}
# snowy_blast_score: {rescaling: max}
cdna_length: {rescaling: max}
cds_not_maximal: {rescaling: min}
cds_not_maximal_fraction: {rescaling: min}
# exon_fraction: {rescaling: max}
exon_num: {
rescaling: max,
filter: {
operator: ge,
value: 3}
}
five_utr_length:
filter: {operator: le, value: 2500}
rescaling: target
value: 100
five_utr_num:
filter: {operator: lt, value: 4}
rescaling: target
value: 2
end_distance_from_junction:
filter: {operator: lt, value: 55}
rescaling: min
highest_cds_exon_number: {rescaling: max}
intron_fraction: {rescaling: max}
is_complete: {rescaling: target, value: true}
number_internal_orfs: {rescaling: target, value: 1}
# proportion_verified_introns: {rescaling: max}
non_verified_introns_num: {rescaling: min}
proportion_verified_introns_inlocus: {rescaling: max}
retained_fraction: {rescaling: min}
retained_intron_num: {rescaling: min}
selected_cds_fraction: {rescaling: target, value: 0.8}
selected_cds_intron_fraction: {rescaling: max}
selected_cds_length: {rescaling: max}
selected_cds_num: {rescaling: max}
three_utr_length:
filter: {operator: le, value: 2500}
rescaling: target
value: 200
three_utr_num:
filter: {operator: lt, value: 3}
rescaling: target
value: 1
combined_cds_locus_fraction: {rescaling: max}
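Each entry in the `parameters` sections above pairs a metric (optionally suffixed with a category such as .mono or .multi) with an operator and a value. Conceptually, a requirements block tests every such pair against the metrics of a transcript; a minimal illustration with hypothetical metric values (not Mikado's internal code, which also evaluates the boolean `expression`):

```python
import operator

OPS = {"eq": operator.eq, "ne": operator.ne, "gt": operator.gt,
       "ge": operator.ge, "lt": operator.lt, "le": operator.le}

def check(metrics, parameters):
    """Return True if every parameter passes its operator/value test.
    The suffix after the dot (e.g. cdna_length.mono) names a category,
    not a different metric, so it is stripped before the lookup."""
    return all(OPS[spec["operator"]](metrics[name.split(".")[0]], spec["value"])
               for name, spec in parameters.items())

# Hypothetical transcript tested against the as_requirements block above
metrics = {"cdna_length": 1500, "utr_length": 300, "five_utr_length": 120,
           "three_utr_length": 180, "suspicious_splicing": False}
params = {"cdna_length": {"operator": "ge", "value": 200},
          "utr_length": {"operator": "le", "value": 2500},
          "five_utr_length": {"operator": "le", "value": 2500},
          "three_utr_length": {"operator": "le", "value": 2500},
          "suspicious_splicing": {"operator": "ne", "value": True}}
print(check(metrics, params))  # prints True
```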
With this file, we are telling Mikado to reward transcripts with several exons (“exon_num: {rescaling: max, filter: {operator: ge, value: 3}}”), with a CDS/cDNA ratio of ~80% (“selected_cds_fraction: {rescaling: target, value: 0.8}”) and with good homology to known proteins (“blast_score: {rescaling: max}”). We are also rewarding long cDNAs (“cdna_length: {rescaling: max}”) whose CDS is close to the maximal one (“cds_not_maximal: {rescaling: min}”) and which carry a single ORF (“number_internal_orfs: {rescaling: target, value: 1}”). These targets mirror the reference statistics we computed earlier for A. thaliana, e.g. the median CDS/cDNA ratio of approximately 80%.
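Conceptually, each metric is rescaled according to its `rescaling` mode before the per-metric scores are combined. The following is a simplified sketch of how min/max/target rescaling can work over the transcripts of one locus (an illustration of the idea, not Mikado's actual scoring code, which also applies filters and multipliers):

```python
def rescale(values, mode, target=None):
    """Rescale one metric across the transcripts of a locus to [0, 1].
    Simplified sketch: 'max' rewards the highest value, 'min' the lowest,
    'target' the value closest to the requested target."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scores = []
    for v in values:
        if mode == "max":
            scores.append((v - lo) / (hi - lo))
        elif mode == "min":
            scores.append((hi - v) / (hi - lo))
        elif mode == "target":
            denom = max(abs(hi - target), abs(lo - target))
            scores.append(1 - abs(v - target) / denom)
    return scores

# cdna_length with {rescaling: max}: the longest transcript scores highest
print(rescale([500, 1000, 2000], "max"))
# selected_cds_fraction with {rescaling: target, value: 0.8}
print(rescale([0.5, 0.8, 1.0], "target", target=0.8))
```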
Before launching daijin mikado, we can have a look at the BLAST and TransDecoder sections. If you are using a cluster, it is advisable to increase the number of chunks for BLASTX to a high number, e.g. 100. Please note that we are now going to use a different configuration file, athaliana/mikado.yaml, which Daijin created as the last step of the assembly pipeline.
Issue the command:
daijin mikado -c slurm.yaml athaliana/mikado.yaml
This part of the pipeline should be quicker than the previous stage. Once it has finished, Daijin will have created the final output files in athaliana/5-mikado/pick/. As we requested only the permissive mode, there is a single output - athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3. These are basic statistics on this annotation:
File | genes | monoexonic_genes | transcripts | transcripts_per_gene | transcript_len_mean | monoexonic_transcripts | exons | exons_per_transcript | exon_len_mean |
---|---|---|---|---|---|---|---|---|---|
mikado-permissive.loci.stats | 5633 | 2915 | 6054 | 1.07 | 1642.08 | 3095 | 11850 | 1.96 | 838.92 |
Mikado has created an annotation that is in between those produced by Stringtie and CLASS2. Compared to CLASS2, the Mikado assembly is probably more comprehensive, given the higher number of genes. Half the genes are monoexonic, an improvement compared to CLASS2, and at the same time the number of genes is much lower than in Stringtie, with a higher average cDNA length. These statistics suggest that the Mikado filtered annotation has been able to retain most of the real genes, while discarding many of the fragments present in either assembly. We can verify this by comparing the Mikado results against the reference annotation:
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/mikado_prepared.compare -l athaliana/5-mikado/mikado_prepared.compare.log
A cursory look at the STATS file confirms this impression: the Mikado annotation has removed many of the spurious transcripts and is considerably more precise than either of the input assemblies, while retaining most of their sensitivity:
Command line:
mikado compare -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare -l athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.log
41671 reference RNAs in 33602 genes
30806 predicted RNAs in 24730 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 53.43 87.95 66.48
Exon level (stringent): 45.44 58.77 51.26
Exon level (lenient): 68.71 86.45 76.57
Intron level: 73.13 93.43 82.05
Intron chain level: 43.73 53.48 48.11
Transcript level (stringent): 0.01 0.01 0.01
Transcript level (>=95% base F1): 18.87 25.49 21.69
Transcript level (>=80% base F1): 33.71 45.78 38.83
Gene level (100% base F1): 0.01 0.02 0.01
Gene level (>=95% base F1): 22.36 30.35 25.75
Gene level (>=80% base F1): 39.52 53.95 45.62
# Matching: in prediction; matched: in reference.
Matching intron chains: 13173
Matched intron chains: 13213
Matching monoexonic transcripts: 1702
Matched monoexonic transcripts: 1710
Total matching transcripts: 14875
Total matched transcripts: 14923
Missed exons (stringent): 92353/169282 (54.56%)
Novel exons (stringent): 53962/130891 (41.23%)
Missed exons (lenient): 48471/154914 (31.29%)
Novel exons (lenient): 16682/123125 (13.55%)
Missed introns: 34362/127896 (26.87%)
Novel introns: 6576/100110 (6.57%)
Missed transcripts: 13373/41671 (32.09%)
Novel transcripts: 132/30806 (0.43%)
Missed genes: 12824/33602 (38.16%)
Novel genes: 120/24730 (0.49%)
Moreover, Mikado models have an ORF assigned to them. We can ask Mikado compare to consider only the coding component of transcripts with the following command line:
mikado compare -eu -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu -l athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu.log
The statistics file looks as follows:
Command line:
mikado compare -eu -r Arabidopsis_thaliana.TAIR10.32.gff3 -p athaliana/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu -l athaliana/5-mikado/pick/permissive/mikado-permissive.loci.compare.eu.log
41671 reference RNAs in 33602 genes
30806 predicted RNAs in 24730 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 53.00 93.07 67.54
Exon level (stringent): 60.90 78.70 68.67
Exon level (lenient): 69.57 88.37 77.85
Intron level: 73.44 94.86 82.79
Intron chain level: 46.39 57.65 51.41
Transcript level (stringent): 25.84 34.31 29.48
Transcript level (>=95% base F1): 35.32 47.28 40.43
Transcript level (>=80% base F1): 39.45 53.03 45.25
Gene level (100% base F1): 26.78 36.40 30.86
Gene level (>=95% base F1): 37.63 51.16 43.37
Gene level (>=80% base F1): 42.29 57.54 48.75
# Matching: in prediction; matched: in reference.
Matching intron chains: 13988
Matched intron chains: 14097
Matching monoexonic transcripts: 2687
Matched monoexonic transcripts: 2693
Total matching transcripts: 16675
Total matched transcripts: 16790
Missed exons (stringent): 61443/157150 (39.10%)
Novel exons (stringent): 25902/121609 (21.30%)
Missed exons (lenient): 44733/146988 (30.43%)
Novel exons (lenient): 13456/115711 (11.63%)
Missed introns: 32026/120580 (26.56%)
Novel introns: 4795/93349 (5.14%)
Missed transcripts: 13627/41671 (32.70%)
Novel transcripts: 141/30806 (0.46%)
Missed genes: 13060/33602 (38.87%)
Novel genes: 127/24730 (0.51%)
The similarity is considerably higher, suggesting that for many models the differences between the Mikado annotation and the reference lie in the UTR component.
We suggest visualising the assemblies with one of the many tools currently available, such as WebApollo [Apollo]. Mikado files are GFF3-compliant and can be loaded directly into Apollo or similar tools. GTF files can be converted into proper GFF3 files using the convert utility:
mikado util convert <input gtf> <output GFF3>
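The most visible difference between the two formats is the attribute column; a minimal sketch of that part of the conversion in Python (illustrative only; the real utility also builds proper ID/Parent relationships between features):

```python
def gtf_attributes_to_gff3(attr_field):
    """Rewrite a GTF attribute column (semicolon-separated key "value" pairs)
    into GFF3-style key=value pairs. Illustrative only."""
    pairs = []
    for chunk in attr_field.strip().strip(";").split(";"):
        key, _, value = chunk.strip().partition(" ")
        pairs.append(key + "=" + value.strip('"'))
    return ";".join(pairs)

print(gtf_attributes_to_gff3('gene_id "AT1G01010"; transcript_id "AT1G01010.1";'))
# prints gene_id=AT1G01010;transcript_id=AT1G01010.1
```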
Usage¶
Mikado is composed of four different programs (configure, prepare, serialise, pick) which have to be executed serially to go from an ensemble of different assemblies to the final dataset. In addition to these core programs, Mikado provides a utility to compare annotations, similarly to CuffCompare and ParsEval (compare), and various other minor utilities to perform operations such as extracting regions from a GFF, convert between different gene annotation formats, etc.
Mikado pipeline stages¶
Mikado configure¶
This utility prepares the configuration file that will be used throughout the pipeline stages. While the most important options can be set at runtime through the command line, many algorithmic details can be accessed and modified only through the file produced by this command.
This command will generate a configuration file (in either JSON or YAML format), with the correct configuration for the parameters set on the command line. See the in-depth section on the structure of the configuration file for details.
Command line parameters:
- full: By default, Mikado configure will output a stripped-down configuration file, with only some of the fields explicitly present. Use this flag to show all the available fields.
Hint
If a parameter is not explicitly present, its value will be set to the available default.
- external: an external JSON/YAML file from which Mikado should recover parameters. If something is specified both on the command line and in the external configuration file, command line options take precedence.
- reference: Reference genome FASTA file. Required for the correct functioning of Mikado.
- gff: list of GFFs/GTFs of the assemblies to use for Mikado, comma separated.
- labels: optional list of labels to be assigned to each assembly. The label will be also the “source” field in the output of Mikado prepare.
- strand-specific-assemblies: list of strand specific assemblies among those specified with the gff flag.
- strand-specific: flag. If set, all assemblies will be treated as strand-specific.
- list: in alternative to specifying all the information on the command line, it is possible to give to Mikado a tab-separated file with the following contents:
- Location of the file
- label for the file
- whether that assembly is strand-specific or not (write True/False)
- Optionally, a bonus/malus to be associated to transcripts coming from that assembly.
- -j, json: flag. If present, the output file will be in JSON rather than YAML format.
- mode: in which mode Mikado pick will run. See the relative section for further details.
- scoring: scoring file to be used for selection. At the moment, four configuration files are present:
- plants
- human
- worm
- insects
- copy-scoring: the file to which the selected scoring file will be copied. Useful for customisation for a new species. The configuration file will account for this and list the location of the copied file as the scoring file to be used for the run.
- junctions: Reliable junctions derived from the RNA-Seq alignments. Usually obtained through Portcullis, but we also support eg junctions from Tophat2. This file must be in BED12 format.
- blast_targets: FASTA database of blast targets, for the serialisation stage.
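As an illustration, the tab-separated file passed to the list option could look as follows (file names and labels here are hypothetical; the fourth column, the optional score bonus/malus, can be omitted):

```
class.gtf	cl	True
stringtie.gtf	st	True	1
trinity.gff	tr	False
```

In this example both class.gtf and stringtie.gtf would be treated as strand-specific, and transcripts coming from stringtie.gtf would receive a score bonus of 1.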
Usage:
$ mikado configure --help
usage: Mikado configure [-h] [--full]
[--scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml}]
[--copy-scoring COPY_SCORING] [--strand-specific]
[--no-files | --gff GFF | --list LIST]
[--reference REFERENCE] [--junctions JUNCTIONS]
[-bt BLAST_TARGETS]
[--strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES]
[--labels LABELS] [--external EXTERNAL]
[--mode {nosplit,stringent,lenient,permissive,split}]
[-j]
[out]
This utility guides the user through the process of creating a configuration file for Mikado.
positional arguments:
out
optional arguments:
-h, --help show this help message and exit
--full
--strand-specific Boolean flag indicating whether all the assemblies are strand-specific.
--no-files Remove all files-specific options from the printed configuration file.
Invoking the "--gff" option will disable this flag.
--gff GFF Input GFF/GTF file(s), separated by comma
--list LIST List of the inputs, one by line, in the form:
<file1> <label> <strandedness (true/false)>
--reference REFERENCE
Fasta genomic reference.
--strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
List of strand-specific assemblies among the inputs.
--labels LABELS Labels to attach to the IDs of the transcripts of the input files,
separated by comma.
--external EXTERNAL External configuration file to overwrite/add values from.
Parameters specified on the command line will take precedence over those present in the configuration file.
--mode {nosplit,stringent,lenient,permissive,split}
Mode in which Mikado will treat transcripts with multiple ORFs.
- nosplit: keep the transcripts whole.
- stringent: split multi-orf transcripts if two consecutive ORFs have both BLAST hits
and none of those hits is against the same target.
- lenient: split multi-orf transcripts as in stringent, and additionally, also when
either of the ORFs lacks a BLAST hit (but not both).
- permissive: like lenient, but also split when both ORFs lack BLAST hits
- split: split multi-orf transcripts regardless of what BLAST data is available.
-j, --json Output will be in JSON instead of YAML format.
Options related to the scoring system:
--scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml}
Available scoring files.
--copy-scoring COPY_SCORING
File into which to copy the selected scoring file, for modification.
Options related to the serialisation step:
--junctions JUNCTIONS
-bt BLAST_TARGETS, --blast_targets BLAST_TARGETS
The guide here describes all fields of the configuration file. However, the configuration created by default by mikado configure
is much simpler.
This section deals with the database settings that will be necessary for the serialisation and picking phases of the pipeline. By default, Mikado will use a SQLite database, but it currently also supports MySQL and PostgreSQL through SQLAlchemy. Fields:
- db: name of the database to use. In case the database is SQLite, this will be the database file, otherwise it will be the database name.
- dbtype: one of sqlite, mysql or postgresql.
- dbhost: host where the database is located. Required with MySQL and PostgreSQL.
- dbuser: User of the database. Required with MySQL and PostgreSQL.
- dbpasswd: Database password. Required with MySQL and PostgreSQL.
- dbport: port used to access the database. It defaults to the standard port for the selected database type.
db_settings:
# Settings related to DB connection. Parameters:
# db: the DB to connect to. Required. Default: mikado.db
# dbtype: Type of DB to use. Choices: sqlite, postgresql, mysql. Default: sqlite.
# dbhost: Host of the database. Unused if dbtype is sqlite. Default: localhost
# dbuser: DB user. Default:
# dbpasswd: DB password for the user. Default:
# dbport: Integer. It indicates the default port for the DB.
db: /usr/users/ga002/venturil/workspace/mikado/docs/mikado.db
dbhost: localhost
dbpasswd: ''
dbport: 0
dbtype: sqlite
dbuser: ''
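The db_settings fields map naturally onto a database URL of the kind used by SQLAlchemy. The helper below is illustrative only (it is not part of Mikado's API) and assumes the standard dialect://user:password@host:port/name layout:

```python
# Sketch: turn a db_settings mapping into a database URL string.
# This mirrors the standard SQLAlchemy URL layout; it is not Mikado code.
def db_url(settings):
    if settings["dbtype"] == "sqlite":
        # For SQLite, "db" is the path of the database file.
        return "sqlite:///{db}".format(**settings)
    return "{dbtype}://{dbuser}:{dbpasswd}@{dbhost}:{dbport}/{db}".format(**settings)

sqlite_conf = {"dbtype": "sqlite", "db": "mikado.db"}
mysql_conf = {"dbtype": "mysql", "db": "mikado", "dbhost": "localhost",
              "dbuser": "user", "dbpasswd": "pass", "dbport": 3306}
assert db_url(sqlite_conf) == "sqlite:///mikado.db"
assert db_url(mysql_conf) == "mysql://user:pass@localhost:3306/mikado"
```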
This section of the configuration file deals with the reference genome. It specifies the following fields:
- genome: the genome FASTA file. Required.
- genome_fai: FAI index of the genome. Used by Mikado serialise, it can be inferred if left null.
- transcriptome: optional annotation file for the genome. Mikado currently ignores this field, but it is used by Daijin to guide some of the RNA-Seq assemblies.
reference:
# Options related to the reference genome.
genome: ''
genome_fai: ''
transcriptome: ''
This section of the configuration file deals with the prepare stage of Mikado. It specifies the input files, their labels, and which of them are strand specific. The available fields are the following:
- canonical: this field specifies the splice site donors and acceptors that are considered canonical for the species. By default, Mikado uses the canonical splice site pair (GT/AG) and the two semi-canonical pairs (GC/AG and AT/AC). Type: array of two-element arrays, each composed of two-letter strings.
- keep_redundant: if set to false (default), Mikado will only keep one copy of transcripts that are completely identical.
- lenient: boolean value. If set to false, transcripts that either have only non-canonical splice sites or have a mixture of canonical junctions on both strands will be removed from the output. Otherwise, they will be left in, properly tagged.
- minimum_length: minimum length of the transcripts to be kept.
- procs: number of processors to be used.
- strand_specific: boolean. If set to true, all input assemblies will be treated as strand-specific, therefore keeping the strand of monoexonic fragments as it was. Multiexonic transcripts will not have their strand reversed even if doing that would mean making some or all non-canonical junctions canonical.
- strip_cds: boolean. If set to true, the CDS features will be stripped off the input transcripts. This might be necessary for eg transcripts obtained through alignment with GMAP [GMAP].
- files: this sub-section is the most important, as it contains among other things the locations and labels of the input files. Fields:
- gff: array of the input files, in GFF or GTF format. Please note that only CDS/exon/UTR features will be considered from these files.
- labels: optional array of the labels to be assigned to the input files. If non-empty, it must have the same length and order as the gff array, and be composed of unique elements. The labels will be used in two ways: as a prefix for the IDs of the transcripts coming from the corresponding GFF, and as the source field assigned to those transcripts. The latter might be of relevance during the picking stage.
- log: name of the log file.
- out: name of the GTF output file.
- out_fasta: name of the corresponding output FASTA file.
- output_dir: output directory. It will be created if it does not exist already.
- strand_specific_assemblies: array of the names of the GFF/GTF files that are strand-specific. All the file names in this array must also appear in the gff array.
prepare:
# Options related to the input data preparation.
# - files: options relative to the input/output files.
# - procs: Number of processes to use.
# - strip_cds: whether to remove the CDS from the predictions during preparation.
# - lenient: if set to True, invalid transcripts will be only flagged and not removed.
# EXPERIMENTAL.
# - strand_specific: if set to True, transcripts will be assumed to be in the correct
# orientation, no strand flipping or removal
# - strand_specific_assemblies: array of input predictions which are to be considered
# as strand-specific.
# Predictions not in this list will be considered as non-strand-specific.
# - canonical: canonical splice sites, to infer the correct orientation.
canonical:
- - GT
- AG
- - GC
- AG
- - AT
- AC
files:
# Options related to the input and output files.
# - out: output GTF file
# - out_fasta: output transcript FASTA file
# - gff: array of input predictions for this step.
# - log: output log. Default: prepare.log
# - labels: labels to be associated with the input GFFs. Default: None.
gff: []
labels: []
log: prepare.log
out: mikado_prepared.gtf
out_fasta: mikado_prepared.fasta
output_dir: .
strand_specific_assemblies: []
keep_redundant: false
lenient: false
minimum_length: 200
procs: 1
single: false
strand_specific: false
strip_cds: false
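The canonical splice-site check described above can be sketched as follows. This mirrors the configured donor/acceptor pairs, not Mikado's internal implementation:

```python
# Sketch: classify a junction's donor/acceptor pair against the default
# "canonical" setting shown in the configuration above (illustrative only).
CANONICAL = [("GT", "AG"), ("GC", "AG"), ("AT", "AC")]

def junction_status(donor, acceptor):
    """Return whether a donor/acceptor dinucleotide pair is (semi-)canonical."""
    pair = (donor.upper(), acceptor.upper())
    return "canonical" if pair in CANONICAL else "non-canonical"

assert junction_status("gt", "ag") == "canonical"
assert junction_status("CT", "AC") == "non-canonical"
```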
This section of the configuration file deals with the serialisation stage of Mikado. It specifies the location of the ORF BED12 files from TransDecoder, the location of the XML files from BLAST, the location of the Portcullis junctions, and other details important at run time. It has the following fields:
- discard_definition: boolean. This is used to specify whether we will use the ID or the definition of the sequences when parsing BLAST results. This is important when BLAST data might have a mock, local identifier for the sequence (“lcl|1”) rather than its original ID. :warning: Deprecated since v1 beta 10.
- force: whether the database should be truncated and rebuilt, or just updated.
- max_objects: this parameter is quite important when running with a SQLite database. SQLite does not support caching on the disk before committing the changes, so that every change has to be kept in memory. This can become a problem for RAM quite quickly. On the other hand, committing is an expensive operation, and it makes sense to minimise calls as much as possible. This parameter specifies the maximum number of objects Mikado will keep in memory before committing them to the database. The default number, 100,000, should ensure that Mikado runs with less than 1GB memory. Increase it to potentially increase speed at the price of greater memory usage; for example, increasing it to 1,000,000 will cause Mikado to use ~6GB of RAM at its peak usage.
- max_regression: this parameter is a float comprised between 0 and 1. TransDecoder will sometimes output open ORFs even in the presence of an in-frame start codon. Mikado can try to “regress” along the ORF until it finds one such start codon. This parameter imposes how much Mikado will regress, in percentage of the cDNA length.
- max_target_seqs: equivalent to the BLAST+ parameter of the same name - it indicates the maximum number of discrete hits that can be assigned to one sequence in the database.
- procs: number of processors to use. Most important for serialising BLAST+ files.
- single_thread: boolean, if set to true it will forcibly disable multi-threading. Useful mostly for debugging purposes.
- files: this sub-section specifies the location of the input files for serialise. It contains the following fields:
- junctions: array of locations of reliable junction files. These must be in BED12 format.
- log: log file.
- orfs: array of the locations of the ORFs on the cDNAs, as created by eg TransDecoder [Trinity].
- output_dir: output directory where the log file and the SQLite database will be written to (if SQLite has been chosen as the database type)
- transcripts: input transcripts. This should be set to be equal to the output of Mikado prepare, ie the “out_fasta” field of the prepare section of the configuration file.
- xml: this array indicates the location of the BLAST output file. Elements of the array can be:
- BLAST+ XML files (optionally compressed with gzip)
- BLAST+ ASN files (optionally compressed with gzip), which will be converted in-memory using
blast_formatter
- a folder containing files of the above types.
serialise:
# Options related to serialisation
# - force: whether to drop and reload everything into the DB
# - files: options related to input files
# - max_objects: Maximum number of objects to keep in memory while loading data
# into the database
# - max_regression: if the ORF lacks a valid start site, this percentage indicates
# how far
# along the sequence Mikado should look for a good start site. Eg. with a value
# of 0.1,
# on a 300bp sequence with an open ORF Mikado would look for an alternative in-frame
# start codon
# in the first 30 bps (10% of the cDNA).
# - max_target_seqs: equivalently to BLAST, it indicates the maximum number of
# targets to keep
# per blasted sequence.
# - discard_definition: Boolean. Used to indicate whether Mikado should use the
# definition
# rather than the ID for BLAST sequences. Necessary as in some instances BLAST
# XMLs will have
# a mock identifier rather than the original sequence ID (eg lcl|1). Default:
# false.
# - procs: Number of processors to use. Default: 1.
# - single_thread: if true, Mikado prepare will force the usage of a single thread
# in this step.
files:
blast_targets:
- ''
junctions: []
log: serialise.log
orfs:
- ''
output_dir: .
transcripts: mikado_prepared.fasta
xml:
- ''
force: false
max_objects: 100000
max_regression: 0
max_target_seqs: 100000
procs: 1
single_thread: false
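The max_regression arithmetic can be illustrated with a small sketch; the 300 bp / 10% figures come from the comment above:

```python
# Sketch of the max_regression window: on a cDNA with an open ORF, Mikado
# looks for an alternative in-frame start codon within the first
# cdna_length * max_regression bases (illustrative arithmetic only).
def regression_window(cdna_length, max_regression):
    """Number of leading bases in which an alternative start codon is sought."""
    return int(cdna_length * max_regression)

assert regression_window(300, 0.1) == 30  # the example from the comment above
assert regression_window(300, 0) == 0     # default: no regression at all
```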
Hint
The most expensive operation in a "Mikado serialise" run is by far the serialisation of the BLAST files. Splitting the input files into multiple chunks, and analysing them separately, allows Mikado to parallelise the analysis of the BLAST results. If a single monolithic XML/ASN file is produced, by contrast, Mikado will be quite slow, as it will have to parse it in full.
This section of the configuration file deals with the picking stage of Mikado. It specifies details on how to handle BLAST and ORF data, which alternative splicing events are considered as valid during the final stages of the picking, and other important algorithmic details. The section comprises the following subsections:
- alternative_splicing: Options related to which AS events are considered as valid for the primary transcript in a locus.
- chimera_split: Options related to how to handle transcripts with multiple valid ORFs.
- files: Input and output files.
- orf_loading: Options related to how to decide which ORFs to load onto each transcript.
- output_format: options related to how to format the names of the transcripts, the source field of the GFFs, etc.
- run_options: Generic options related either to the general algorithm or to the number of resources requested.
- scoring_file: This value specifies the scoring file to be used for Mikado. These can be found in Mikado.configuration.scoring_files.
Hint
It is possible to ask for the configuration file to be copied in-place for customisation when calling mikado configure
.
- source_score: in this section, it is possible to specify bonus/malus values to be assigned to specific labels. Eg, it is possible to assign a bonus of 1 to any transcript coming from PacBio reads, or a malus to any transcript coming from a given assembler. Example of such a configuration:
pick:
source_score:
- Cufflinks: 0
- Trinity: 0
- PacBio: 2
- Stringtie: 1
In this example, we asked Mikado to consider Stringtie transcripts as more trustworthy than the rest (1 additional point), and PacBio transcripts even more so (2 additional points).
Each subsection of the pick configuration will be explained in its own right.
After selecting the best model for each locus, Mikado will backtrack and try to select valid alternative splicing events. This section deals with how Mikado will operate the selection. In order to be considered as valid potential AS events, transcripts have to satisfy the minimum requirements specified in the scoring file. These are the available parameters:
- report: boolean. Whether to calculate and report possible alternative splicing events at all. By default this is set to true; setting this parameter to false will inactivate all the options in this section.
- keep_retained_introns: boolean. It specifies whether transcripts with retained introns will be retained. A retained intron is defined as an exon at least partly non-coding, whose non-coding part falls within the intron of another transcript (so, retained intron events which yield a valid ORF will not be excluded). By default, such transcripts will be excluded.
- min_cdna_overlap: minimum cDNA overlap between the primary transcript and the AS candidate. By default, this is set to 0 and we rely only on the class code and the CDS overlap. It must be a number between 0 and 1.
- min_cds_overlap: minimum CDS overlap between the primary transcript and the AS candidate. By default this is set to 0.6, ie 60%. It must be a number between 0 and 1.
- min_score_perc: Minimum percentage of the score of the primary transcript that any candidate AS must have to be considered. By default, this is set to 0.6 (60%). It must be a number between 0 and 1.
- only_confirmed_introns: boolean. This parameter determines whether to consider only transcripts whose introns are confirmed in the dataset of reliable junctions, or whether to consider all possible candidate transcripts.
- redundant_ccodes: any candidate AS will be compared against all the transcripts already retained in the locus. If any of these comparisons returns one of the class codes specified in this array, the transcript will be ignored. Default class codes: =, _, m, c, n, C
- valid_ccodes: any candidate AS will be compared against the primary transcript to determine the type of AS event. If the class code is one of those specified in this array, the transcript will be considered further. Valid class codes are within the categories “Alternative splicing”, “Extension” with junction F1 lower than 100%, and Overlap (with the exclusion of “m”). Default class codes: j, J, g, G, h.
- pad: boolean option. If set to true, Mikado will try to pad transcripts so that they share the same ends. Disabled by default.
- ts_max_splices: numerical. When padding is activated, the maximum number of splice junctions that the extended exon may cross.
- ts_distance: numerical. When padding is activated, the maximum number of base pairs by which an exon may be extended.
Warning
the AS transcript event does not need to be a valid AS event for all transcripts in the locus, only against the primary transcript.
alternative_splicing:
# Parameters related to alternative splicing reporting.
# - report: whether to report at all or not the AS events.
# - min_cds_overlap: minimum overlap between the CDS of the primary transcript
# and any AS event. Default: 60%.
# - min_cdna_overlap: minimum overlap between the CDNA of the primary transcript
# and any AS event.
# Default: 0% i.e. disabled, we check for the CDS overlap.
# - keep_retained_introns: Whether to consider as valid AS events where one intron
# is retained compared to the primary or any other valid AS. Default: false.
# - max_isoforms: Maximum number of isoforms per locus. 1 implies no AS reported.
# Default: 3
# - valid_ccodes: Valid class codes for AS events. Valid codes are in categories
# Alternative splicing, Extension (with junction F1 lower than 100%),
# and Overlap (excluding m). Default: j, J, g, G, C, h
# - max_utr_length: Maximum length of the UTR for AS events. Default: 10e6 (i.e.
# no limit)
# - max_fiveutr_length: Maximum length of the 5UTR for AS events. Default:
# 10e6 (i.e. no limit)
# - max_threeutr_length: Maximum length of the 3'UTR for AS events. Default:
# 10e6 (i.e. no limit)
# - min_score_perc: Minimum score threshold for subsequent AS events.
# Only transcripts with a score at least (best) * value are retained.
# - only_confirmed_introns: bring back AS events only when their introns are
# either
# present in the primary transcript or in the set of confirmed introns.
# - pad: boolean switch. If true, Mikado will pad all the transcripts in a gene
# so that their ends are the same
# - ts_distance: if padding, this is the maximum distance in base-pairs between
# the starts of transcripts
# to be considered to be padded together.
# - ts_max_splices: if padding, this is the maximum amount of splicing junctions
# that the transcript to pad
# is allowed to cross. If padding would lead to cross more than this number,
# the transcript will not be padded.
keep_retained_introns: false
max_isoforms: 5
min_cdna_overlap: 0.5
min_cds_overlap: 0.75
min_score_perc: 0.5
only_confirmed_introns: true
pad: false
redundant_ccodes:
- c
- m
- _
- '='
- n
report: true
ts_distance: 300
ts_max_splices: 1
valid_ccodes:
- j
- J
- C
- G
- g
- h
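The alternative-splicing screen described in this section can be sketched as below. This is an illustration of the documented checks (valid and redundant class codes, overlap and score thresholds), not Mikado's actual code; the default values mirror the configuration shown above:

```python
# Sketch of the AS-event screen: keep a candidate only if its class code
# against the primary is valid, its code against every retained transcript
# is not redundant, and overlaps/score pass the configured thresholds.
VALID = {"j", "J", "C", "G", "g", "h"}
REDUNDANT = {"c", "m", "_", "=", "n"}

def keep_as_event(code_vs_primary, codes_vs_retained,
                  cdna_overlap, cds_overlap, score, best_score,
                  min_cdna=0.5, min_cds=0.75, min_score_perc=0.5):
    if code_vs_primary not in VALID:
        return False
    if any(code in REDUNDANT for code in codes_vs_retained):
        return False
    if cdna_overlap < min_cdna or cds_overlap < min_cds:
        return False
    return score >= best_score * min_score_perc

assert keep_as_event("j", ["J"], 0.8, 0.9, 50, 60) is True
assert keep_as_event("=", [], 1.0, 1.0, 60, 60) is False  # "=" is not a valid AS code
```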
Note
New in version 1 beta 10.
This section influences how Mikado clusters transcripts in its multi-stage selection. The available parameters are:
- flank: numerical. When constructing Superloci, Mikado will use this value as the maximum distance between transcripts for them to be integrated within the same superlocus.
- cds_only: boolean. If set to true, during the picking stage Mikado will consider only the primary ORF to evaluate whether two transcripts intersect. Transcripts which eg share introns in their UTRs but have completely unrelated CDSs will be clustered separately. Disabled by default.
- purge: boolean. If true, any transcript failing the specified requirements will be purged out. Otherwise, it will be assigned a score of 0 and might potentially appear in the final output, if no other transcript is present in the locus.
- simple_overlap_for_monoexonic: boolean. During the second clustering, by default monoexonic transcripts are clustered together even if they have only a very slight overlap with another transcript. Manually setting this flag to false will cause Mikado to cluster monoexonic transcripts only if they have a minimum amount of cDNA and CDS overlap with the other transcripts in the holder.
- min_cdna_overlap: numerical, between 0 and 1. Minimum cDNA overlap between two multiexonic transcripts for them to be considered as intersecting, if all other conditions fail.
- min_cds_overlap: numerical, between 0 and 1. Minimum CDS overlap between two multiexonic transcripts for them to be considered as intersecting, if all other conditions fail.
clustering:
# Parameters related to the clustering of transcripts into loci.
# - cds_only: boolean, it specifies whether to cluster transcripts only according
# to their CDS (if present).
# - min_cds_overlap: minimal CDS overlap for the second clustering.
# - min_cdna_overlap: minimal cDNA overlap for the second clustering.
# - flank: maximum distance for transcripts to be clustered within the same superlocus.
# - remove_overlapping_fragments: boolean, it specifies whether to remove putative
# fragments.
# - purge: boolean, it specifies whether to remove transcripts which fail the
# minimum requirements check - or whether to ignore those requirements altogether.
  # - simple_overlap_for_monoexonic: boolean. If set to true (default), then any
  # overlap means inclusion in a locus for a monoexonic transcript. If set to
  # false, normal controls for the percentage of overlap will apply.
# - max_distance_for_fragments: maximum distance from a valid locus for another
# to be considered a fragment.
cds_only: false
flank: 200
min_cdna_overlap: 0.2
min_cds_overlap: 0.2
purge: true
simple_overlap_for_monoexonic: true
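One way to picture the min_cdna_overlap / min_cds_overlap thresholds is as an overlap fraction between two intervals. The definition below (overlap length over the length of the shorter interval) is an assumption made for illustration; Mikado's exact formula may differ:

```python
# Sketch: overlap fraction between two genomic intervals, as a stand-in for
# the min_cdna_overlap / min_cds_overlap checks above (illustrative only;
# normalising by the shorter interval is an assumption, not Mikado's code).
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Overlap length divided by the length of the shorter interval."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    return overlap / shorter

assert overlap_fraction(100, 200, 150, 300) == 0.5  # 50 bp shared of a 100 bp span
assert overlap_fraction(0, 100, 200, 300) == 0.0    # disjoint intervals
```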
This section determines how Mikado treats potential fragments in the output. Available options:
- remove: boolean, default true. If set to true, fragments will be excluded from the final output; otherwise, they will be printed out, but properly tagged.
- max_distance: numerical. For non-overlapping fragments, this value determines the maximum distance from the valid gene. Eg. with the default setting of 2000, a putative fragment at a distance of 1000 bp will be tagged and dealt with as a fragment; an identical model at a distance of 3000 bp will be considered a valid gene and left untouched.
- valid_class_codes: valid class codes for potential fragments. Only Class Codes in the categories Overlap, Intronic, Fragment, with the addition of “_”, are considered as valid choices.
fragments:
# Parameters related to the handling of fragments.
# - remove: boolean. Whether to remove fragments or leave them, properly tagged.
# - max_distance: maximum distance of a putative fragment from a valid gene.
# - valid_class_codes: which class codes will be considered as fragments. Default:
# (p, P, x, X, i, m, _). Choices: _ plus any class code with category
# Intronic, Fragment, or Overlap.
max_distance: 2000
remove: true
valid_class_codes:
- p
- P
- x
- X
- i
- m
- _
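The fragment logic above amounts to a simple two-condition test, sketched here for illustration (this is not Mikado's implementation):

```python
# Sketch: a model is treated as a fragment if its class code against a valid
# gene is in valid_class_codes and it lies within max_distance of that gene.
FRAGMENT_CODES = {"p", "P", "x", "X", "i", "m", "_"}

def is_fragment(class_code, distance, max_distance=2000):
    return class_code in FRAGMENT_CODES and distance <= max_distance

assert is_fragment("p", 1000) is True    # tagged and handled as a fragment
assert is_fragment("p", 3000) is False   # too far: treated as a valid gene
assert is_fragment("j", 0) is False      # not a fragment class code
```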
This section of the configuration file deals with how to determine valid ORFs for a transcript from those present in the database. The parameters to control the behaviour of Mikado are the following:
- minimal_orf_length: minimal length of the primary ORF to be loaded onto the transcript. By default, this is set at 50 bps (nucleotides, not amino acids).
- minimal_secondary_orf_length: minimal length of any ORF that can be assigned to the transcript after the first. This value should be set higher than minimal_orf_length, in order to avoid loading uORFs [uORFs] into the transcript, which would lead to spurious breakdowns of the UTRs. Default: 200 bps.
- strand_specific: boolean. If set to true, only ORFs on the plus strand (ie the same as the cDNA) will be considered. If set to false, monoexonic transcripts might have their strand flipped.
pick:
orf_loading:
# Parameters related to ORF loading.
# - minimal_secondary_orf_length: Minimum length of a *secondary* ORF
# to be loaded after the first, in bp. Default: 200 bps
# - minimal_orf_length: Minimum length in bps of an ORF to be loaded,
# as the primary ORF, onto a transcript. Default: 50 bps
# - strand_specific: Boolean flag. If set to true, monoexonic transcripts
# will not have their ORF reversed even if they would have an ORF on the opposite
# strand.
minimal_orf_length: 50
minimal_secondary_orf_length: 200
strand_specific: true
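The two length thresholds can be sketched as a filter over candidate ORF lengths; this is illustrative only, not Mikado's code:

```python
# Sketch: apply minimal_orf_length to the primary (longest) ORF and
# minimal_secondary_orf_length to every ORF after it (illustrative only).
def loadable_orfs(orf_lengths, minimal=50, minimal_secondary=200):
    """Return the ORF lengths that would be loaded, longest first."""
    kept = []
    for length in sorted(orf_lengths, reverse=True):
        threshold = minimal if not kept else minimal_secondary
        if length >= threshold:
            kept.append(length)
    return kept

assert loadable_orfs([300, 150, 60]) == [300]  # 150 and 60 are below the 200 bp bar
assert loadable_orfs([300, 250]) == [300, 250]
assert loadable_orfs([40]) == []               # below the 50 bp primary minimum
```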
This section of the configuration file specifies how to deal with transcripts presenting multiple ORFs, ie putative chimeras (see the section above for parameters related to which ORFs can be loaded). Those are identified as transcripts with more than one ORF, where:
- all the ORFs share the same strand
- all the ORFs are non-overlapping, ie they do not share any bp
In these situations, Mikado can try to deal with the chimeras in five different ways, in decreasingly conservative fashion:
- nosplit: leave the transcript unchanged. The presence of multiple ORFs will affect the scoring.
- stringent: leave the transcript unchanged, unless the two ORFs both have hits in the protein database and none of the hits is in common.
- lenient: leave the transcript unchanged, unless either the two ORFs both have hits in the protein database, none of which is in common, or both have no hits in the protein database.
- permissive: presume the transcript is a chimera, and split it, unless two ORFs share a hit in the protein database.
- split: presume that every transcript with more than one ORF is incorrect, and split them.
If any BLAST hit spans the two ORFs, then the model will be considered as a non-chimera because there is evidence that the transcript constitutes a single unit. The only case when this information will be disregarded is during the execution of the split mode.
These modes can be controlled directly from the pick command line.
The behaviour, and when to trigger the check, is controlled by the following parameters:
- execute: boolean. If set to false, Mikado will operate in the nosplit mode. If set to true, the choice of the mode will be determined by the other parameters.
- blast_check: boolean. Whether to execute the check on the BLAST hits. If set to false, Mikado will operate in the split mode, unless execute is set to false (execute takes precedence over the other parameters).
- blast_params: this section contains the settings relative to the permissive, lenient and stringent mode.
- evalue: maximum evalue of a hit to be assigned to the transcript and therefore be considered.
- hsp_evalue: maximum evalue of a hsp inside a hit to be considered for the analysis.
- leniency: one of LENIENT, PERMISSIVE, STRINGENT. See above for definitions.
- max_target_seqs: integer. when loading BLAST hits from the database, only the first N will be considered for analysis.
- minimal_hsp_overlap: number between 0 and 1. This indicates the overlap that must exist between the HSP and the ORF for the former to be considered for the split.
- min_overlap_duplication: in the case of tandemly duplicated genes, a chimera will have two ORFs that share the same hits, but possibly in a peculiar way - the HSPs will map to the same region of the target sequence. This parameter controls how much overlap counts as a duplication. The default value is 0.9 (90%).
pick:
chimera_split:
# Parameters related to the splitting of transcripts in the presence of
# two or more ORFs. Parameters:
# - execute: whether to split multi-ORF transcripts at all. Boolean.
# - blast_check: whether to use BLAST information to take a decision. See blast_params
# for details.
# - blast_params: Parameters related to which BLAST data we want to analyse.
blast_check: true
blast_params:
# Parameters for the BLAST check prior to splitting.
    # - evalue: Maximum evalue for the whole hit. Default: 1e-6
    # - hsp_evalue: Maximum evalue for any HSP hit (some might be discarded even
    # if the whole hit is valid). Default: 1e-6
# - leniency: One of STRINGENT, LENIENT, PERMISSIVE. Default: LENIENT
# - max_target_seqs: maximum number of hits to consider. Default: 3
# - minimal_hsp_overlap: minimum overlap of the ORF with the HSP (*not* reciprocal).
# Default: 0.8, i.e. 80%
# - min_overlap_duplication: minimum overlap (in %) for two ORFs to consider
# them as target duplications.
# This means that if two ORFs have no HSPs in common, but the coverage of
# their disjoint HSPs covers more
# Than this % of the length of the *target*, they represent most probably
# a duplicated gene.
evalue: 1.0e-06
hsp_evalue: 1.0e-06
leniency: LENIENT
max_target_seqs: 3
min_overlap_duplication: 0.8
minimal_hsp_overlap: 0.9
execute: true
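The five modes can be summarised as a decision function over the BLAST evidence for two consecutive ORFs, following the descriptions of the --mode flag given earlier. This sketch deliberately omits the spanning-hit check described above (a hit covering both ORFs prevents the split in every mode except split):

```python
# Sketch of the five chimera-splitting modes, in decreasing order of
# conservativeness (illustrative only, not Mikado's implementation).
def should_split(mode, orf1_has_hits, orf2_has_hits, share_hit):
    both = orf1_has_hits and orf2_has_hits
    one = orf1_has_hits != orf2_has_hits        # exactly one ORF has hits
    neither = not (orf1_has_hits or orf2_has_hits)
    if mode == "nosplit":
        return False
    if mode == "split":
        return True
    if mode == "stringent":
        return both and not share_hit
    if mode == "lenient":
        return (both and not share_hit) or one
    if mode == "permissive":
        return (both and not share_hit) or one or neither
    raise ValueError("unknown mode: " + mode)

assert should_split("nosplit", True, True, False) is False
assert should_split("stringent", True, True, True) is False   # shared hit: keep whole
assert should_split("lenient", True, False, False) is True    # one ORF lacks a hit
assert should_split("permissive", False, False, False) is True
assert should_split("split", True, True, True) is True
```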
The “files” and “output_format” sections deal respectively with input files for the pick stage and with some basic settings for the GFF output. Options:
- input: input GTF file for the run. It should be the one generated by the prepare stage, ie the out file of the prepare stage.
- loci_out: main output file. It contains the winning transcripts, separated in their own gene loci, in GFF3 format. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
- log: name of the log file. Default: mikado_pick.log
- monoloci_out: this optional output file will contain the transcripts that have been passed to the monoloci phase. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
- subloci_out: this optional output file will contain the transcripts that have been passed to the subloci phase. It will also determine the prefix of the metrics and scores files for this step. See the pick manual page for details on the output.
- output_format: this section specifies some details on the output format.
- id_prefix: prefix for all the final Mikado models. The ID will be <prefix>.<chromosome>G<progressive ID>.
- report_all_orfs: some Mikado models will have more than one ORF (unless pick is operating in the split mode). If this option is set to true, Mikado will report the transcript multiple times, once for each ORF, using different progressive IDs (<model name>.orf<progressive ID>). By default, this option is set to false, and only the primary ORF is reported.
- source: prefix for the source field in the output files. Loci GFF3 files will have "<prefix>_loci", subloci GFF3s will have "<prefix>_subloci", and monoloci will have "<prefix>_monoloci".
pick:
files:
# Input and output files for Mikado pick.
# - gff: input GTF/GFF3 file. Default: mikado_prepared.gtf
# - loci_out: output GFF3 file from Mikado pick. Default: mikado.loci.gff3
# - subloci_out: optional GFF file with the intermediate subloci. Default: no
# output
# - monoloci_out: optional GFF file with the intermediate monoloci. Default:
# no output
# - log: log file for this step.
input: mikado_prepared.gtf
loci_out: mikado.loci.gff3
log: mikado_pick.log
monoloci_out: ''
output_dir: .
subloci_out: ''
output_format:
# Parameters related to the output format.
# - source: prefix for the source field in the mikado output.
# - id_prefix: prefix for the ID of the genes/transcripts in the output
id_prefix: mikado
report_all_orfs: false
source: Mikado
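The id_prefix scheme above (<prefix>.<chromosome>G<progressive ID>) can be sketched as a simple format string; the chromosome name and counter below are hypothetical examples:

```python
# Sketch: compose a gene ID following the <prefix>.<chromosome>G<progressive ID>
# scheme described above (illustrative; not Mikado's naming code).
def gene_id(prefix, chromosome, counter):
    return "{0}.{1}G{2}".format(prefix, chromosome, counter)

assert gene_id("mikado", "Chr1", 1) == "mikado.Chr1G1"
```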
This section deals with other parameters necessary for the run, such as the number of processors to use, but also more important algorithmic parameters such as how to recognise fragments. Parameters:
- consider_truncated_for_retained: normally, Mikado considers as retained introns only events in which a partially coding exon on the 3' side becomes non-coding in the middle of a CDS intron of another transcript in the locus. If this option is set to true, Mikado will also consider as retained intron events cases in which the transcript's CDS simply ends within a CDS intron of another model. Useful eg when dealing with truncated CDS models.
- exclude_cds: whether to remove CDS/UTR information from the Mikado output. Default: false.
- intron_range: tuple that indicates the range of lengths within which most introns should fall. Transcripts with introns either shorter or longer than this interval will potentially be penalised, depending on the scoring scheme. For the paper, this parameter was set to the interval containing 98% of the introns of the reference annotation (i.e. cutting out the 1st and 99th percentiles).
- preload: boolean. In certain cases, i.e. when the database is quite small, it might make sense to preload it in memory rather than relying on SQL queries. Set to false by default.
- shm: boolean. In certain cases, especially when disk access is a severely limiting factor, it might make sense to copy a SQLite database into RAM before querying. If this parameter is set to true, Mikado will copy the SQLite database into a temporary file in RAM, and query it from there.
- shm_db: string. If shm is set to true and this string is non-empty, Mikado will copy the database in memory to a file with this name and leave it there for other Mikado runs. The file will have to be removed manually.
- procs: number of processors to use. Default: 1.
- single_thread: boolean. If set to true, Mikado will completely disable multiprocessing. Useful mostly for debugging reasons.
Warning
The shared-memory options are available only on Linux platforms.
run_options:
# Generic run options.
# - shm: boolean flag. If set and the DB is sqlite, it will be copied onto the
# /dev/shm faux partition
# - shm_db: String. It indicates a DB that has to be copied onto SHM and left
# there for
# concurrent Mikado runs.
# - shm_shared: boolean flag. If set, the database loaded onto SHM will be shared
# and should not be
# deleted at the end of the run (see shm_db).
# for faster access. Default: false
# - exclude_cds: boolean flag. If set, the CDS information will not be printed
# in Mikado output. Default: false
# - procs: number of processes to use. Default: 1
# - preload: boolean flag. If set, the whole database will be preloaded into
# memory for faster access. Useful when
# using SQLite databases.
# - single_thread: boolean flag. If set, multithreading will be disabled - useful
# for profiling and debugging.
# - consider_truncated_for_retained: boolean. Normally, Mikado considers only
# exons which span a whole intron as possible retained intron events. If this
# flag is set to true, also terminal exons will be considered.
# - remove_overlapping_fragments: DEPRECATED, see clustering.
# - purge: DEPRECATED, see clustering.
consider_truncated_for_retained: false
exclude_cds: false
intron_range:
- 60
- 900
preload: false
procs: 1
shm: false
shm_db: ''
single_thread: false
It is possible to set high-level settings for the logs in the log_settings
section:
- log_level: level of the logging for Mikado. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL. By default, Mikado will be quiet and output log messages of severity WARNING or greater.
- sql_level: level of the logging for messages regarding the database connection (through SQLAlchemy). By default, SQLAlchemy will be set in quiet mode and asked to output only messages of severity WARNING or greater.
Warning
Mikado and SQLAlchemy can be greatly verbose if asked to output DEBUG or INFO messages, to the point of slowing down the program significantly due to the amount of writing to disk. Please consider setting the level to DEBUG only when there is a real problem to debug, not otherwise!
log_settings:
# Settings related to the logs. Keys:
# - sql_level: verbosity for SQL calls. Default: WARNING.
# In decreasing order: DEBUG, INFO, WARNING, ERROR, CRITICAL
# - log_level: verbosity. Default: WARNING.
# In decreasing order: DEBUG, INFO, WARNING, ERROR, CRITICAL
log_level: WARNING
sql_level: WARNING
It is also possible to set the type of multiprocessing method that should be used by Python3. The possible choices are “fork”, “spawn”, and “forkserver”.
multiprocessing_method: spawn
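These values map directly onto Python's own multiprocessing start methods; the following sketch (plain Python, not part of Mikado itself) shows the equivalent of the configuration setting above:

```python
import multiprocessing

# The methods available depend on the platform.
print(multiprocessing.get_all_start_methods())  # on Linux: ['fork', 'spawn', 'forkserver']

# Equivalent of "multiprocessing_method: spawn" in the configuration:
# obtain a context that will create worker processes via spawn.
ctx = multiprocessing.get_context("spawn")
```

"fork" is the traditional UNIX default; "spawn" starts workers more slowly but avoids inheriting state such as open database connections, which is why it is often the safer choice.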
The configuration file obeys a specific JSON schema, which can be found at Mikado/configuration/configuration_blueprint.json. Every time a Mikado utility is launched, it checks the configuration file against the schema to validate it. The schema contains non-standard “Comment” and “SimpleComment” string arrays which are used at runtime to generate the comment strings in the YAML output.
Mikado prepare¶
This is the first executive step of the Mikado pipeline. It will accomplish the following goals:
- Collect annotations from disparate annotation files
- Remove redundant assemblies, i.e. assemblies that are identical across the various input files.
- Determine the strand of the transcript junctions
- Ensure uniqueness of the transcript names
- Order the transcript by locus
- Extract the transcript sequences.
Mikado prepare
allows overriding some of the parameters present in the configuration file through command line options, e.g. the input files. Nonetheless, in the interest of reproducibility, we advise configuring everything through the configuration file and supplying it to Mikado prepare without further modifications.
Available parameters:
- json-conf: the most important parameter. This is the configuration file created through Mikado configure.
- fasta: reference genome. Required, either through the command line or through the configuration file.
- out: Output GTF file, with the collapsed transcripts.
- out_fasta: Output FASTA file of the collapsed transcripts.
- start-method: multiprocessing start method.
- verbose, quiet: flags to set the verbosity of Mikado prepare. It is generally not advised to turn the verbose mode on, unless there is a problem to debug, given the verbosity of the output.
- strand-specific: If set, all assemblies will be treated as strand-specific.
- strand-specific-assemblies: comma-separated list of strand specific assemblies.
- list: as an alternative to specifying all the information on the command line, it is possible to give Mikado a tab-separated file with the following contents:
- Location of the file
- label for the file
- whether that assembly is strand-specific or not (write True/False)
- Optionally, a bonus/malus to be associated with transcripts coming from that assembly.
- log: log file. Optional, by default Mikado will print to standard error.
- lenient: flag. If set, transcripts without any canonical splice site will be output as well. By default, they would be discarded.
- single: flag that disables multiprocessing. Mostly useful for debug purposes.
- strip-cds: some aligners (e.g. GMAP) will try to calculate a CDS on the fly for their alignments. Use this flag to discard such CDS sections and retain only the cDNA information.
- minimum_length: minimum length of the transcripts to be kept.
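Based on the fields described above, a list file might look like the following (file names, labels and bonus values are invented for illustration; columns are tab-separated):

```
cufflinks.gtf	cuff	False
stringtie.gtf	st	True	1
trinity.gff3	tr	False	-0.5
```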
Command line usage:
$ mikado prepare --help
usage: Mikado prepare [-h] [--fasta FASTA] [-v | -q]
[--start-method {fork,spawn,forkserver}]
[-s | -sa STRAND_SPECIFIC_ASSEMBLIES] [--list LIST]
[-l LOG] [--lenient] [-m MINIMUM_LENGTH] [-p PROCS]
[-scds] [--labels LABELS] [--single] [-od OUTPUT_DIR]
[-o OUT] [-of OUT_FASTA] [--json-conf JSON_CONF]
[gff [gff ...]]
Mikado prepare analyses an input GTF file and prepares it for the picking
analysis by sorting its transcripts and performing some simple consistency
checks.
positional arguments:
gff Input GFF/GTF file(s).
optional arguments:
-h, --help show this help message and exit
--fasta FASTA Genome FASTA file. Required.
-v, --verbose
-q, --quiet
--start-method {fork,spawn,forkserver}
Multiprocessing start method.
-s, --strand-specific
Flag. If set, monoexonic transcripts will be left on
their strand rather than being moved to the unknown
strand.
-sa STRAND_SPECIFIC_ASSEMBLIES, --strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
Comma-delimited list of strand specific assemblies.
--list LIST Tab-delimited file containing rows with the following
format <file> <label> <strandedness>
-l LOG, --log LOG Log file. Optional.
--lenient Flag. If set, transcripts with only non-canonical
splices will be output as well.
-m MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
Minimum length for transcripts. Default: 200 bps.
-p PROCS, --procs PROCS
Number of processors to use (default 1)
-scds, --strip_cds Boolean flag. If set, ignores any CDS/UTR segment.
--labels LABELS Labels to attach to the IDs of the transcripts of the
input files, separated by comma.
--single Disable multi-threading. Useful for debugging.
-od OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory. Default: current working directory
-o OUT, --out OUT Output file. Default: mikado_prepared.gtf.
-of OUT_FASTA, --out_fasta OUT_FASTA
Output file. Default: mikado_prepared.fasta.
--json-conf JSON_CONF
Configuration file.
Different assemblers will produce data in different formats, typically GFF or GTF, and not necessarily in the same order (if any order is present at all). Mikado will serialise the transcripts from these files and port them all into a standard GTF format. Moreover, it will ensure that each transcript ID appears only once across the input files. The optional labels provided for each file will be attached to the transcript names as prefixes, and used as the source field in the output GTF, to ensure the uniqueness of each transcript name. If two or more transcripts are found to be identical, only one will be retained, chosen at random among all the possibilities. In addition, Mikado prepare will sort the transcripts by coordinate, irrespective of strand, so that they are suitably ordered for the divide-et-impera algorithm of Mikado pick.
Warning
To be considered identical, two transcripts must match down to the last base pair. A simple match or containment of the intron chain will not suffice. This is because using the cDNA data alone it is difficult to understand whether the longer form(s) is the correct assembly rather than a chimera or a trans-splice event.
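The identity criterion can be sketched in Python. This is a simplified illustration, not Mikado's actual code, under the assumption that a transcript can be represented by its chromosome, strand and exon coordinates:

```python
def transcripts_identical(t1, t2):
    """Two transcripts are redundant only if they match down to the last
    base pair: same chromosome, same strand, identical exon coordinates."""
    return (t1["chrom"] == t2["chrom"]
            and t1["strand"] == t2["strand"]
            and t1["exons"] == t2["exons"])

a = {"chrom": "Chr5", "strand": "+", "exons": [(100, 500), (700, 900)]}
b = {"chrom": "Chr5", "strand": "+", "exons": [(100, 500), (700, 900)]}
c = {"chrom": "Chr5", "strand": "+", "exons": [(100, 500), (700, 950)]}  # longer 3' exon

print(transcripts_identical(a, b))  # True: exact match, only one copy is kept
print(transcripts_identical(a, c))  # False: containment is not enough
```

Note how transcript c, despite sharing its intron chain with a, is not considered redundant: it could be the correct longer form, or a chimera.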
During its run, Mikado prepare will also check the correctness of the transcripts. In particular:
- Unless the assembly is marked as strand-specific, any monoexonic transcript will have its strand removed.
- If a transcript contains canonical splice junctions on both strands, it will be completely removed.
- If a transcript contains only non-canonical splice junctions, it will be removed unless the “lenient” option is specified either at the command line or in the configuration file.
The pairs of splice donors and acceptors which are considered canonical can be specified in the configuration file. By default, Mikado will consider as canonical both the properly canonical splicing event (GT-AG) and the semi-canonical events (GC-AG, AT-AC). Any other pair will be considered non-canonical.
Warning
Mikado will check the strand of each junction inside a transcript independently. Therefore, if a transcript with 9 junctions on the plus strand is found to have a non-canonical splicing junction which happens to be the reverse complement of a canonical one (e.g. CT-AC), Mikado will deem that junction misassigned to the wrong strand and flip it to the minus strand. In this example, the transcript will therefore be considered an error, as it contains both + and - junctions, and discarded.
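The per-junction logic described in the warning can be illustrated as follows. This is a didactic sketch, not Mikado's implementation; it only shows why a CT-AC junction is reassigned to the minus strand:

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))

# Canonical and semi-canonical (donor, acceptor) dinucleotide pairs.
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

def junction_strand(donor, acceptor):
    """'+' if the pair is canonical as given, '-' if it becomes canonical
    once reverse-complemented, None if it is non-canonical on both strands."""
    if (donor, acceptor) in CANONICAL:
        return "+"
    if (revcomp(acceptor), revcomp(donor)) in CANONICAL:
        return "-"
    return None

# CT..AC read on the plus strand is GT..AG read on the minus strand:
print(junction_strand("CT", "AC"))  # '-'
print(junction_strand("GT", "AG"))  # '+'
```

A transcript mixing '+' and '-' junctions would then be flagged as an error, as described above.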
Mikado prepare will produce two files:
- a sorted GTF file, containing all the transcripts surviving the checks
- a FASTA file of the transcripts, in the proper cDNA orientation.
Warning
Contrary to other tools such as gffread from Cufflinks [Cufflinks], Mikado prepare will not try to calculate the loci for the transcripts. This task will be performed later in the pipeline. As such, the GTF file is formally incorrect, as multiple transcripts in the same locus but coming from different assemblies will not share the same gene_id; rather, each will have kept its original one. Moreover, if two gene_ids were identical in the input files but discrete (i.e. located on different sections of the genome), this error will not be corrected. If you desire to use this GTF file for any purpose, please use a tool such as gffread to calculate the loci appropriately.
Mikado serialise¶
Mikado integrates data from multiple sources to select the best transcripts. During this step, these sources are brought together inside a common database, simplifying the process of retrieving them at runtime. Currently, Mikado integrates three different types of data:
- reliable junctions, as detected by Portcullis, in BED12 format
- ORF data, currently identified using TransDecoder, but any program capable of generating it in BED12 format is suitable.
- BLASTX [Blastplus] data in XML format
After serialisation in the database, these sources will be available for any subsequent Mikado pick run. Having the data present in a database makes it possible to run Mikado with multiple configurations and little overhead in terms of pre-loading data; this feature is taken direct advantage of in the Daijin pipeline for driving Mikado, where it is possible to run Mikado using multiple modes.
Mikado serialise can use three different SQL databases as backends - SQLite, MySQL and PostgreSQL - thanks to SQLAlchemy. This step, together with the creation of the TransDecoder and BLAST data, is the most time-consuming part of the pipeline. In particular, although Mikado serialise will try to analyse the XML data in a parallelised fashion if so instructed, the insertion of the data into the database will still happen in a single thread and will therefore be of limited speed. If using SQLite as the database (the default option), it is possible to decrease the runtime by modifying the “max_objects” parameter, at the cost, however, of increased RAM usage.
When Mikado analyses ORFs produced by TransDecoder or an equivalent program, it additionally performs the following checks:
- Check the congruence between the length of the transcript in the BED12 file and that found in the FASTA file
- Check that the ORF does not contain internal stop codons
- Check that the CDS length is valid, i.e. a multiple of 3, if the ORF is complete
- Optionally, if the ORF is open on the 5’ side, Mikado can try to find an internal start codon. See this section for details.
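The first three checks can be sketched as follows. This is an illustrative Python rendition, not Mikado's actual implementation; the 1-based, closed coordinate convention for the CDS is an assumption made for the example:

```python
def check_orf(bed_length, fasta_length, cds, has_internal_stop, is_complete):
    """Return the list of problems found for one ORF.
    cds is a (start, end) tuple in 1-based, closed coordinates."""
    errors = []
    # 1. The transcript length in the BED12 file must match the FASTA.
    if bed_length != fasta_length:
        errors.append("BED12 and FASTA lengths disagree")
    # 2. No stop codon is allowed inside the ORF.
    if has_internal_stop:
        errors.append("internal stop codon")
    # 3. A complete ORF must have a CDS length that is a multiple of 3.
    cds_length = cds[1] - cds[0] + 1
    if is_complete and cds_length % 3 != 0:
        errors.append("complete ORF length is not a multiple of 3")
    return errors

print(check_orf(1200, 1200, (101, 700), False, True))  # []: a 600 bp CDS is valid
print(check_orf(1200, 1150, (101, 701), False, True))  # two problems reported
```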
mikado serialise
allows overriding some of the parameters present in the configuration file through command line options, e.g. the input files. Nonetheless, in the interest of reproducibility, we advise configuring everything through the configuration file and supplying it to Mikado serialise without further modifications.
Available parameters:
- Parameters related to performance:
- start-method: one of fork, spawn, forkserver. It determines the multiprocessing start method. By default, Mikado will use the default for the system (fork on UNIX, spawn on Windows).
- procs: Number of processors to use.
- single-thread: flag. If set, Mikado will disable all multithreading.
- max_objects: Maximum number of objects to keep in memory before committing to the database. See this section of the configuration for details.
- Basic input data and settings:
- output-dir: directory where the SQLite database and the log will be written to.
- transcripts: these are the input transcripts that are present on the GTF file considered by Mikado. Normally this should be the output of Mikado prepare.
- genome_fai: FAIDX file of the genome FASTA. If not given, serialise will derive it from the “reference: genome” field of the configuration.
- force: flag. If set, and the database is already present, it will be truncated rather than updated.
- json-conf: this is the configuration file created with Mikado configure.
- db: if the database is specified on the command line,
mikado serialise
will interpret it as a SQLite database. This will override any setting present in the configuration file.
- Parameters related to logging:
- log: log file. It defaults to serialise.log.
- log_level: verbosity of the logging. Please be advised that excessive verbosity can negatively impact the performance of the program - the debug mode is extremely verbose.
- Parameters related to reliable junctions:
- junctions: a BED12 file of reliable junctions. This can be obtained using Portcullis. Please see the relative sidebar.
- Parameters related to the treatment of ORF data:
- orfs: ORF BED12 files, separated by comma.
- max-regression: a percentage, expressed as a number between 0 and 1, which indicates how far Mikado can regress along the ORF to find a valid start codon. See the relative section in the configuration for details.
- Parameters related to BLAST data:
- blast_targets: BLAST FASTA database.
- discard-definition: Flag. Depending on how the database has been created, sometimes BLAST will substitute the ID of the sequence with “lcl|” ids. Mikado circumvents this by looking for the definition field in the XML file. Using this flag will disable this behaviour and force Mikado to use the ID - with the potential of having a mismatch between the sequences in the BLAST DB and the sequences in the BLAST files.
- xml: BLAST files to parse. This can be one of the following:
- A list of XML BLAST files, optionally compressed with GZip or BZip2, comma separated (suffix .xml)
- A list of ASN BLAST files, optionally compressed with GZip or BZip2, comma separated (suffix .asn)
- A list of folders, comma separated, where it is possible to find files of the former two types
- A mixture of the three above types.
- max-target-seqs: maximum number of BLAST targets that can be loaded per sequence, for each BLAST alignment. Please note that if you align against multiple databases, this threshold will be applied once per file.
Hint
Mikado will parallelise only the reading of multiple XML files. As such, this part of the pipeline scales less well than the other steps.
Warning
It is advised to set the start method to spawn even on UNIX. See the dedicated sidebar for details.
Usage:
$ mikado serialise --help
usage: Mikado serialise [-h] [--start-method {fork,spawn,forkserver}]
[--orfs ORFS] [--transcripts TRANSCRIPTS]
[-mr MAX_REGRESSION]
[--max_target_seqs MAX_TARGET_SEQS]
[--blast_targets BLAST_TARGETS] [--discard-definition]
[--xml XML] [-p PROCS] [--single-thread]
[--genome_fai GENOME_FAI] [--junctions JUNCTIONS]
[-mo MAX_OBJECTS] [-f] --json-conf JSON_CONF
[-l [LOG]] [-od OUTPUT_DIR]
[-lv {DEBUG,INFO,WARN,ERROR}]
[db]
Mikado serialise creates the database used by the pick program. It handles
Junction and ORF BED12 files as well as BLAST XML results.
optional arguments:
-h, --help show this help message and exit
--start-method {fork,spawn,forkserver}
Multiprocessing start method.
-od OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory. Default: current working directory
--orfs ORFS ORF BED file(s), separated by commas
--transcripts TRANSCRIPTS
Transcript FASTA file(s) used for ORF calling and
BLAST queries, separated by commas. If multiple files
are given, they must be in the same order of the ORF
files. E.g. valid command lines are:
--transcript_fasta all_seqs1.fasta --orfs all_orfs.bed
--transcript_fasta seq1.fasta,seq2.fasta --orfs
orfs1.bed,orf2.bed --transcript_fasta all_seqs.fasta
--orfs orfs1.bed,orf2.bed These are invalid instead: #
Inverted order --transcript_fasta
seq1.fasta,seq2.fasta --orfs orfs2.bed,orf1.bed #Two
transcript files, one ORF file --transcript_fasta
seq1.fasta,seq2.fasta --orfs all_orfs.bed
-mr MAX_REGRESSION, --max-regression MAX_REGRESSION
"Amount of sequence in the ORF (in %) to backtrack in
order to find a valid START codon, if one is absent.
Default: None
--max_target_seqs MAX_TARGET_SEQS
Maximum number of target sequences.
--blast_targets BLAST_TARGETS
Target sequences
--discard-definition Flag. If set, the sequences IDs instead of their
definition will be used for serialisation.
--xml XML XML file(s) to parse. They can be provided in three
ways: - a comma-separated list - as a base folder -
using bash-like name expansion (*,?, etc.). In this
case, you have to enclose the filename pattern in
double quotes. Multiple folders/file patterns can be
given, separated by a comma.
-p PROCS, --procs PROCS
Number of threads to use for analysing the BLAST
files. This number should not be higher than the total
number of XML files.
--single-thread Force serialise to run with a single thread,
irrespective of other configuration options.
--genome_fai GENOME_FAI
--junctions JUNCTIONS
-mo MAX_OBJECTS, --max-objects MAX_OBJECTS
Maximum number of objects to cache in memory before
committing to the database. Default: 100,000 i.e.
approximately 450MB RAM usage for Drosophila.
-f, --force Flag. If set, an existing database will be deleted
(sqlite) or dropped (MySQL/PostGreSQL) before
beginning the serialisation.
--json-conf JSON_CONF
-l [LOG], --log [LOG]
Optional log file. Default: stderr
-lv {DEBUG,INFO,WARN,ERROR}, --log_level {DEBUG,INFO,WARN,ERROR}
Log level. Default: INFO
db Optional output database. Default: derived from
json_conf
The schema of the database is quite simple, as it is composed of only seven tables in two groups. The first group, chrom and junctions, serialises the information pertaining to the reliable junctions - i.e. information which is relative not to the transcripts but rather to their genomic locations. The second group serialises the data regarding ORFs and BLAST files. The need for a database is mainly driven by the latter, as querying a relational database is faster than retrieving the information from the XML files themselves at runtime.
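Since SQLite is the default backend, the serialised tables can be inspected directly with Python's standard library; a minimal sketch (the database file name mikado.db is an assumption, as the actual name is set in the configuration):

```python
import sqlite3

def list_tables(db_path):
    """Return the names of the tables stored in a SQLite database,
    e.g. one produced by Mikado serialise."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# e.g. list_tables("mikado.db")
```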
Mikado pick¶
This is the final stage of the pipeline, in which Mikado identifies gene loci and selects the best transcripts.
mikado pick
requires as input files the following:
- A sorted GTF file with unique transcript names, derived through the prepare stage.
- A database containing all the external data to be integrated with the transcript structure, derived through the serialisation stage.
- A scoring file specifying the minimum requirements for transcripts and the relevant metrics for scoring. See the section on scoring files for details.
mikado pick
will produce three kinds of output files: a GFF3 file, a metrics file, and a scores file. This triad will be produced for the loci level, and optionally also for the subloci and monoloci levels.
The loci output file is a standard-compliant GFF3, with the addition of superlocus features to indicate the original spans. An example with two superloci, one on the negative and one on the positive strand, follows:
##gff-version 3
##sequence-region Chr5 1 26975502
Chr5 Mikado_loci superlocus 26584796 26601707 . + . ID=Mikado_superlocus:Chr5+:26584796-26601707;Name=superlocus:Chr5+:26584796-26601707
Chr5 Mikado_loci gene 26584796 26587912 23 + . ID=mikado.Chr5G1;Name=mikado.Chr5G1;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5 Mikado_loci mRNA 26584796 26587912 24 + . ID=mikado.Chr5G1.2;Parent=mikado.Chr5G1;Name=mikado.Chr5G1.2;alias=st_Stringtie_STAR.21710.1;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;ccode=j;cov=25.165945;primary=False
Chr5 Mikado_loci exon 26584796 26584879 . + . ID=mikado.Chr5G1.2.exon1;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci five_prime_UTR 26584796 26584879 . + . ID=mikado.Chr5G1.2.five_prime_UTR1;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26585220 26585273 . + . ID=mikado.Chr5G1.2.exon2;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci five_prime_UTR 26585220 26585222 . + . ID=mikado.Chr5G1.2.five_prime_UTR2;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26585223 26585273 . + 0 ID=mikado.Chr5G1.2.CDS1;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26585345 26585889 . + 0 ID=mikado.Chr5G1.2.CDS2;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26585345 26585889 . + . ID=mikado.Chr5G1.2.exon3;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26585982 26586102 . + 1 ID=mikado.Chr5G1.2.CDS3;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26585982 26586102 . + . ID=mikado.Chr5G1.2.exon4;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26586217 26586294 . + 0 ID=mikado.Chr5G1.2.CDS4;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26586217 26586294 . + . ID=mikado.Chr5G1.2.exon5;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26586420 26586524 . + 0 ID=mikado.Chr5G1.2.CDS5;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26586420 26586524 . + . ID=mikado.Chr5G1.2.exon6;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26586638 26586850 . + 0 ID=mikado.Chr5G1.2.CDS6;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26586638 26586850 . + . ID=mikado.Chr5G1.2.exon7;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26586934 26586996 . + 0 ID=mikado.Chr5G1.2.CDS7;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26586934 26586996 . + . ID=mikado.Chr5G1.2.exon8;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26587084 26587202 . + 0 ID=mikado.Chr5G1.2.CDS8;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26587084 26587202 . + . ID=mikado.Chr5G1.2.exon9;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26587287 26587345 . + 1 ID=mikado.Chr5G1.2.CDS9;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26587287 26587345 . + . ID=mikado.Chr5G1.2.exon10;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci CDS 26587427 26587755 . + 2 ID=mikado.Chr5G1.2.CDS10;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci exon 26587427 26587912 . + . ID=mikado.Chr5G1.2.exon11;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci three_prime_UTR 26587756 26587912 . + . ID=mikado.Chr5G1.2.three_prime_UTR1;Parent=mikado.Chr5G1.2
Chr5 Mikado_loci mRNA 26584930 26587912 23 + . ID=mikado.Chr5G1.1;Parent=mikado.Chr5G1;Name=mikado.Chr5G1.1;alias=st_Stringtie_STAR.21710.3;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;cov=2.207630;primary=True
Chr5 Mikado_loci exon 26584930 26585023 . + . ID=mikado.Chr5G1.1.exon1;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci five_prime_UTR 26584930 26585023 . + . ID=mikado.Chr5G1.1.five_prime_UTR1;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26585220 26585273 . + . ID=mikado.Chr5G1.1.exon2;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci five_prime_UTR 26585220 26585222 . + . ID=mikado.Chr5G1.1.five_prime_UTR2;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26585223 26585273 . + 0 ID=mikado.Chr5G1.1.CDS1;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26585345 26585889 . + 0 ID=mikado.Chr5G1.1.CDS2;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26585345 26585889 . + . ID=mikado.Chr5G1.1.exon3;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26585982 26586102 . + 1 ID=mikado.Chr5G1.1.CDS3;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26585982 26586102 . + . ID=mikado.Chr5G1.1.exon4;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26586217 26586294 . + 0 ID=mikado.Chr5G1.1.CDS4;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26586217 26586294 . + . ID=mikado.Chr5G1.1.exon5;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26586420 26586524 . + 0 ID=mikado.Chr5G1.1.CDS5;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26586420 26586524 . + . ID=mikado.Chr5G1.1.exon6;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26586638 26586850 . + 0 ID=mikado.Chr5G1.1.CDS6;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26586638 26586850 . + . ID=mikado.Chr5G1.1.exon7;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26586934 26586996 . + 0 ID=mikado.Chr5G1.1.CDS7;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26586934 26586996 . + . ID=mikado.Chr5G1.1.exon8;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26587084 26587202 . + 0 ID=mikado.Chr5G1.1.CDS8;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26587084 26587202 . + . ID=mikado.Chr5G1.1.exon9;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26587287 26587345 . + 1 ID=mikado.Chr5G1.1.CDS9;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26587287 26587345 . + . ID=mikado.Chr5G1.1.exon10;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci CDS 26587427 26587755 . + 2 ID=mikado.Chr5G1.1.CDS10;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci exon 26587427 26587912 . + . ID=mikado.Chr5G1.1.exon11;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci three_prime_UTR 26587756 26587912 . + . ID=mikado.Chr5G1.1.three_prime_UTR1;Parent=mikado.Chr5G1.1
Chr5 Mikado_loci gene 26588402 26592561 20 + . ID=mikado.Chr5G2;Name=mikado.Chr5G2;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5 Mikado_loci mRNA 26588402 26592561 24 + . ID=mikado.Chr5G2.2;Parent=mikado.Chr5G2;Name=mikado.Chr5G2.2;alias=st_Stringtie_STAR.21710.9.split1;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;ccode=j;cov=0.000000;primary=False
Chr5 Mikado_loci exon 26588402 26588625 . + . ID=mikado.Chr5G2.2.exon1;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci five_prime_UTR 26588402 26588625 . + . ID=mikado.Chr5G2.2.five_prime_UTR1;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26589203 26589279 . + . ID=mikado.Chr5G2.2.exon2;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci five_prime_UTR 26589203 26589237 . + . ID=mikado.Chr5G2.2.five_prime_UTR2;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26589238 26589279 . + 0 ID=mikado.Chr5G2.2.CDS1;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26589386 26590167 . + 0 ID=mikado.Chr5G2.2.CDS2;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26589386 26590167 . + . ID=mikado.Chr5G2.2.exon3;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26590261 26590393 . + 1 ID=mikado.Chr5G2.2.CDS3;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26590261 26590393 . + . ID=mikado.Chr5G2.2.exon4;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26590495 26590566 . + 0 ID=mikado.Chr5G2.2.CDS4;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26590495 26590566 . + . ID=mikado.Chr5G2.2.exon5;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26590641 26590739 . + 0 ID=mikado.Chr5G2.2.CDS5;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26590641 26590739 . + . ID=mikado.Chr5G2.2.exon6;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26590880 26591092 . + 0 ID=mikado.Chr5G2.2.CDS6;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26590880 26591092 . + . ID=mikado.Chr5G2.2.exon7;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26591174 26591236 . + 0 ID=mikado.Chr5G2.2.CDS7;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26591174 26591236 . + . ID=mikado.Chr5G2.2.exon8;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26591324 26591442 . + 0 ID=mikado.Chr5G2.2.CDS8;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26591324 26591442 . + . ID=mikado.Chr5G2.2.exon9;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26591520 26591578 . + 1 ID=mikado.Chr5G2.2.CDS9;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26591520 26591578 . + . ID=mikado.Chr5G2.2.exon10;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26591681 26592002 . + 2 ID=mikado.Chr5G2.2.CDS10;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26591681 26592002 . + . ID=mikado.Chr5G2.2.exon11;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci CDS 26592528 26592561 . + 1 ID=mikado.Chr5G2.2.CDS11;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci exon 26592528 26592561 . + . ID=mikado.Chr5G2.2.exon12;Parent=mikado.Chr5G2.2
Chr5 Mikado_loci mRNA 26588402 26592561 20 + . ID=mikado.Chr5G2.1;Parent=mikado.Chr5G2;Name=mikado.Chr5G2.1;alias=st_Stringtie_STAR.21710.6.split3;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5 Mikado_loci exon 26588402 26588625 . + . ID=mikado.Chr5G2.1.exon1;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci five_prime_UTR 26588402 26588625 . + . ID=mikado.Chr5G2.1.five_prime_UTR1;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26589196 26589279 . + . ID=mikado.Chr5G2.1.exon2;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci five_prime_UTR 26589196 26589237 . + . ID=mikado.Chr5G2.1.five_prime_UTR2;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26589238 26589279 . + 0 ID=mikado.Chr5G2.1.CDS1;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26589386 26590167 . + 0 ID=mikado.Chr5G2.1.CDS2;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26589386 26590167 . + . ID=mikado.Chr5G2.1.exon3;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26590261 26590393 . + 1 ID=mikado.Chr5G2.1.CDS3;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26590261 26590393 . + . ID=mikado.Chr5G2.1.exon4;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26590495 26590566 . + 0 ID=mikado.Chr5G2.1.CDS4;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26590495 26590566 . + . ID=mikado.Chr5G2.1.exon5;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26590641 26590739 . + 0 ID=mikado.Chr5G2.1.CDS5;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26590641 26590739 . + . ID=mikado.Chr5G2.1.exon6;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26590880 26591092 . + 0 ID=mikado.Chr5G2.1.CDS6;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26590880 26591092 . + . ID=mikado.Chr5G2.1.exon7;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26591174 26591236 . + 0 ID=mikado.Chr5G2.1.CDS7;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26591174 26591236 . + . ID=mikado.Chr5G2.1.exon8;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26591324 26591442 . + 0 ID=mikado.Chr5G2.1.CDS8;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26591324 26591442 . + . ID=mikado.Chr5G2.1.exon9;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26591520 26591578 . + 1 ID=mikado.Chr5G2.1.CDS9;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26591520 26591578 . + . ID=mikado.Chr5G2.1.exon10;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26591681 26592002 . + 2 ID=mikado.Chr5G2.1.CDS10;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26591681 26592002 . + . ID=mikado.Chr5G2.1.exon11;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci CDS 26592528 26592561 . + 1 ID=mikado.Chr5G2.1.CDS11;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci exon 26592528 26592561 . + . ID=mikado.Chr5G2.1.exon12;Parent=mikado.Chr5G2.1
Chr5 Mikado_loci gene 26592649 26595691 19 + . ID=mikado.Chr5G3;Name=mikado.Chr5G3;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5 Mikado_loci mRNA 26592720 26595691 19 + . ID=mikado.Chr5G3.1;Parent=mikado.Chr5G3;Name=mikado.Chr5G3.1;alias=st_Stringtie_STAR.21710.7.split2;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20;canonical_number=20;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5 Mikado_loci CDS 26592720 26593365 . + 0 ID=mikado.Chr5G3.1.CDS1;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26592720 26593365 . + . ID=mikado.Chr5G3.1.exon1;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26593449 26593836 . + 2 ID=mikado.Chr5G3.1.CDS2;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26593449 26593836 . + . ID=mikado.Chr5G3.1.exon2;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26593930 26594062 . + 1 ID=mikado.Chr5G3.1.CDS3;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26593930 26594062 . + . ID=mikado.Chr5G3.1.exon3;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26594172 26594243 . + 0 ID=mikado.Chr5G3.1.CDS4;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26594172 26594243 . + . ID=mikado.Chr5G3.1.exon4;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26594318 26594416 . + 0 ID=mikado.Chr5G3.1.CDS5;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26594318 26594416 . + . ID=mikado.Chr5G3.1.exon5;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26594569 26594772 . + 0 ID=mikado.Chr5G3.1.CDS6;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26594569 26594772 . + . ID=mikado.Chr5G3.1.exon6;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26594860 26594922 . + 0 ID=mikado.Chr5G3.1.CDS7;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26594860 26594922 . + . ID=mikado.Chr5G3.1.exon7;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26595003 26595121 . + 0 ID=mikado.Chr5G3.1.CDS8;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26595003 26595121 . + . ID=mikado.Chr5G3.1.exon8;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26595210 26595268 . + 1 ID=mikado.Chr5G3.1.CDS9;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26595210 26595268 . + . ID=mikado.Chr5G3.1.exon9;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci CDS 26595366 26595691 . + 2 ID=mikado.Chr5G3.1.CDS10;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci exon 26595366 26595691 . + . ID=mikado.Chr5G3.1.exon10;Parent=mikado.Chr5G3.1
Chr5 Mikado_loci mRNA 26592649 26595268 21 + . ID=mikado.Chr5G3.2;Parent=mikado.Chr5G3;Name=mikado.Chr5G3.2;abundance=2.390309;alias=cl_Chr5.6283;canonical_junctions=1,2,3,4,5,6,7,8;canonical_number=8;canonical_proportion=1.0;ccode=j;primary=False
Chr5 Mikado_loci exon 26592649 26593365 . + . ID=mikado.Chr5G3.2.exon1;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci five_prime_UTR 26592649 26592719 . + . ID=mikado.Chr5G3.2.five_prime_UTR1;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26592720 26593365 . + 0 ID=mikado.Chr5G3.2.CDS1;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26593449 26593836 . + 2 ID=mikado.Chr5G3.2.CDS2;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26593449 26593836 . + . ID=mikado.Chr5G3.2.exon2;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26593930 26594095 . + 1 ID=mikado.Chr5G3.2.CDS3;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26593930 26594095 . + . ID=mikado.Chr5G3.2.exon3;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26594172 26594243 . + 0 ID=mikado.Chr5G3.2.CDS4;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26594172 26594243 . + . ID=mikado.Chr5G3.2.exon4;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26594318 26594416 . + 0 ID=mikado.Chr5G3.2.CDS5;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26594318 26594416 . + . ID=mikado.Chr5G3.2.exon5;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26594569 26594772 . + 0 ID=mikado.Chr5G3.2.CDS6;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26594569 26594772 . + . ID=mikado.Chr5G3.2.exon6;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26594860 26594922 . + 0 ID=mikado.Chr5G3.2.CDS7;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26594860 26594922 . + . ID=mikado.Chr5G3.2.exon7;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26595003 26595121 . + 0 ID=mikado.Chr5G3.2.CDS8;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26595003 26595121 . + . ID=mikado.Chr5G3.2.exon8;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci CDS 26595210 26595268 . + 1 ID=mikado.Chr5G3.2.CDS9;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci exon 26595210 26595268 . + . ID=mikado.Chr5G3.2.exon9;Parent=mikado.Chr5G3.2
Chr5 Mikado_loci gene 26596207 26598231 20 + . ID=mikado.Chr5G4;Name=mikado.Chr5G4;multiexonic=False;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5 Mikado_loci mRNA 26596207 26598231 20 + . ID=mikado.Chr5G4.1;Parent=mikado.Chr5G4;Name=mikado.Chr5G4.1;alias=st_Stringtie_STAR.21710.6.split3;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21;canonical_number=21;canonical_proportion=1.0;cov=0.000000;primary=True
Chr5 Mikado_loci CDS 26596207 26598192 . + 0 ID=mikado.Chr5G4.1.CDS1;Parent=mikado.Chr5G4.1
Chr5 Mikado_loci exon 26596207 26598231 . + . ID=mikado.Chr5G4.1.exon1;Parent=mikado.Chr5G4.1
Chr5 Mikado_loci three_prime_UTR 26598193 26598231 . + . ID=mikado.Chr5G4.1.three_prime_UTR1;Parent=mikado.Chr5G4.1
Chr5 Mikado_loci gene 26599417 26601137 20 + . ID=mikado.Chr5G5;Name=mikado.Chr5G5;multiexonic=True;superlocus=Mikado_superlocus:Chr5+:26584796-26601707
Chr5 Mikado_loci mRNA 26599417 26601137 20 + . ID=mikado.Chr5G5.1;Parent=mikado.Chr5G5;Name=mikado.Chr5G5.1;abundance=0.371780;alias=cl_Chr5.6286;canonical_junctions=1,2,3,4,5,6;canonical_number=6;canonical_proportion=1.0;primary=True
Chr5 Mikado_loci exon 26599417 26599654 . + . ID=mikado.Chr5G5.1.exon1;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci five_prime_UTR 26599417 26599612 . + . ID=mikado.Chr5G5.1.five_prime_UTR1;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci CDS 26599613 26599654 . + 0 ID=mikado.Chr5G5.1.CDS1;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci CDS 26599767 26600053 . + 0 ID=mikado.Chr5G5.1.CDS2;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci exon 26599767 26600053 . + . ID=mikado.Chr5G5.1.exon2;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci CDS 26600151 26600244 . + 1 ID=mikado.Chr5G5.1.CDS3;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci exon 26600151 26600244 . + . ID=mikado.Chr5G5.1.exon3;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci CDS 26600314 26600394 . + 0 ID=mikado.Chr5G5.1.CDS4;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci exon 26600314 26600394 . + . ID=mikado.Chr5G5.1.exon4;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci CDS 26600497 26600616 . + 0 ID=mikado.Chr5G5.1.CDS5;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci exon 26600497 26600616 . + . ID=mikado.Chr5G5.1.exon5;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci CDS 26600696 26600908 . + 0 ID=mikado.Chr5G5.1.CDS6;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci exon 26600696 26600908 . + . ID=mikado.Chr5G5.1.exon6;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci CDS 26600987 26601085 . + 0 ID=mikado.Chr5G5.1.CDS7;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci exon 26600987 26601137 . + . ID=mikado.Chr5G5.1.exon7;Parent=mikado.Chr5G5.1
Chr5 Mikado_loci three_prime_UTR 26601086 26601137 . + . ID=mikado.Chr5G5.1.three_prime_UTR1;Parent=mikado.Chr5G5.1
###
Chr5 Mikado_loci superlocus 26575364 26579730 . - . ID=Mikado_superlocus:Chr5-:26575364-26579730;Name=superlocus:Chr5-:26575364-26579730
Chr5 Mikado_loci ncRNA_gene 26575364 26579730 18 - . ID=mikado.Chr5G6;Name=mikado.Chr5G6;multiexonic=True;superlocus=Mikado_superlocus:Chr5-:26575364-26579730
Chr5 Mikado_loci ncRNA 26575711 26579730 18 - . ID=mikado.Chr5G6.1;Parent=mikado.Chr5G6;Name=cl_Chr5.6271;abundance=1.141582;alias=cl_Chr5.6271;canonical_junctions=1,2,3,4,5,6,7,8,9,10;canonical_number=10;canonical_proportion=1.0;primary=True
Chr5 Mikado_loci exon 26575711 26575797 . - . ID=mikado.Chr5G6.1.exon1;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26575885 26575944 . - . ID=mikado.Chr5G6.1.exon2;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26576035 26576134 . - . ID=mikado.Chr5G6.1.exon3;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26576261 26577069 . - . ID=mikado.Chr5G6.1.exon4;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26577163 26577288 . - . ID=mikado.Chr5G6.1.exon5;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26577378 26577449 . - . ID=mikado.Chr5G6.1.exon6;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26577856 26577937 . - . ID=mikado.Chr5G6.1.exon7;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26578239 26578792 . - . ID=mikado.Chr5G6.1.exon8;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26579079 26579161 . - . ID=mikado.Chr5G6.1.exon9;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26579301 26579395 . - . ID=mikado.Chr5G6.1.exon10;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci exon 26579602 26579730 . - . ID=mikado.Chr5G6.1.exon11;Parent=mikado.Chr5G6.1
Chr5 Mikado_loci ncRNA 26578496 26579563 13 - . ID=mikado.Chr5G6.3;Parent=mikado.Chr5G6;Name=tr_c73_g1_i1.mrna1.160;alias=tr_c73_g1_i1.mrna1.160;canonical_junctions=1;canonical_number=1;canonical_proportion=1.0;ccode=j;gene_name=c73_g1_i1;primary=False
Chr5 Mikado_loci exon 26578496 26578518 . - . ID=mikado.Chr5G6.3.exon1;Parent=mikado.Chr5G6.3
Chr5 Mikado_loci exon 26579301 26579563 . - . ID=mikado.Chr5G6.3.exon2;Parent=mikado.Chr5G6.3
Chr5 Mikado_loci ncRNA 26575364 26578163 16 - . ID=mikado.Chr5G6.2;Parent=mikado.Chr5G6;Name=cuff_cufflinks_star_at.23553.1;alias=cuff_cufflinks_star_at.23553.1;fpkm=2.9700103727;canonical_junctions=1,2,3,4,5,6,7,8;canonical_number=8;canonical_proportion=1.0;ccode=j;conf_hi=3.260618;conf_lo=2.679403;cov=81.895309;frac=0.732092;primary=False
Chr5 Mikado_loci exon 26575364 26575410 . - . ID=mikado.Chr5G6.2.exon1;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26575495 26575620 . - . ID=mikado.Chr5G6.2.exon2;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26575711 26575797 . - . ID=mikado.Chr5G6.2.exon3;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26575885 26575944 . - . ID=mikado.Chr5G6.2.exon4;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26576035 26576134 . - . ID=mikado.Chr5G6.2.exon5;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26576261 26577069 . - . ID=mikado.Chr5G6.2.exon6;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26577163 26577288 . - . ID=mikado.Chr5G6.2.exon7;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26577378 26577449 . - . ID=mikado.Chr5G6.2.exon8;Parent=mikado.Chr5G6.2
Chr5 Mikado_loci exon 26577856 26578163 . - . ID=mikado.Chr5G6.2.exon9;Parent=mikado.Chr5G6.2
###
- Things to note:
- Multiple RNAs of the same gene are distinguished by progressive enumeration after a “.” (e.g. mikado.Chr5G5.1, mikado.Chr5G5.2, etc.).
- All RNAs retain their original name in the “alias” attribute. If a transcript was split due to the presence of multiple ORFs, its alias will end with “.split<progressive ID>”.
- RNAs carry the boolean attribute “primary”, which indicates whether they are the primary transcript of the gene or an alternative splicing isoform.
- Non-primary RNAs have the additional “ccode” attribute, which reports the class code assigned to them when they were compared to the primary transcript.
- Multiexonic RNAs carry the attributes “canonical_junctions”, “canonical_number”, and “canonical_proportion”. These properties are calculated by Mikado during the prepare stage.
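The attributes listed above all live in the ninth column of the GFF3 file, encoded as `key=value` pairs separated by semicolons. As a minimal illustration (not Mikado's own code), they can be recovered with a few lines of Python; the example line is adapted from the output above:

```python
# Sketch: parsing the attribute column of a Mikado GFF3 mRNA line to
# recover the fields discussed above (alias, primary, ccode).

def parse_attributes(field):
    """Split the ninth GFF3 column into a dictionary of key=value pairs."""
    return dict(pair.split("=", 1) for pair in field.rstrip(";").split(";"))

# An (abridged) mRNA line from the example output in this section:
line = ("Chr5\tMikado_loci\tmRNA\t26592649\t26595268\t21\t+\t.\t"
        "ID=mikado.Chr5G3.2;Parent=mikado.Chr5G3;alias=cl_Chr5.6283;"
        "ccode=j;primary=False")

attributes = parse_attributes(line.split("\t")[8])
print(attributes["alias"])     # the transcript's original name
print(attributes["primary"])   # "False": an alternative splicing isoform
print(attributes["ccode"])     # class code versus the primary transcript
```

Real-world GFF3 parsing should of course use a dedicated library, but the encoding itself is this simple.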
These are tabular files that report the raw value of each metric for every transcript. This is the section of the metrics file corresponding to the GFF3 file above:
tid parent score best_bits blast_score canonical_intron_proportion cdna_length cds_not_maximal cds_not_maximal_fraction combined_cds_fraction combined_cds_intron_fraction combined_cds_length combined_cds_num combined_cds_num_fraction combined_utr_fraction combined_utr_length end_distance_from_junction end_distance_from_tes exon_fraction exon_num five_utr_length five_utr_num five_utr_num_complete has_start_codon has_stop_codon highest_cds_exon_number highest_cds_exons_num intron_fraction is_complete max_intron_length min_intron_length non_verified_introns_num num_introns_greater_than_max num_introns_smaller_than_min number_internal_orfs proportion_verified_introns proportion_verified_introns_inlocus retained_fraction retained_intron_num selected_cds_exons_fraction selected_cds_fraction selected_cds_intron_fraction selected_cds_length selected_cds_num selected_cds_number_fraction selected_end_distance_from_junction selected_end_distance_from_tes selected_start_distance_from_tss snowy_blast_score source_score start_distance_from_tss three_utr_length three_utr_num three_utr_num_complete utr_fraction utr_length utr_num utr_num_complete verified_introns_num
mikado.Chr5G1.2 mikado.Chr5G1 19.0 1086.25 1086.25 1.0 1927 0 0.0 0.87 1.0 1683 10 0.91 0.13 244 0 157 0.92 11 87 2 1 True True 10 10 0.91 True 340 71 10 0 0 1 0.0 0 0.0 0 0.91 0.87 1.0 1683 10 0.91 0 157 87 13.78 0 87 157 1 0 0.13 244 3 1 0
mikado.Chr5G1.1 mikado.Chr5G1 21.89 1086.63 1086.63 1.0 1937 0 0.0 0.87 1.0 1683 10 0.91 0.13 254 0 157 0.92 11 97 2 1 True True 10 10 0.91 True 196 71 10 0 0 1 0.0 0 0.0 0 0.91 0.87 1.0 1683 10 0.91 0 157 97 13.78 0 97 157 1 0 0.13 254 3 1 0
mikado.Chr5G2.2 mikado.Chr5G2 19.04 1140.95 1140.95 1.0 2197 0 0.0 0.88 1.0 1938 11 0.92 0.12 259 0 0 0.92 12 259 2 1 True True 11 11 0.92 True 577 74 11 0 0 1 0.0 0 0.0 0 0.92 0.88 1.0 1938 11 0.92 0 0 259 16.66 0 259 0 0 0 0.12 259 2 1 0
mikado.Chr5G2.1 mikado.Chr5G2 20.06 1140.95 1140.95 1.0 2204 0 0.0 0.88 1.0 1938 11 0.92 0.12 266 0 0 0.92 12 266 2 1 True True 11 11 0.92 True 570 74 11 0 0 1 0.0 0 0.0 0 0.92 0.88 1.0 1938 11 0.92 0 0 266 16.66 0 266 0 0 0 0.12 266 2 1 0
mikado.Chr5G3.2 mikado.Chr5G3 8.59 1193.72 1193.72 1.0 1887 0 0.0 0.96 0.8 1816 9 1.0 0.04 71 0 0 0.75 9 71 1 0 True False 9 9 0.8 False 152 74 8 0 0 1 0.0 0 0.0 0 1.0 0.96 0.8 1816 9 1.0 0 0 71 14.16 0 71 0 0 0 0.04 71 1 0 0
mikado.Chr5G3.1 mikado.Chr5G3 19.0 1353.19 1353.19 1.0 2109 0 0.0 1.0 0.9 2109 10 1.0 0.0 0 0 0 0.83 10 0 0 0 True True 10 10 0.9 True 152 74 9 0 0 1 0.0 0 0.0 0 1.0 1.0 0.9 2109 10 1.0 0 0 0 16.66 0 0 0 0 0 0.0 0 0 0 0
mikado.Chr5G4.1 mikado.Chr5G4 20.0 1258.43 1258.43 1.0 2025 0 0.0 0.98 0 1986 1 1.0 0.02 39 0 39 1.0 1 0 0 0 True True 1 1 0 True 0 0 0 0 0 1 0 0 0.0 0 1.0 0.98 0 1986 1 1.0 0 39 0 16.66 0 0 39 1 0 0.02 39 1 0 0
mikado.Chr5G5.1 mikado.Chr5G5 20.0 565.46 565.46 1.0 1184 0 0.0 0.79 1.0 936 7 1.0 0.21 248 0 52 1.0 7 196 1 0 True True 7 7 1.0 True 112 69 6 0 0 1 0.0 0 0.0 0 1.0 0.79 1.0 936 7 1.0 0 52 196 13.67 0 196 52 1 0 0.21 248 2 0 0
mikado.Chr5G6.2 mikado.Chr5G6 17.1 0 0 1.0 1735 0 0 0.0 0 0 0 0.0 1.0 0 0 0 0.56 9 0 0 0 False False 0 0 0.62 False 406 84 0 0 0 0 1.0 0.89 0.0 0 0.0 0.0 0 0 0 0.0 0 0 0 0 0 0 0 0 0 1.0 0 0 0 8
mikado.Chr5G6.1 mikado.Chr5G6 17.9 0 0 1.0 2197 0 0 0.0 0 0 0 0.0 1.0 0 0 0 0.69 11 0 0 0 False False 0 0 0.77 False 406 87 3 0 0 0 0.9 1.0 0.0 0 0.0 0.0 0 0 0 0.0 0 0 0 0 0 0 0 0 0 1.0 0 0 0 9
mikado.Chr5G6.3 mikado.Chr5G6 13.0 0 0 1.0 286 0 0 0.0 0 0 0 0.0 1.0 0 0 0 0.12 2 0 0 0 False False 0 0 0.08 False 782 782 1 0 0 0 0.0 0.0 0.0 0 0.0 0.0 0 0 0 0.0 0 0 0 0 0 0 0 0 0 1.0 0 0 0 0
As can be seen, metrics can assume values across very different ranges. We direct you to the metrics section of the documentation for further details.
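Since the metrics file is plain tab-separated text with a header row, it can be inspected directly with standard tools. A minimal sketch (not part of Mikado) that reports the range of values a given metric assumes across all transcripts; `metrics.tsv` is a hypothetical file name standing in for the table above:

```python
import csv

# Sketch: loading a Mikado metrics file (tab-separated, one row per
# transcript) and inspecting the range of values of a single metric.

def metric_range(path, metric):
    """Return (min, max) of the named metric column across all rows."""
    with open(path) as handle:
        values = [float(row[metric])
                  for row in csv.DictReader(handle, delimiter="\t")]
    return min(values), max(values)

# e.g. metric_range("metrics.tsv", "cdna_length")
# in the example table above, cdna_length spans from 286 to 2204.
```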
This file contains the scores assigned to each metric for each transcript. Only metrics that have been used for the scoring will be present. This is the section of the scores file corresponding to the above GFF3 file:
tid parent score blast_score cdna_length cds_not_maximal cds_not_maximal_fraction combined_cds_fraction combined_cds_intron_fraction combined_cds_length combined_cds_num end_distance_from_junction exon_fraction exon_num five_utr_length five_utr_num highest_cds_exon_number intron_fraction number_internal_orfs proportion_verified_introns retained_fraction retained_intron_num selected_cds_fraction selected_cds_intron_fraction selected_cds_length selected_cds_num source_score three_utr_length three_utr_num
mikado.Chr5G1.2 mikado.Chr5G1 19.0 0.0 0.0 1 1 0.0 1 1 1 1 1 1 0.0 1.0 1 1 1.0 1 1 1 0.0 1 1 1 0 0.0 1.0
mikado.Chr5G1.1 mikado.Chr5G1 21.89 1.0 1.0 1 1 0.06 1 1 1 1 1 1 0.77 1.0 1 1 1.0 1 1 1 0.06 1 1 1 0 0.0 1.0
mikado.Chr5G2.2 mikado.Chr5G2 19.04 1 0.0 1 1 0.0 1 1 1 1 1 1 0.04 1.0 1 1 1.0 1 1 1 0.0 1 1 1 0 0.0 0.0
mikado.Chr5G2.1 mikado.Chr5G2 20.06 1 1.0 1 1 0.03 1 1 1 1 1 1 0.0 1.0 1 1 1.0 1 1 1 0.03 1 1 1 0 0.0 0.0
mikado.Chr5G3.1 mikado.Chr5G3 19.0 1.0 1.0 1 1 0.0 1.0 1.0 1.0 1 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1 1 1 0.0 1.0 1.0 1.0 0 0.0 0.0
mikado.Chr5G3.2 mikado.Chr5G3 8.59 0.0 0.0 1 1 0.19 0.0 0.0 0.0 1 0.0 0.0 0.71 0.5 0.0 0.0 1.0 1 1 1 0.19 0.0 0.0 0.0 0 0.0 0.0
mikado.Chr5G4.1 mikado.Chr5G4 20.0 1 1 1 1 0.0 1 1 1 1 1 1 0.0 0.0 1 1 1.0 1 1 1 0.0 1 1 1 0 0.0 1.0
mikado.Chr5G5.1 mikado.Chr5G5 20.0 1 1 1 1 0.0 1 1 1 1 1 1 0.0 0.0 1 1 1.0 1 1 1 0.0 1 1 1 0 0.0 1.0
mikado.Chr5G6.3 mikado.Chr5G6 13.0 1 0.0 1 1 0.0 1 1 1 1 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 1 1 0.0 1 1 1 0 0.0 0.0
mikado.Chr5G6.2 mikado.Chr5G6 17.1 1 0.76 1 1 0.0 1 1 1 1 0.78 0.78 0.0 0.0 1 0.78 0.0 1.0 1 1 0.0 1 1 1 0 0.0 0.0
mikado.Chr5G6.1 mikado.Chr5G6 17.9 1 1.0 1 1 0.0 1 1 1 1 1.0 1.0 0.0 0.0 1 1.0 0.0 0.9 1 1 0.0 1 1 1 0 0.0 0.0
The final score of each transcript is obtained by summing the scores of all its individual metrics.
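This relationship can be checked directly on the scores file: in each row, the values in the metric columns add up to the `score` column. A minimal sketch (not Mikado's own code; `scores.tsv` is a hypothetical file name for the table above):

```python
import csv

# Sketch: verifying that each transcript's final score is the sum of
# its individual metric scores. The first three columns of the scores
# file (tid, parent, score) are followed by one column per scored metric.

def check_scores(path, tolerance=0.1):
    """Yield (tid, reported score, recomputed sum) for any mismatch."""
    with open(path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            total = sum(float(value) for key, value in row.items()
                        if key not in ("tid", "parent", "score"))
            if abs(total - float(row["score"])) > tolerance:
                yield row["tid"], float(row["score"]), total

# list(check_scores("scores.tsv")) should be empty for a Mikado run;
# e.g. in the example above, mikado.Chr5G1.2's metric scores sum to 19.0.
```

The small tolerance accounts for the rounding of the reported scores.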
Important
If you compare the scores assigned to transcripts at the loci level with those assigned at the subloci level, you will notice that the scores differ, and that even some of the raw metric values do. The former phenomenon is due to the fact that the Mikado scoring system is relative rather than absolute; the latter, to the fact that some metrics are locus-dependent, i.e. their values change depending on the presence or absence of other transcripts. A typical example is given by the “retained_intron” metrics: retained introns are identified by looking for non-coding regions of a transcript that fall inside the intron of another transcript. Changing the transcripts in the locus will change the value of this metric, as non-coding sections will or will not be classified as “retained introns”, and therefore also the score associated with both the metric and the transcript.
mikado pick
allows the user to modify some of the parameters of the run at runtime. However, some sections - such as most of the settings regarding alternative splicing - are left untouched by the utility, and are best modified by editing the configuration file itself. The available parameters are as follows:
- json-conf: required. This is the configuration file created in the first step of the pipeline.
- gff: optionally, it is possible to point Mikado pick to its input GTF directly on the command line. This file should be the output of the preparation step. Please note that this file must be in GTF format and sorted by chromosome and position; if that is not the case, Mikado will fail.
- db: optionally, it is possible to specify the database to Mikado on the command line, rather than in the configuration file. Currently, this option supports SQLite databases only.
- Options related to how Mikado will treat the data:
- intron_range: this option expects a pair of positive integers, in ascending order, indicating the 98% confidence interval within which most intron lengths should fall. Gene models with introns whose lengths fall outside this range might be penalised, depending on the scoring system used. If uncertain, it is possible to run the included stats utility on the gene annotation of a closely related species.
- purge: flag. If set, Mikado will not just identify putative fragments - it will completely exclude them from the output.
- flank: for the purpose of identifying fragments, it is useful to consider together loci which are not necessarily overlapping but which lie relatively near each other on the genome sequence. This parameter (a positive integer) specifies the maximum distance within which Mikado will gather loci together for this purpose.
- mode: how Mikado will treat BLAST and ORF data in the presence of putative chimeras. See the relevant section in the configuration page for details.
- Options regarding the output files:
- output-dir: Output directory. By default, Mikado will write all files and the log in the current directory.
- loci_out: required. This is the main output file, in GFF format.
- prefix: this ID will be prefixed to all gene and transcript models. In general, IDs will be of the form “<prefix>.<chromosome><progressive ID>”. Default: Mikado.
- source: source field prefix for the output files. Useful e.g. for loading Mikado runs into WebApollo [Apollo].
- no_cds: if present, this flag will instruct Mikado not to print out the CDS of the selected models, but only their transcript structures.
- subloci_out: If requested, Mikado can output the data regarding the first intermediate step, i.e. the subloci. See the introduction for details.
- monoloci_out: If requested, Mikado can output the data regarding the second intermediate step, i.e. the monosubloci. See the introduction for details.
- Options regarding the resources to be used:
- procs: number of processors to use.
- start-method: multiprocessing start method. See the explanation on Python multiprocessing
- single: flag. If present, multiprocessing will be disabled.
- shared-memory: flag, available on Unix systems only. If set, Mikado will try to copy the SQLite database in RAM. It might provide a small boost if disk access is a limitation. Ineffective with databases other than SQLite.
- shared-memory-db: if the database has already been copied in memory, its new location can be given with this argument. Useful to have multiple Mikado picks share the same DB.
- preload: flag. If present, Mikado will load the database in memory before execution. Discouraged unless the database size is quite small.
- Options regarding logging:
- log: name of the log file. By default, “pick.log”
- verbose: sets the log level to DEBUG. Please be advised that the debug mode is extremely verbose and is best reserved for real, targeted debugging sessions.
- noverbose: sets the log level to ERROR. If set, in most cases, the log file will be practically empty.
- log-level: this flag directly sets the log level. Available values: DEBUG, INFO, WARNING, ERROR.
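To illustrate the idea behind the flank parameter described above: loci whose boundaries lie within the flank distance of each other are gathered into the same group for the purpose of fragment detection. The following is an illustrative sketch of distance-based grouping, not Mikado's actual implementation:

```python
# Sketch of the grouping implied by the "flank" parameter: consecutive
# loci closer than "flank" bases end up in the same group.

def group_loci(loci, flank=1000):
    """loci: iterable of (start, end) coordinate tuples."""
    groups = []
    for start, end in sorted(loci):
        # Start a new group unless this locus lies within "flank" bases
        # of the end of the previous locus.
        if groups and start - groups[-1][-1][1] <= flank:
            groups[-1].append((start, end))
        else:
            groups.append([(start, end)])
    return groups

# Two loci 800 bp apart are grouped together; a third one 5 kb
# downstream starts a new group.
print(group_loci([(1000, 2000), (2800, 3500), (8500, 9000)], flank=1000))
```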
Usage:
$ mikado pick --help
usage: Mikado pick [-h] [--start-method {fork,spawn,forkserver}] [-p PROCS]
--json-conf JSON_CONF [--scoring-file SCORING_FILE]
[-i INTRON_RANGE INTRON_RANGE] [--pad]
[--subloci_out SUBLOCI_OUT] [--monoloci_out MONOLOCI_OUT]
[--loci_out LOCI_OUT] [--prefix PREFIX] [--no_cds]
[--source SOURCE] [--flank FLANK] [--purge]
[--subloci-from-cds-only] [--monoloci-from-simple-overlap]
[-db SQLITE_DB] [-od OUTPUT_DIR] [--single] [-l LOG]
[-v | -nv] [-lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--mode {nosplit,stringent,lenient,permissive,split}]
[gff]
positional arguments:
gff
optional arguments:
-h, --help show this help message and exit
--start-method {fork,spawn,forkserver}
Multiprocessing start method. (default: None)
-p PROCS, --procs PROCS
Number of processors to use. Default: look in the
configuration file (1 if undefined) (default: None)
--json-conf JSON_CONF
JSON/YAML configuration file for Mikado. (default:
None)
--scoring-file SCORING_FILE
Optional scoring file for the run. It will override
(default: None)
-i INTRON_RANGE INTRON_RANGE, --intron-range INTRON_RANGE INTRON_RANGE
Range into which intron lengths should fall, as a
couple of integers. Transcripts with intron lengths
outside of this range will be penalised. Default: (60,
900) (default: None)
--pad Whether to pad transcripts in loci. (default: False)
--subloci_out SUBLOCI_OUT
--monoloci_out MONOLOCI_OUT
--loci_out LOCI_OUT This output file is mandatory. If it is not specified
in the configuration file, it must be provided here.
(default: None)
--prefix PREFIX Prefix for the genes. Default: Mikado (default: None)
--no_cds Flag. If set, not CDS information will be printed out
in the GFF output files. (default: None)
--source SOURCE Source field to use for the output files. (default:
None)
--flank FLANK Flanking distance (in bps) to group non-overlapping
transcripts into a single superlocus. Default:
determined by the configuration file. (default: None)
--purge Flag. If set, the pipeline will suppress any loci
whose transcripts do not pass the requirements set in
the JSON file. (default: False)
--subloci-from-cds-only
"Flag. If set, Mikado will only look for overlap in
the coding features when clustering transcripts
(unless one transcript is non-coding, in which case
the whole transcript will be considered). Default:
False, Mikado will consider transcripts in their
entirety. (default: False)
--monoloci-from-simple-overlap
"Flag. If set, in the final stage Mikado will cluster
transcripts by simple overlap, not by looking at the
presence of shared introns. Default: False. (default:
False)
-db SQLITE_DB, --sqlite-db SQLITE_DB
Location of an SQLite database to overwrite what is
specified in the configuration file. (default: None)
-od OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory. Default: current working directory
(default: None)
--single Flag. If set, Creator will be launched with a single
process. Useful for debugging purposes only. (default:
False)
--mode {nosplit,stringent,lenient,permissive,split}
Mode in which Mikado will treat transcripts with
multiple ORFs. - nosplit: keep the transcripts whole.
- stringent: split multi-orf transcripts if two
consecutive ORFs have both BLAST hits and none of
those hits is against the same target. - lenient:
split multi-orf transcripts as in stringent, and
additionally, also when either of the ORFs lacks a
BLAST hit (but not both). - permissive: like lenient,
but also split when both ORFs lack BLAST hits - split:
split multi-orf transcripts regardless of what BLAST
data is available. (default: None)
Log options:
-l LOG, --log LOG File to write the log to. Default: decided by the
configuration file. (default: None)
-v, --verbose Flag. If set, the debug mode will be activated.
(default: False)
-nv, --noverbose Flag. If set, the log will report only errors and
critical events. (default: False)
-lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Logging level. Default: retrieved by the configuration
file. (default: None)
mikado pick
uses a divide-and-conquer algorithm to find and analyse loci separately. As the data to be integrated with the transcripts is stored in the database, rather than calculated on the fly, rerunning pick
with different options takes little time and resources.
To keep the data sorted, Mikado will write out temporary files during the operation and merge them at the end of the run (see the function merge_loci_gff
in the picking module).
The Mikado pipeline is composed of four different stages, which have to be executed serially:
- Mikado configure, for creating the configuration file that will be used throughout the run.
- Mikado prepare, for collapsing the input assemblies into a single file. After this step, it is possible to perform additional analyses on the data, such as running TransDecoder (highly recommended), Portcullis, or BLAST.
- Mikado serialise, to gather all external data into a single database.
- Mikado pick, to perform the actual selection of the best transcripts in each locus.
Compare¶
- This Mikado utility allows the user to compare the transcripts from any two annotations. Its output allows:
- To understand which reference transcript each prediction is most similar to
- To understand which prediction transcript best represent each reference model
- To obtain summary information about the similarity between the two annotations.
Mikado compare has been directly inspired by the popular Cuffcompare [Cufflinks] utility and by ParsEval [ParsEval]. Please note that while superficially similar to Cuffcompare in the style of the output files, Mikado compare is more philosophically similar to ParsEval, as it will not try to aggregate transcripts in loci but will perform a pure comparison between the two annotation files. Both GTF and GFF files are accepted, in any combination.
Mikado compare is invoked by specifying the reference annotation and the desired mode of analysis. There are three possible options:
- In its default mode, compare will ask for a prediction annotation to compare the reference against.
- In the “self” mode, compare will do a self-comparison of the reference against itself, excluding as possible results the matches between a transcript and itself. It can be useful to glean the relationships between transcripts and genes in an annotation.
- In the “internal” mode of operation, compare will again perform a self-comparison, focused on multi-isoform genes. For those, compare will perform and report all possible pairwise comparisons. It is useful to understand the relationships between the transcripts of a single locus.
Mikado stores the information of the reference in a specialised index, with a “.midx” suffix, which will be created by the program upon its first execution with a new reference. If the index file is already present, Mikado will try to use it rather than re-read the annotation.
$ mikado compare --help
usage: Mikado compare [-h] -r REFERENCE
(-p PREDICTION | --self | --internal | --index)
[--distance DISTANCE] [-pc] [-o OUT] [--lenient] [-eu]
[-n] [-l LOG] [-v]
optional arguments:
-h, --help show this help message and exit
--distance DISTANCE Maximum distance for a transcript to be considered a
polymerase run-on. Default: 2000
-pc, --protein-coding
Flag. If set, only transcripts with a CDS (both in
reference and prediction) will be considered.
-o OUT, --out OUT Prefix for the output files. Default: mikado_compare
--lenient If set, exonic statistics will be calculated leniently
in the TMAP as well - ie they will consider an exon as
match even if only the internal junction has been
recovered.
-eu, --exclude-utr Flag. If set, reference and prediction transcripts
will be stripped of their UTRs (if they are coding).
-n, --no-index, --no-save-index
Unless this flag is set, compare will save an index of
the reference to quicken multiple calls.
-l LOG, --log LOG
-v, --verbose
Prediction and annotation files.:
-r REFERENCE, --reference REFERENCE
Reference annotation file. By default, an index will
be crated and saved with the suffix ".midx".
-p PREDICTION, --prediction PREDICTION
Prediction annotation file.
--self Flag. If set, the reference will be compared with
itself. Useful for understanding how the reference
transcripts interact with each other.
--internal Flag. If set, for each gene with more than one
transcript isoform each will be compared to the
others. Useful for understanding the structural
relationships between the transcripts in each gene.
--index Flag. If set, compare will stop after having generated
the GFF index for the reference.
Mikado compare produces two tabular files, tmap and refmap, as well as one statistics file.
TMAP files are tabular files that store the information regarding the best match(es) for each prediction in the reference. The columns are as follows:
- ref_id: Transcript ID of the matched reference model(s).
- ref_gene: Gene ID of the matched reference model(s).
- ccode: class code of the match. See the relevant section on Class codes.
- tid: Transcript ID of the prediction model.
- gid: Gene ID of the prediction model.
- tid_num_exons: Number of exons of the prediction model.
- ref_num_exons: Number of exons of the reference model.
- n_prec: Nucleotide precision of the prediction (TP / length of the prediction)
- n_recall: Nucleotide recall of the reference (TP / length of the reference)
- n_f1: F1 of recall and precision at the nucleotide level.
- j_prec: Splice junction precision of the prediction model (TP / number of splice sites in the prediction)
- j_recall: Splice junction recall of the reference model (TP / number of splice sites in the reference)
- j_f1: F1 of recall and precision at the splice junction level.
- e_prec: Exon precision of the prediction model (TP / number of exons in the prediction). NB: this value is calculated "leniently", ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
- e_recall: Exon recall of the reference model (TP / number of exons in the reference)
- e_f1: F1 of recall and precision at the exon level.
- distance: Distance of the model from its putative match.
- location: location of the match, with the format <chromosome>:<start>..<end>
An example of TMAP file is as follows:
ref_id ref_gene ccode tid gid tid_num_exons ref_num_exons n_prec n_recall n_f1 j_prec j_recall j_f1 e_prec e_recall e_f1 distance location
AT5G66600.2 AT5G66600 = cuff_cufflinks_star_at.23553.1 cuff_cufflinks_star_at.23553.1.gene 9 9 91.30 81.31 86.02 100.00 100.00 100.00 77.78 77.78 77.78 0 Chr5:26575000..26578163
AT5G66600.2 AT5G66600 C cl_Chr5.6272 cl_Chr5.6272.gene 7 9 94.95 72.43 82.18 100.00 75.00 85.71 85.71 66.67 75.00 0 Chr5:26575000..26578087
AT5G66620.1,AT5G66630.1,AT5G66631.1 AT5G66620,AT5G66630,AT5G66631 f,j,j,G st_Stringtie_STAR.21710.15 st_Stringtie_STAR.21710.15.gene 8 11,10,1 19.13,19.95,35.98 54.57,45.65,100.00 28.33,27.76,52.92 28.57,64.29,0.00 20.00,50.00,0.00 23.53,56.25,0.00 12.50,37.50,0.00 9.09,30.00,0.00 10.53,33.33,0.00 0 Chr5:26588402..26598231
Note that the third example is a special case: the prediction transcript matches not one but multiple reference transcripts. This is a fusion event.
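As a sketch of how these tabular lines could be consumed downstream (illustrative code, not part of Mikado), the following parser splits a tmap line into one record per matched reference, handling the comma-separated fields produced by fusion events:

```python
# Hypothetical helper: parse a tmap data line into per-reference records.
# Field order follows the column list documented above.

def parse_tmap_line(line):
    """Split a tab-separated tmap line into one record per reference match."""
    fields = line.rstrip("\n").split("\t")
    ref_ids = fields[0].split(",")
    ref_genes = fields[1].split(",")
    ccodes = fields[2].split(",")
    # In fusion events, an "f" qualifier precedes the per-match class codes.
    is_fusion = ccodes[0] == "f"
    if is_fusion:
        ccodes = ccodes[1:]
    return [
        {"ref_id": rid, "ref_gene": rgene, "ccode": code,
         "tid": fields[3], "gid": fields[4], "fusion": is_fusion}
        for rid, rgene, code in zip(ref_ids, ref_genes, ccodes)
    ]
```

For the fusion line shown above, this yields one record per fused reference transcript, each carrying its own class code.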
RefMap files are tabular files which store the information regarding the best match for each reference transcript, among all possible prediction models. The columns of the file are as follows:
- ref_id: Transcript ID of the reference model.
- ccode: class code of the match. See the relevant section on Class codes.
- tid: Transcript ID of the prediction model.
- gid: Gene ID of the prediction model.
- nF1: F1 of recall and precision at the nucleotide level.
- jF1: F1 of recall and precision at the splice junction level.
- eF1: F1 of recall and precision at the exon level. NB: this value is calculated “leniently”, ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
- ref_gene: Gene ID of the reference model.
- best_ccode: Best possible class code found for any of the transcripts of the gene.
- best_tid: Transcript ID of the prediction model which best fits one of the transcript models of the reference gene.
- best_gid: Gene ID of the prediction model which best fits one of the transcript models of the reference gene.
- best_nF1: F1 of recall and precision at the nucleotide level, for the best possible comparison.
- best_jF1: F1 of recall and precision at the splice junction level, for the best possible comparison.
- best_eF1: F1 of recall and precision at the exon level, for the best possible comparison.
- location: location of the match, with the format <chromosome>:<start>..<end>
An example of a RefMap file is as follows:
ref_id ccode tid gid nF1 jF1 eF1 ref_gene best_ccode best_tid best_gid best_nF1 best_jF1 best_eF1 location
AT5G66610.1 = mikado.Chr5G4.2 mikado.Chr5G4 98.46 100.0 81.82 AT5G66610 = mikado.Chr5G4.2 mikado.Chr5G4 98.46 100.0 81.82 Chr5:26584780..26587912
AT5G66610.2 J mikado.Chr5G4.2 mikado.Chr5G4 93.91 94.74 76.19 AT5G66610 = mikado.Chr5G4.2 mikado.Chr5G4 98.46 100.0 81.82 Chr5:26584774..26587912
AT5G66620.1 j mikado.Chr5G6.1 mikado.Chr5G6 85.51 95.0 72.73 AT5G66620 j mikado.Chr5G6.1 mikado.Chr5G6 85.51 95.0 72.73 Chr5:26588402..26592423
AT5G66630.1 n mikado.Chr5G8.2 mikado.Chr5G8 93.27 94.74 76.19 AT5G66630 n mikado.Chr5G8.2 mikado.Chr5G8 93.27 94.74 76.19 Chr5:26591981..26595922
Please note that the last example (AT5G66630.1) has a fusion event as its best possible match.
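A RefMap file lends itself to quick recovery statistics. The following illustrative snippet (not part of Mikado) tallies the best class code found for each reference transcript, reducing a fusion best match such as "f,j" to its underlying code:

```python
from collections import Counter

def summarise_refmap(lines):
    """Tally the class code of the best match for each reference transcript.
    `lines` are data rows of a refmap file (header excluded)."""
    codes = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        ccode = fields[1]
        # A fusion best match is reported as "f," plus the underlying code;
        # keep only the code of the individual comparison.
        codes[ccode.split(",")[-1]] += 1
    return codes
```

Counting, say, the "=" entries then gives the number of reference transcripts whose intron chain was perfectly reconstructed.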
These files provide a summary of the comparison between the reference and the annotation. An example is as follows:
Command line:
/usr/users/ga002/venturil/py351/bin/mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log
7 reference RNAs in 5 genes
15 predicted RNAs in 8 genes
--------------------------------- | Sn | Pr | F1 |
Base level: 85.74 64.73 73.77
Exon level (stringent): 63.83 42.86 51.28
Exon level (lenient): 80.00 52.94 63.72
Intron level: 89.47 59.65 71.58
Intron chain level: 33.33 14.29 20.00
Transcript level (stringent): 0.00 0.00 0.00
Transcript level (>=95% base F1): 28.57 13.33 18.18
Transcript level (>=80% base F1): 42.86 20.00 27.27
Gene level (100% base F1): 0.00 0.00 0.00
Gene level (>=95% base F1): 40.00 25.00 30.77
Gene level (>=80% base F1): 60.00 37.50 46.15
# Matching: in prediction; matched: in reference.
Matching intron chains: 2
Matched intron chains: 2
Matching monoexonic transcripts: 1
Matched monoexonic transcripts: 1
Total matching transcripts: 3
Total matched transcripts: 3
Missed exons (stringent): 17/47 (36.17%)
Novel exons (stringent): 40/70 (57.14%)
Missed exons (lenient): 9/45 (20.00%)
Novel exons (lenient): 32/68 (47.06%)
Missed introns: 4/38 (10.53%)
Novel introns: 23/57 (40.35%)
Missed transcripts: 0/7 (0.00%)
Novel transcripts: 6/15 (40.00%)
Missed genes: 0/5 (0.00%)
Novel genes: 2/8 (25.00%)
The first section of the file describes:
- Concordance of the two annotations at the base level (recall, precision, and F1)
- Concordance of the two annotations at the exonic level (recall, precision, and F1), in two ways:
- “stringent”: only perfect exonic matches are considered.
- “lenient”: in this mode, terminal exons are counted as a match if the internal border is matched. See the RGASP paper [RGASP] for details on the rationale.
- Concordance of the two annotations at the intron level.
- Concordance of the two annotations at the intron chain level - how many intron chains of the reference are found identical in the prediction. Only multiexonic models are considered for this level.
- Concordance of the two annotations at the transcript level, in three different modes:
- “stringent”: in this mode, only perfect matches are considered.
- “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
- “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
- Concordance of the two annotations at the gene level, in three different modes:
- “stringent”: in this mode, we consider reference genes for which it was possible to find at least one perfect match for one of its transcripts.
- “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
- “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
In the second section, the file reports how many of the intron chains, monoexonic transcripts and total transcripts in the reference were matched by at least one matching prediction transcript. Finally, in the third section the file reports the number of missed (present in the reference but not in the prediction) or novel (vice versa, present in the prediction but not in the reference) features.
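The recall, precision and F1 figures above follow the standard definitions. A minimal sketch of the base-level calculation (illustrative only, not Mikado's actual implementation):

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision; 0 if both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def nucleotide_stats(reference_bases, prediction_bases):
    """Base-level recall, precision and F1, given the sets of genomic
    positions covered by the exons of reference and prediction."""
    tp = len(reference_bases & prediction_bases)  # true positive bases
    recall = tp / len(reference_bases)
    precision = tp / len(prediction_bases)
    return recall, precision, f1(recall, precision)
```

The exon, intron and transcript levels follow the same recall/precision/F1 scheme, differing only in what counts as a true positive.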
Note
Please note that a gene might be considered as "found" even if its best match is intronic, not directly overlapping it, or on the opposite strand (see next section, in particular the Intronic, Fragment and No overlap categories).
In addition to recall, precision and F1 values, Mikado assigns each comparison between two transcripts a class code, which summarises the relationship between the two transcripts. The idea is lifted from the popular tool Cuffcompare, although Mikado greatly extends the catalogue of possible class codes. All class codes fall within one of the following categories:
- Match: class codes of this type indicate concordance between the two transcript models.
- Extension: class codes of this type indicate that one of the two models extends the intron chain of the other, without internal interruptions. The extension can be from either perspective - either the prediction extends the reference, or it is instead contained within the reference (so that switching perspectives, the reference would “extend” the prediction).
- Alternative splicing: the two exon chains overlap but differ in significant ways.
- Intronic: either the prediction is completely contained within the introns of the reference, or viceversa.
- Overlap: the two transcript models generically overlap on their exonic sequence.
- Fragment: the prediction is a fragment of the reference, in most cases because they are on opposite strands.
- No overlap: the prediction and the reference are near but do not directly overlap.
- Fusion: this special class code is a qualifier and it never appears on its own. When a transcript is defined as a fusion, its class code in the tmap file will be an "f" followed by the class codes of the individual transcript matches, separated by commas. So a prediction which matches two reference models, one with a "j" and another with a "o", will have a class code of "f,j,o". In the refmap file, if the fusion is the best match, the class code will be "f" followed by the class code for the individual reference transcript; e.g., "f,j".
Available class codes
Class code | Definition | Reference multiexonic? | Prediction multiexonic? | Nucleotide: RC, PC, F1 | Junction: RC, PC, F1 | Reverse | Category |
---|---|---|---|---|---|---|---|
= | Complete intron chain match. | True | True | NA | 100%, 100%, 100% | = | Match |
_ | Complete match between two monoexonic transcripts. | False | False | NA, NA, >=80% | NA | _ | Match |
n | Intron chain extension, ie. both transcripts are multiexonic and the prediction has novel splice sites outside of the reference transcript boundaries. | True | True | 100%, < 100%, <100% | 100%, < 100%, <100% | c | Extension |
J | Intron chain extension, ie. both transcripts are multiexonic and the prediction has novel splice sites inside of the reference transcript boundaries. | True | True | 100%, <= 100%, <100% | 100%, < 100%, <100% | C | Extension |
c | The prediction is either multiexonic and with its intron chain completely contained within that of the reference, or monoexonic and contained within one of the reference exons. | NA | NA | < 100%, 100%, NA | < 100%, 100%, NA | n | Extension |
C | The prediction intron chain is completely contained within that of the reference transcript, but it partially extends either into its introns or outside of the reference boundaries. | True | True | <= 100%, < 100%, < 100% | < 100%, 100%, < 100% | J or j | Extension |
j | Alternative splicing event. | True | True | NA | <= 100%, 100%, < 100% | j or C | Alternative splicing |
h | Structural match between two models where no splice site is conserved but at least one intron of the reference and one intron of the prediction partially overlap. | True | True | > 0%, > 0%, > 0% | 0%, 0%, 0% | h | Alternative splicing |
g | The monoexonic prediction overlaps one or more exons of the reference transcript; the borders of the prediction cannot fall inside the introns of the reference. The prediction transcript can bridge multiple exons of the reference model. | True | False | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | G | Alternative splicing |
G | Generic match of a multiexonic prediction transcript versus a monoexonic reference. | False | True | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | g | Alternative splicing |
o | Generic overlap between two multiexonic transcripts, which do not share any overlap among their introns. | True | True | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | o | Overlap |
e | Single exon transcript overlapping one reference exon and at least 10 bps of a reference intron, indicating a possible pre-mRNA fragment. | True | False | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | G | Overlap |
m | Generic match between two monoexonic transcripts. | False | False | NA, NA, < 80% | NA | m | Overlap |
i | Monoexonic prediction completely contained within one intron of the reference transcript. | True | False | 0%, 0%, 0% | 0%, 0%, 0% | ri | Intronic |
I | Prediction completely contained within the introns of the reference transcript. | True | True | 0%, 0%, 0% | 0%, 0%, 0% | rI | Intronic |
ri | Reverse intron transcript - the monoexonic reference is completely contained within one intron of the prediction transcript. | False | True | 0%, 0%, 0% | 0%, 0%, 0% | i | Intronic |
rI | Multiexonic reference completely contained within the introns of the prediction transcript. | True | True | 0%, 0%, 0% | 0%, 0%, 0% | I | Intronic |
f | Fusion - this special code is applied when a prediction intersects more than one reference transcript. To be considered for fusions, candidate references must either share at least one splice junction with the prediction, or have at least 10% of their bases recalled. If two or more reference transcripts fit these constraints, then the prediction model is classified as a fusion. | NA | NA | > 10%, 0%, 0% | > 0%, 0%, 0% | NA | Fusion |
x | Monoexonic match on the opposite strand. | NA | False | >0%, >0%, >0% | 0%, 0%, 0% | x or X | Fragment |
X | Multiexonic match on the opposite strand. | NA | True | >0%, >0%, >0% | NA | x or X | Fragment |
p | The prediction is on the same strand of a neighbouring but non-overlapping transcript. Probable polymerase run-on | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | p | Fragment |
P | The prediction is on the opposite strand of a neighbouring but non- overlapping transcript. Probable polymerase run-on. | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | P | Fragment |
u | Unknown - no suitable model has been found near enough the prediction to perform a comparison. | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | NA | Unknown |
Mikado compare conceptualizes the reference annotation as a collection of interval trees, one per chromosome or scaffold, where each node corresponds to an array of genes at the location. The gene and transcript objects are stored separately. The location of each transcript model in the prediction is queried against the tree, with a padding (default 2kbps) to allow for neighbouring but non-overlapping genes, and the transcript itself is subsequently compared with each reference transcript contained in the hits. Each comparison will yield precision, recall and F1 values for the nucleotide, splice junction and exonic levels, together with an associated class code. The best match for the prediction is selected by choosing the comparison yielding the best splice junction F1 and the best nucleotide F1, in this order. If the prediction transcript overlaps two or more genes on the same strand, and for at least two it has one match each with either 10% nucleotide recall or junction recall over 0%, it is deemed a fusion event, and its line in the tmap file will report the best match against each of the fused genes, separated by commas.
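The matching logic just described can be sketched in a few lines. The naive scan below stands in for Mikado's Cython interval tree, and all names are hypothetical:

```python
# Illustrative sketch, not Mikado's actual code: candidate lookup with
# padding, followed by best-match selection on (junction F1, nucleotide F1).

def find_candidates(genes, start, end, padding=2000):
    """Return genes whose span overlaps [start - padding, end + padding]."""
    q_start, q_end = start - padding, end + padding
    return [g for g in genes if g["start"] <= q_end and g["end"] >= q_start]

def best_match(comparisons):
    """Pick the comparison with the best junction F1, ties broken by
    nucleotide F1 - the ordering described above."""
    return max(comparisons, key=lambda c: (c["j_f1"], c["n_f1"]))
```

A real interval tree answers the overlap query in logarithmic rather than linear time, which matters for genome-scale annotations.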
Each calculated match against a reference transcript is stored as a potential best match for the reference transcript. At the end of the run, the hits for each reference transcript will be ordered using the following function:
@staticmethod
def result_sorter(result):

    """
    Method to sort the results for the refmap. Order:
    - CCode does not contain "x", "X", "P", "p" (i.e. fragments on the
      opposite strand or polymerase run-on fragments)
    - Junction F1 (j_f1)
    - Exonic F1 (e_f1)
    - Nucleotide F1 (n_f1)
    - "f" in ccode (i.e. the transcript is a fusion)

    :param result: a ResultStorer object
    :type result: ResultStorer
    :return: (bool, float, float, float, bool)
    """

    bad_ccodes = ["x", "X", "P", "p"]
    bad_ccodes = set(bad_ccodes)

    orderer = (len(set.intersection(bad_ccodes, set(result.ccode))) == 0,
               result.j_f1, result.e_f1,
               result.n_f1,
               "f" in result.ccode)
    return orderer
This function is used to select both for the best match for the transcript, as well as to select among these matches for the best match for the gene.
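As a usage sketch, the best hit for a reference transcript can be obtained by sorting with this key in descending order. The `Result` namedtuple below is a hypothetical stand-in for Mikado's ResultStorer, carrying only the attributes the sorter inspects:

```python
from collections import namedtuple

# Hypothetical stand-in for Mikado's ResultStorer; ccode is a tuple of
# class codes, mirroring how fusion matches carry several codes.
Result = namedtuple("Result", ["ccode", "j_f1", "e_f1", "n_f1"])

def result_sorter(result):
    """Sort key: valid class code first, then junction, exonic and
    nucleotide F1, then fusion status (as in the method shown above)."""
    bad_ccodes = {"x", "X", "P", "p"}
    return (len(bad_ccodes & set(result.ccode)) == 0,
            result.j_f1, result.e_f1, result.n_f1,
            "f" in result.ccode)

hits = [Result(("x",), 100.0, 100.0, 100.0),  # opposite-strand fragment
        Result(("j",), 95.0, 80.0, 90.0),
        Result(("=",), 100.0, 90.0, 98.0)]
best = sorted(hits, key=result_sorter, reverse=True)[0]
# The "=" match wins: valid class code and highest junction F1.
```

Because the first tuple element is a boolean, any hit with a fragment-like class code sorts below every hit with a valid one, regardless of its F1 values.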
The interval tree data structure is created using Cython code originally part of the bx-python library, kindly provided by Dr. Taylor for modification and inclusion in Mikado. The code has been slightly modified to make it Python3-compliant.
The .midx files storing the annotation for repeated compare runs are gzip-compressed JSON files. In them, Mikado will store for each gene its coordinates, its transcripts, and the location of exons and CDS features. MIDX files make repeated runs considerably faster, as the program will not have to re-parse the GFF file.
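Since a .midx file is plain gzip-compressed JSON, it can be inspected with nothing more than the Python standard library. The exact key layout is an implementation detail of Mikado and may change between versions, so treat this as an exploratory sketch:

```python
import gzip
import json

def load_midx(path):
    """Load a .midx index (a gzip-compressed JSON file) into a dict.
    The structure of the resulting dictionary - genes, their coordinates,
    transcripts, exons and CDS features - is Mikado-version dependent."""
    with gzip.open(path, "rt") as handle:
        return json.load(handle)
```

This is handy for quickly checking which genes an index contains without re-running the comparison.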
The comparison code is written in Cython and is crucial during the picking phase of Mikado, not just for the functioning of the comparison utility.
Daijin¶
The Daijin pipeline for driving Mikado¶
No emperor or empress can lead their nation without a trusty chancellor to help in organising the bureaucracy. Daijin, named after the Japanese minister, has the same role in Mikado: it smooths the path from a collection of read inputs (both RNA-Seq and long reads) to a polished transcriptome assembly. The pipeline is based on Snakemake [Snake]; while Snakemake can support any scheduling system, our pipeline manager currently supports only three (SLURM, PBS and LSF), plus any DRMAA-compliant batch submission system. Other schedulers can be added upon request.
This page contains a detailed explanation of how to use Daijin. We also provide a tutorial that will guide you through using the manager to analyse an RNA-Seq sample of Sc. pombe.
Hint
It is possible to launch the two steps of the pipeline directly with Snakemake, using the snakefiles located in Mikado.daijin: tr.snakefile for the first step, and mikado.snakefile for the second.
daijin configure creates the configuration file that will drive Daijin, in YAML format. Most options can be specified on the command line. Available parameters are:
- out: Output file, ie the configuration file for Daijin.
- od: Directory that Daijin will use as master for its jobs.
- genome: Genome FASTA file. Required.
- transcriptome: Optional reference transcriptome, to be used for alignment and assembly. Please note that Mikado will not use the reference annotation, it will only be used during the alignment and assembly steps.
- name: Name to be applied to the genome, eg the species.
Warning
The name must not contain spaces or any non-standard characters. This string will be used, among other things, to determine the name of the index used by the aligners; invalid characters will cause them, and in turn Daijin, to fail.
- prot-db: this parameter specifies a protein database to be used for BLAST. If none is specified, this step will be omitted.
- aligners: aligner(s) to be used during the run. Currently, Daijin supports the following aligners:
- gsnap
- star
- tophat2
- hisat2
- assemblers: assembler(s) to be used during the run. Currently, Daijin supports the following RNA-Seq assemblers:
- cufflinks
- class2
- stringtie
- trinity
- threads: Number of threads to be requested for parallelisable steps.
- modes: Daijin can run Mikado in multiple modes regarding the handling of putative chimeras. Specify those you desire to execute here.
- flank: Amount of flanking that Mikado will allow while looking for fragments around good gene loci. Default 1000 bps. It is advisable to reduce it in species with compact genomes.
- scheduler: if Daijin has to execute the pipeline on a cluster (potentially using DRMAA), it is necessary to specify the scheduler here. At the moment we support the following widely used schedulers: PBS, LSF, SLURM.
- r1, r2, samples: RNA-Seq reads 1, RNA-Seq reads 2, and sample name. At least one of each is required.
- strandedness: if not specified, all samples will be assumed to be unstranded. Specify it as you would with HISAT or TopHat2.
- cluster_config: if specified, a sample cluster configuration file will be copied in the specified location. A cluster configuration file has the following structure:
__default__:
threads: 4
memory: 10000
queue: tgac-medium
asm_trinitygg:
memory: 20000
As can be seen, it is in YAML format with two fields: __default__ and asm_trinitygg. The former specifies common parameters for all steps: the queue to be used, the number of threads to request, and the memory per job. The asm_trinitygg field specifies parameters particular to the asm_trinitygg rule - in this case, increasing the requested memory (Trinity is a de novo assembler and uses much more memory than most reference-guided assemblers). Please note that the number of threads specified on the command line, or in the configuration file proper, takes precedence over the equivalent parameter in the cluster configuration file.
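The precedence of rule-specific over default values can be sketched as follows. The dictionaries stand in for the parsed YAML, and `resources_for` is a hypothetical helper for illustration, not part of Daijin:

```python
# Plain dicts standing in for the parsed cluster configuration YAML.
cluster_config = {
    "__default__": {"threads": 4, "memory": 10000, "queue": "tgac-medium"},
    "asm_trinitygg": {"memory": 20000},
}

def resources_for(rule, config):
    """Resolve the resources for a rule: start from __default__ and let
    any rule-specific entry override individual fields."""
    merged = dict(config["__default__"])
    merged.update(config.get(rule, {}))
    return merged
```

With this resolution, asm_trinitygg keeps the default queue and thread count but requests 20000 MB of memory.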
$ daijin configure --help
usage: daijin configure [-h] [-c CLUSTER_CONFIG] [--threads N] [-od OUT_DIR]
[-o OUT] [--scheduler {,SLURM,LSF,PBS}] [--name NAME]
--genome GENOME [--transcriptome TRANSCRIPTOME]
[-r1 R1 [R1 ...]] [-r2 R2 [R2 ...]]
[-s SAMPLES [SAMPLES ...]]
[-st {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...]]
-al
[{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]]
-as
[{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]]
[--scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml,spombe.yaml}]
[--copy-scoring COPY_SCORING]
[-m {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...]]
[--flank FLANK] [--prot-db PROT_DB [PROT_DB ...]]
optional arguments:
-h, --help show this help message and exit
-al [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]], --aligners [{gsnap,star,hisat,tophat2} [{gsnap,star,hisat,tophat2} ...]]
Aligner(s) to use for the analysis. Choices: gsnap,
star, hisat, tophat2
-as [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]], --assemblers [{class,cufflinks,stringtie,trinity} [{class,cufflinks,stringtie,trinity} ...]]
Assembler(s) to use for the analysis. Choices: class,
cufflinks, stringtie, trinity
Options related to how to run Daijin - threads, cluster configuration, etc.:
-c CLUSTER_CONFIG, --cluster_config CLUSTER_CONFIG
Cluster configuration file to write to.
--threads N, -t N Maximum number of threads per job. Default: 4
-od OUT_DIR, --out-dir OUT_DIR
Output directory. Default if unspecified: chosen name.
-o OUT, --out OUT Output file. If the file name ends in "json", the file
will be in JSON format; otherwise, Daijin will print
out a YAML file. Default: STDOUT.
--scheduler {,SLURM,LSF,PBS}
Scheduler to use. Default: None - ie, either execute
everything on the local machine or use DRMAA to submit
and control jobs (recommended).
Arguments related to the reference species.:
--name NAME Name of the species under analysis.
--genome GENOME, -g GENOME
Reference genome for the analysis, in FASTA format.
Required.
--transcriptome TRANSCRIPTOME
Reference annotation, in GFF3 or GTF format.
Arguments related to the input paired reads.:
-r1 R1 [R1 ...], --left_reads R1 [R1 ...]
Left reads for the analysis. Required.
-r2 R2 [R2 ...], --right_reads R2 [R2 ...]
Right reads for the analysis. Required.
-s SAMPLES [SAMPLES ...], --samples SAMPLES [SAMPLES ...]
Sample names for the analysis. Required.
-st {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...], --strandedness {fr-unstranded,fr-secondstrand,fr-firststrand} [{fr-unstranded,fr-secondstrand,fr-firststrand} ...]
Strandedness of the reads. Specify it 0, 1, or number
of samples times. Choices: fr-unstranded, fr-
secondstrand, fr-firststrand.
Options related to the Mikado phase of the pipeline.:
--scoring {insects.yaml,human.yaml,plants.yaml,worm.yaml}
Available scoring files.
--copy-scoring COPY_SCORING
File into which to copy the selected scoring file, for
modification.
-m {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...], --modes {nosplit,split,permissive,stringent,lenient} [{nosplit,split,permissive,stringent,lenient} ...]
Mikado pick modes to run. Choices: nosplit, split,
permissive, stringent, lenient
--flank FLANK Amount of flanking for grouping transcripts in
superloci during the pick phase of Mikado.
--prot-db PROT_DB [PROT_DB ...]
Protein database to compare against, for Mikado.
Warning
if DRMAA is requested but no scheduler is specified, Daijin will fail. For this reason, always provide a scheduler name when running on a cluster; if none is provided, Daijin will fall back to local execution.
Daijin can be configured to run each assembler and/or aligner multiple times, each time with a different set of parameters. This is achieved by specifying the additional command line arguments in the array for each program. For example, in this section:
align_methods:
hisat:
- ''
- "-k 10"
asm_methods:
class:
- ''
- "--set-cover"
stringtie:
- ''
- "-T 0.1"
we are asking Daijin to execute each program twice:
- HISAT2: once with default parameters, once reporting the 10 best alignments for each pair
- CLASS2: once with default parameters, once using the “set-cover” algorithm
- Stringtie: once with default parameters, once lowering the minimum expression to 0.1 TPM.
If you are running Daijin on a cluster and the software tools have to be sourced or loaded, you can specify the versions and command to load in the configuration file itself. Edit the file at this section:
load:
# Commands to use to load/select the versions of the programs to use. Leave an empty
# string if no loading is necessary.
blast: 'source blast-2.3.0'
class: 'source class-2.12'
cufflinks: ''
gmap: ''
hisat: 'source HISAT-2.0.4'
    mikado: 'source mikado-devel'
portcullis: 'source portcullis-0.17.2'
samtools: 'source samtools-1.2'
star: ''
stringtie: 'source stringtie-1.2.4'
tophat: ''
transdecoder: 'source transdecoder-3.0.0'
trinity: ''
In this case, the cluster requires software to be loaded on demand using the source command, and we specify the versions of the programs we wish to use.
Regarding read alignment, it is also possible to specify the minimum and maximum permissible intron lengths:
short_reads:
  #  Parameters related to the reads to use for the assemblies. Fields:
  #  - r1: array of left read files.
  #  - r2: array of right read files. It must be of the same length as r1; if
  #  one or more of the samples are single-end reads, add an empty string.
  #  - samples: array of the sample names. It must be of the same length as r1.
  #  - strandedness: array of strand-specificity of the samples. It must be of the
  #  same length as r1. Valid values: fr-firststrand, fr-secondstrand, fr-unstranded.
max_intron: 10000
min_intron: 20
r1:
- SRR1617247_1.fastq.gz
r2:
- SRR1617247_2.fastq.gz
samples:
- SRR1617247
strandedness:
- fr-secondstrand
Regarding the Mikado stage of Daijin, the configuration file contains all the fields that can be found in a normal Mikado configuration file. All mikado-specific parameters are stored under the “mikado” field. It is also possible to modify the following:
- blastx: these are parameters regarding the running of BLASTX. This field contains the following variables:
- prot-db: this is an array of FASTA files to use as protein databases for the BLASTX step.
- chunks: number of chunks to divide the BLASTX input into. When dealing with large input FASTA files and with a cluster at one's disposal, it is more efficient to split the input FASTA into multiple smaller files and execute BLASTX on them independently. The default number of chunks is 10. Increase it as you see fit - it often makes sense, especially on clusters, to launch many small jobs rather than a few big ones.
- evalue: maximum e-value for the searches.
- max_target_seqs: maximum number of hits to report.
- transdecoder: parameters related to transdecoder. At the moment only one is available:
- min_protein_len: minimum protein length that TransDecoder should report. The default value set by Mikado, 30, is much lower than TransDecoder's own default (100); this is intentional, as the chimera-splitting algorithm relies on TransDecoder's ability to find short, incomplete ORFs at the different ends of a transcript.
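The chunking described for the BLASTX step can be sketched as a round-robin split of the input records across N buckets; `chunk_fasta` is a hypothetical helper for illustration, not Daijin's actual code:

```python
# Illustrative sketch: distribute FASTA records across N chunks so that
# each chunk can be handed to an independent BLASTX job on the cluster.

def chunk_fasta(records, chunks=10):
    """Distribute (name, sequence) records across `chunks` buckets,
    round-robin, so bucket sizes differ by at most one record."""
    buckets = [[] for _ in range(chunks)]
    for index, record in enumerate(records):
        buckets[index % chunks].append(record)
    return buckets
```

Each bucket would then be written to its own FASTA file and submitted as a separate BLASTX job, with the resulting XML files collected afterwards.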
Daijin will organise the output directory in 5 major sections, plus the configuration file for the Mikado step:
- 1-reads: this folder contains links to the original read files.
- 2-alignments: this folder stores the indices built for each aligner, and the results. The structure is as follows:
- output: this folder contains the final BAM files, sorted and indexed.
- alignments.stats: this file contains a table reporting the most salient parameters derived from samtools stats calls on the multiple BAM files.
- One folder per aligner, inside which are present:
- index folder, containing the genome indices for the tool
- One folder per sample, containing the output BAM file.
- 3-assemblies: this folder contains the RNA-Seq assemblies which will be used as input by Mikado. The structure is as follows:
- output: this folder contains the final GTF/GFF3 assemblies. For each combination of aligner/assembler that has been requested, Daijin will here provide:
- the GTF file
- a statistics file derived using mikado util stats (see the section on this utility for details)
- assembly.stats: a tabular file collecting the most salient data from the statistics files generated for each assembly
- One folder per assembly, containing tool-specific files and the final assembly.
- 4-portcullis: this folder contains the results of Portcullis, if its execution has been requested. The folder will contain the following:
- output: folder which contains a merged BED file of reliable junctions, created by merging all junctions from all alignments.
- One folder per alignment analysed. We redirect you to the documentation of the tool for more details.
- mikado.yaml: final output file of the assemble part of the pipeline. This file will act as the configuration both for Daijin and for Mikado; for a description of the Mikado-specific fields, we refer you to the section on the configuration of the tool.
- 5-mikado: this folder contains the results for mikado. It is organised as follows:
- a link to the genome FASTA, and corresponding FAI file (generated with samtools)
- Files created by the prepare step:
- mikado_prepared.fasta
- mikado_prepared.gtf
- prepare.log
- transdecoder: this folder contains the results of the TransDecoder run against the mikado_prepared.fasta file. Mikado will use the file transcripts.fasta.transdecoder.bed as source for the ORF location.
- blast: this folder contains the BLAST data. In particular:
- index: this folder contains the FASTA file of the protein database, and the BLAST database files.
- fastas: Daijin will split mikado_prepared.fasta into multiple files, for easier runs on a cluster. This folder contains the split FASTA files.
- xmls: this folder contains the XML files corresponding to the BLASTs of the files present in fastas
- logs: this folder contains the log files corresponding to the BLASTs of the files present in fastas
- pick: this folder contains the results of Mikado pick. It is organised as follows:
- One folder per requested Mikado chimera-splitting mode. Inside each folder, it is possible to find:
- mikado-{mode}.loci.gff3: Final GFF3 output file.
- mikado-{mode}.metrics.gff3: Final metrics output file, containing the metrics of the transcripts that have been selected.
- mikado-{mode}.scores.gff3: Final scores output file, containing the scores associated with the evaluated metrics, for each of the selected transcripts.
- mikado-{mode}.loci.stats: statistics file derived using mikado util stats (see the section on this utility for details)
- comparison.stats: this tabular file collects the most salient data from the statistics files generated for each pick mode.
Daijin executes the pipeline in two distinct phases, assemble and mikado. Both commands have the same command line interface, namely:
$ daijin assemble --help
usage: daijin assemble [-h] [-c HPC_CONF] [-d] [--jobs N] [--cores [N]]
[--threads N] [--no_drmaa] [--rerun-incomplete]
[--forcerun TARGET [TARGET ...]] [--detailed-summary]
[--list] [--dag]
config
positional arguments:
config Configuration file to use for running the transcript
assembly pipeline.
optional arguments:
-h, --help show this help message and exit
-c HPC_CONF, --hpc_conf HPC_CONF
Configuration file that allows the user to override
resource requests for each rule when running under a
scheduler in a HPC environment.
-d, --dryrun Do a dry run for testing.
--jobs N, -J N Maximum number of cluster jobs to execute
concurrently.
--cores [N], -C [N] Use at most N cores in parallel (default: 1000).
--threads N, -t N Maximum number of threads per job. Default: None (set
in the configuration file)
--no_drmaa, -nd Use this flag if you wish to run without DRMAA, for
example, if running on a HPC and DRMAA is not
available, or if running locally on your own machine
or server.
--rerun-incomplete, --ri
Re-run all jobs the output of which is recognized as
incomplete.
--forcerun TARGET [TARGET ...], -R TARGET [TARGET ...]
Force the re-execution or creation of the given rules
or files. Use this option if you changed a rule and
want to have all its output in your workflow updated.
--detailed-summary, -D
Print detailed summary of all input and output files
--list, -l List resources used in the workflow
--dag Do not execute anything and print the directed
acyclic graph of jobs in the dot language.
The available command parameters are:
- config: the configuration file.
- hpc_conf: cluster configuration file.
- jobs: Maximum number of jobs that can be executed (if Daijin is in local mode) or be present in the submission queue (if Daijin is in DRMAA/cluster mode) at any one time.
- dryrun: do not execute, just list all the commands that will be executed. Useful also for listing the rules that have to be executed.
- cores: Maximum number of cores that Daijin can claim at any one time.
- threads: Maximum number of cores/threads that can be assigned to any step of the pipeline.
- rerun-incomplete: Ask Snakemake to check which steps have produced empty or incomplete output files, and re-execute them and all the downstream commands.
- forcerun: force the re-execution or creation of the given rules or files, together with all of their downstream outputs.
In the first step of the pipeline, Daijin will perform the following operations for each of the read datasets provided:
- Create the necessary indices for each of the aligner programs requested.
- Align the read dataset using all the different tools requested, in all the possible combinations of parameters requested.
  - For example, it is possible to ask for each dataset to be aligned twice with TopHat2 - once with the “micro-exon” mode activated, the second time without. Both alignments will be run independently.
  - It is possible to specify which datasets are strand-specific and which are not; moreover, it is possible to specify the kind of strand-specificity (fr-secondstrand, fr-firststrand).
- Call all the reliable junctions across the alignments using Portcullis.
- Create the statistics for each alignment using samtools stats, and merge them together in a single file.
- Assemble each alignment with all the tools requested, in all the parameter combinations desired.
- Call the statistics on each assembly using mikado util stats, and merge them together in a single file.
- Create the configuration file for Mikado.
So during this first step Daijin will go from raw reads files to multiple assemblies, and configure Mikado for the second step.
Assembly pipeline, as driven by Daijin
Example of a pipeline to assemble a single paired-end read dataset using one aligner (Hisat [Hisat]) and two different RNA-Seq assemblers (StringTie [StringTie] and CLASS2 [Class2]). Reliable junctions from the alignments are called and merged together using Portcullis.
In this step, the Daijin manager will execute all the steps necessary to perform Mikado on the desired inputs. The manager will execute the following steps:
- Merge all the input assemblies together using Mikado prepare
- Execute TransDecoder [Trinity] on the transcript sequences, to retrieve their ORFs.
- Split the FASTA file in as many chunks as specified during configuration, and analyse them separately
- Execute BLASTX+ [Blastplus] on the split FASTAs, creating BLAST XML outputs.
- Run Mikado serialise to load the BLAST results, TransDecoder ORFs, and portcullis junctions into a single database.
- Run Mikado pick on the data, in the selected modes.
- Collate and collapse the statistics for each of the filtered assemblies.
By default, daijin mikado should use as its configuration file the one created by daijin assemble, which will be located at <output directory>/mikado.yaml.
Mikado pipeline, as driven by Daijin
Hint
If you have already created some assemblies and wish to analyse them with Daijin, it is also possible to configure Mikado externally and use the resulting configuration file to guide Daijin. At the time of this writing, this is also the recommended protocol for including e.g. PacBio or EST alignments.
Mikado provides a pipeline manager, Daijin, to align and assemble transcripts with multiple methods and subsequently choose the best assemblies among the options. The pipeline is implemented using Snakemake [Snake].
Miscellaneous utilities¶
Mikado miscellaneous scripts¶
All these utilities can be accessed with the mikado util
CLI. They perform relatively minor tasks.
This utility is used to retrieve specific regions from a GTF file, without breaking any transcript feature. Any transcript falling even partly within the specified coordinates will be retained in its entirety.
Usage:
$ mikado util awk_gtf --help
usage: mikado.py util awk_gtf [-h] (-r REGION | --chrom CHROM) [-as]
[--start START] [--end END]
gtf [out]
positional arguments:
gtf
out
optional arguments:
-h, --help show this help message and exit
-r REGION, --region REGION
Region defined as a string like <chrom>:<start>..<end>
--chrom CHROM
-as, --assume-sorted
--start START
--end END
This utility is used to obtain information about any class code or category thereof.
Usage:
$ mikado util class_codes --help
usage: mikado util class_codes [-h]
[-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
[-c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]]
[-o OUT]
[{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} ...]]
Script to print out the class codes.
positional arguments:
{[],=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u}
Codes to query.
optional arguments:
-h, --help show this help message and exit
-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
-c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...], --category {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]
-o OUT, --out OUT
This utility is used to convert between GTF and GFF3 files, with the possibility of giving as output BED12 files as well. It is limited to converting transcript features, and will therefore ignore any other feature present (transposons, loci, etc.). The output of the conversion to GFF3 is completely GFF3 compliant.
Usage:
$ mikado util convert --help
usage: mikado.py util convert [-h] [-of {bed12,gtf,gff3}] gf [out]
positional arguments:
gf
out
optional arguments:
-h, --help show this help message and exit
-of {bed12,gtf,gff3}, --out-format {bed12,gtf,gff3}
This utility extracts specific transcripts and genes from an input GTF/GFF3 file. As input, it requires a text file of either the format “<transcript id><tab><gene id>”, or simply gene per line (in which case the “–genes” switch has to be invoked). If only some of the transcripts of a gene are included in the text file, the gene feature will be shrunk accordingly. The name is an obvious homage to the invaluable UNIX command that we all love.
Usage:
$ mikado util grep --help
usage: mikado.py util grep [-h] [-v] [--genes] ids gff [out]
Script to extract specific models from GFF/GTF files.
positional arguments:
ids ID file (format: mrna_id, gene_id - tab separated)
gff The GFF file to parse.
out Optional output file
optional arguments:
-h, --help show this help message and exit
-v Exclude from the gff all the records in the id file.
--genes Flag. If set, the program expects as ids only a list of genes,
and will exclude/include all the transcripts children of the
selected genes.
This script merges together various XML BLAST+ files into a single entity. It might be of use when the input data has been chunked into different FASTA files for submission to a cluster queue. It is also capable of converting from ASN files and of dealing with GZipped files.
Usage:
$ mikado util merge_blast --help
usage: mikado.py util merge_blast [-h] [-v] [-l LOG] [--out [OUT]]
xml [xml ...]
positional arguments:
xml
optional arguments:
-h, --help show this help message and exit
-v, --verbose
-l LOG, --log LOG
--out [OUT]
This command generates the documentation regarding the available transcript metrics. It is generated dynamically by inspecting the code. The documentation in the introduction is generated using this utility.
Usage:
$ mikado util metrics --help
usage: mikado util metrics [-h]
[-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
[-o OUT]
[-c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]]
[metric [metric ...]]
Simple script to obtain the documentation on the transcript metrics.
positional arguments:
metric
optional arguments:
-h, --help show this help message and exit
-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
Format of the table to be printed out.
-o OUT, --out OUT Optional output file
-c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...], --category {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]
Available categories to select from.
This command generates a statistics file for GFF3/GTF files. The output is a table including Average, Mode, and various quantiles for different features present in a typical GFF file (genes, introns, exons, cDNAs, etc.). The operation can be quite time consuming for large files, in which case it is advisable to ask for multiple processors.
Usage:
$ mikado util stats --help
usage: mikado.py util stats [-h] [--only-coding] [-p PROCS] gff [out]
GFF/GTF statistics script. It will compute median/average length of RNAs,
exons, CDS features, etc.
positional arguments:
gff GFF file to parse.
out
optional arguments:
-h, --help show this help message and exit
--only-coding
-p PROCS, --processors PROCS
A typical example statistics file, generated from the TAIR10 annotation, can be found here.
This utility trims down the terminal exons of multiexonic transcripts, until either shrinking them to the desired maximum length or meeting the beginning/end of the CDS. It has been used for generating the “trimmed” annotations for the analysis of the original Mikado paper.
Usage:
$ mikado util trim --help
usage: mikado.py util trim [-h] [-ml MAX_LENGTH] [--as-gtf] ann [out]
positional arguments:
ann Reference GTF/GFF output file.
out
optional arguments:
-h, --help show this help message and exit
-ml MAX_LENGTH, --max_length MAX_LENGTH
Maximal length of trimmed terminal exons
--as-gtf Flag. If set, the output will be in GTF rather than
GFF3 format.
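The trimming rule for a terminal exon - shorten it to at most the maximum length, but never cut into the CDS - reduces to a couple of min/max operations. Below is a sketch for the first exon of a transcript on the plus strand, using 1-based inclusive coordinates; the function and argument names are invented for the example and do not correspond to Mikado's internal code:

```python
def trim_first_exon(exon_start, exon_end, cds_start, max_length):
    """Move the start of the first exon rightwards until the exon is at
    most max_length bp long, stopping at the CDS start if that comes first.
    Returns the new exon start coordinate (1-based, inclusive)."""
    desired_start = exon_end - max_length + 1
    # never trim past the CDS start, and never extend the exon leftwards
    return max(exon_start, min(desired_start, cds_start))
```

For instance, a 1000 bp first exon with the CDS starting at position 900 and a maximum length of 50 would be trimmed only up to position 900, preserving the whole coding region.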
Included scripts¶
All the following scripts are included in the “util” folder in the source code, and will be available on the PATH after installation. Some of these scripts are used by the Daijin pipeline to produce statistics or perform other intermediate steps.
This script is needed to add a top-level transcript feature to GTFs that lack it, e.g. those produced by CuffMerge [CuffMerge].
Usage:
$ add_transcript_feature_to_gtf.py --help
usage: Script to add a transcript feature to e.g. Cufflinks GTFs
[-h] gtf [out]
positional arguments:
gtf Input GTF
out Output file. Default: stdout.
optional arguments:
-h, --help show this help message and exit
This script is used to collect statistics from samtools stats. Usage:
$ align_collect.py --help
usage: Script to collect info from multiple samtools stats files
[-h] input [input ...]
positional arguments:
input The list of samtools stats file to process
optional arguments:
-h, --help show this help message and exit
This script is used to collect statistics obtained from the mikado util stats utility. Output is printed directly to the screen. Usage:
$ asm_collect.py -h
usage: Script to collect info from multiple mikado util stats files
[-h] input [input ...]
positional arguments:
input The list of mikado util stats file to process
optional arguments:
-h, --help show this help message and exit
This script will use PySam to convert read alignments into a GTF file. Mostly useful to convert BAM alignments of long reads (e.g. PacBio) into a format which Mikado can interpret and use.
Usage:
$ bam2gtf.py --help
usage: Script to convert from BAM to GTF, for PB alignments [-h] bam [out]
positional arguments:
bam Input BAM file
out Optional output file
optional arguments:
-h, --help show this help message and exit
Python3 wrapper for the CLASS [Class2] assembler. It will perform the necessary preparatory operations for the assembler (computing the depth and calling the splicing junctions), and launch the program itself. Usage:
$ class_run.py --help
usage: Quick utility to rewrite the wrapper for CLASS. [-h] [--clean]
[--force]
[-c CLASS_OPTIONS]
[-p PROCESSORS]
[--class_help] [-v]
[bam] [out]
positional arguments:
bam Input BAM file.
out Optional output file.
optional arguments:
-h, --help show this help message and exit
--clean Flag. If set, remove temporary files.
--force Flag. If set, it forces recalculation of all
intermediate files.
-c CLASS_OPTIONS, --class_options CLASS_OPTIONS
Additional options to be passed to CLASS. Default: no
additional options.
-p PROCESSORS, --processors PROCESSORS
Number of processors to use with class.
--class_help If called, the wrapper will ask class to display its
help and exit.
-v, --verbose
Script to extract a list of sequences from a FASTA file, using the pyfaidx [PyFaidx] module. Usage:
$ getFastaFromIds.py -h
usage: getFastaFromIds.py [-h] [-v] list fasta [out]
A simple script that retrieves the FASTA sequences from a file given a list of
ids.
positional arguments:
list File with the list of the ids to recover, one by line.
Alternatively, names separated by commas.
fasta FASTA file.
out Optional output file.
optional arguments:
-h, --help show this help message and exit
-v, --reverse Retrieve entries which are not in the list, as in grep -v (a
homage).
Script to convert a GFF junction file to a BED12 file. Useful to format the input for Mikado serialise.
Usage:
$ gffjunc_to_bed12.py --help
usage: GFF=>BED12 converter [-h] gff [out]
positional arguments:
gff
out
optional arguments:
-h, --help show this help message and exit
A script to extract data from column files, using a list of targets. More efficient than a standard “grep -f” for this niche case.
Usage:
$ util/grep.py -h
usage: grep.py [-h] [-v] [-s SEPARATOR] [-f FIELD] [-q] ids target [out]
This script is basically an efficient version of the GNU "grep -f" utility for
table-like files, and functions with a similar syntax.
positional arguments:
ids The file of patterns to extract
target The file to filter
out The output file
optional arguments:
-h, --help show this help message and exit
-v, --reverse Equivalent to the "-v" grep option
-s SEPARATOR, --separator SEPARATOR
The field separator. Default: consecutive
whitespace(s)
-f FIELD, --field FIELD
The field to look in the target file.
-q, --quiet No logging.
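The reason this approach beats a generic grep -f is that the patterns are loaded into a hash set and only a single field per line is tested, making each lookup O(1) instead of a scan over all patterns. A minimal sketch of the idea (names are invented for the example; this is not the script's actual code):

```python
def filter_lines(lines, ids, field=0, separator=None, reverse=False):
    """Yield the lines whose chosen field appears in the set of ids
    (or, with reverse=True, the lines whose field does not).
    separator=None splits on consecutive whitespace, like the script's default."""
    wanted = set(ids)
    for line in lines:
        key = line.rstrip("\n").split(separator)[field]
        if (key in wanted) != reverse:
            yield line
```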
This script will merge [Portcullis]-like junctions into a single BED12, using the thick start/ends as unique keys.
Usage:
$ merge_junction_bed12.py --help
usage: Script to merge BED12 files *based on the thickStart/End features*.
Necessary for merging junction files such as those produced by TopHat
[-h] [--delim DELIM] [-t THREADS] [--tophat] [-o OUTPUT] bed [bed ...]
positional arguments:
bed Input BED files. Use "-" for stdin.
optional arguments:
-h, --help show this help message and exit
--delim DELIM Delimiter for merged names. Default: ;
-t THREADS, --threads THREADS
Number of threads to use for multiprocessing. Default:
1
--tophat Flag. If set, tophat-like junction style is assumed.
This means that junctions are defined using the
blockSizes rather than thickStart/End. The script will
convert the lines to this latter format. By default,
the script assumes that the intron start/end are
defined using thickStart/End like in portcullis.
Mixed-type input files are not supported.
-o OUTPUT, --output OUTPUT
Output file. Default: stdout.
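The merging strategy - treating the chromosome, strand and thickStart/thickEnd coordinates as a unique key, and joining the names of duplicates - can be sketched as follows. This is a simplified model: real BED12 records carry twelve columns, and only the relevant fields (represented here as dictionaries) appear in the example:

```python
def merge_junctions(junctions, delim=";"):
    """Collapse junctions that share chromosome, strand and thick start/end.
    Each junction is a dict with keys: chrom, strand, thick_start,
    thick_end, name. Names of merged duplicates are joined with delim."""
    merged = {}
    for junction in junctions:
        key = (junction["chrom"], junction["strand"],
               junction["thick_start"], junction["thick_end"])
        if key in merged:
            # same intron seen again: keep one record, accumulate the names
            merged[key]["name"] += delim + junction["name"]
        else:
            merged[key] = dict(junction)
    return list(merged.values())
```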
Quick script to remove sequences from a given organism from SwissProt files, and print them out in FASTA format. Used to produce the BLAST datasets for the Mikado paper. Usage:
$ remove_from_embl.py -h
usage: Script to remove sequences specific of a given organism from a SwissProt file.
[-h] -o ORGANISM [--format {fasta}] input [out]
positional arguments:
input
out
optional arguments:
-h, --help show this help message and exit
-o ORGANISM, --organism ORGANISM
Organism to be excluded
--format {fasta} Output format. Choices: fasta. Default: fasta.
Simple script to clean the headers of FASTA files, so as to avoid runtime errors and incongruities with BLAST and other tools which might be sensitive to long descriptions or the presence of special characters.
Usage:
$ sanitize_blast_db.py --help
usage: sanitize_blast_db.py [-h] [-o OUT] fasta [fasta ...]
positional arguments:
fasta
optional arguments:
-h, --help show this help message and exit
-o OUT, --out OUT
This script is used to split a FASTA file into a fixed number of files, with an approximately equal number of sequences in each. If the number of sequences in the input file is lower than the number of requested splits, the script will create the necessary number of empty files. Used by the Daijin pipeline to prepare the input data for the BLAST analysis. Usage:
$ split_fasta.py --help
usage: Script to split FASTA sequences in a fixed number of multiple files.
[-h] [-m NUM_FILES] fasta [out]
positional arguments:
fasta Input FASTA file.
out Output prefix. Default: filename+split
optional arguments:
-h, --help show this help message and exit
-m NUM_FILES, --num-files NUM_FILES
Number of files to create. Default: 1000
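The underlying logic amounts to distributing records across a fixed number of bins as evenly as possible. One simple strategy, sketched below with invented names, is round-robin assignment; the actual script may instead assign contiguous chunks, but the size guarantees are the same:

```python
def partition(records, num_files):
    """Distribute records round-robin across num_files bins, so that bin
    sizes differ by at most one. When there are fewer records than
    requested files, the surplus bins come back empty (mirroring the
    script's behaviour of creating empty output files)."""
    bins = [[] for _ in range(num_files)]
    for index, record in enumerate(records):
        bins[index % num_files].append(record)
    return bins
```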
This script parses an annotation file and truncates any transcript whose UTR introns exceed the provided threshold. In such cases, the UTR section after the long intron is simply removed. Usage:
$ trim_long_introns.py --help
usage: This script truncates transcript with UTR exons separated by long introns.
[-h] [-mi MAX_INTRON] gff [out]
positional arguments:
gff
out
optional arguments:
-h, --help show this help message and exit
-mi MAX_INTRON, --max-intron MAX_INTRON
Maximum intron length for UTR introns.
Finally, Mikado provides some dedicated utilities to perform common tasks.
- Some of the utilities are integral to the Mikado suite and can be accessed as subcommands of Mikado. These utilities comprise programs to calculate annotation statistics, retrieve or exclude specific loci from a file, etc.
- Other utilities are provided as stand-alone scripts; while some of them directly depend on the Mikado library, this is not necessarily the case for them all.
Mikado core algorithms¶
Picking transcripts: how to define loci and their members¶
The Mikado pick algorithm
Schematic representation of how Mikado unbundles a complex locus into two separate genes.
Transcripts are scored and selected according to user-defined rules, based on many different features of the transcripts themselves (cDNA length, CDS length, UTR fraction, number of reliable junctions, etc.; please see the dedicated section on scoring for details on the scoring algorithm).
The detection and analysis of a locus proceeds as follows:
When the first transcript is detected, Mikado will create a superlocus - a container of transcripts sharing the same genomic location - and assign the transcript to it.
While traversing the genome, as long as any new transcript is within the maximum allowed flanking distance, it will be added to the superlocus.
- When the last transcript is added, Mikado performs the following preliminary operations:
- Integrate all the data from the database (including ORFs, reliable junctions in the region, and BLAST homology).
- If a transcript is monoexonic, assign or reverse its strand if the ORF data supports the decision
- If requested and the ORF data supports the operation, split chimeric transcripts - ie those that contain two or more non-overlapping ORFs on the same strand.
- Split the superlocus into groups of transcripts that:
- share the same strand
- have at least 1bp overlap
- Analyse each of these novel “stranded” superloci separately.
- Create subloci, i.e. group transcripts so as to minimize the probability of mistakenly merging multiple gene loci due to chimeras. These groups are defined as follows:
- if the transcripts are multiexonic, they must share at least one intron, inclusive of the borders
- if the transcripts are monoexonic, they must overlap by at least 1bp.
- Monoexonic and multiexonic transcripts cannot be part of the same sublocus.
- Select the best transcript inside each sublocus:
- Score the transcripts (see the section on scoring)
- Select as winner the transcript with the highest score and assign it to a monosublocus
- Discard any transcript which is overlapping with it, according to the definitions in the point above
- Repeat the procedure from point 2 until no transcript remains in the sublocus
- Monosubloci are gathered together into monosubloci holders, i.e. the seeds for the gene loci. Monosubloci holders have more lenient parameters to group transcripts, as the first phase should have already discarded most chimeras. Once a holder has been seeded with its first monosublocus, any subsequent candidate monosublocus will be integrated only if the following conditions are satisfied:
- if the candidate is monoexonic, its exon must overlap at least one exon of a transcript already present in the holder
- if the candidate is multiexonic and the holder contains only monoexonic transcripts, apply the same criterion, ie check whether its exons overlap the exons of at least one of the transcripts already present
- if the candidate is multiexonic and the holder contains multiexonic transcripts, check whether one of the following conditions is satisfied:
- at least one intron of the candidate overlaps with an intron of a transcript in the holder
- at least one intron of the candidate is completely contained within an exon of a transcript in the holder
- at least one intron of a transcript in the holder is completely contained within an exon of the candidate.
- the cDNA overlap and CDS overlap between the candidate and the transcript in the holder are over a specified threshold.
Optionally, it is possible to tell Mikado to use a simpler algorithm, and integrate together all transcripts that share exon space. Such a simpler algorithm risks, however, chaining together multiple loci - especially in small, compact genomes.
Once the holders are created, apply the same scoring and selection procedure of the sublocus selection step. The winning transcripts are assigned to the final loci. These are called the primary transcripts of the loci.
- Once the loci are created, track back to the original transcripts of the superlocus:
- discard any transcript overlapping more than one locus, as these are probably chimeras.
- For those transcripts that are overlapping to a single locus, verify that they are valid alternative splicing events using the class code of the comparison against the primary transcript. Transcripts are re-scored dynamically when they are re-added in this fashion, to ensure their quality when compared with the primary transcript.
- Finally detect and either tag or discard fragments inside the initial superlocus (irrespective of strand):
- Check whether the primary transcript of any locus meets the criteria to be defined as a fragment (by default, a maximum ORF of 30AA and at most 2 exons; any transcript exceeding either criterion will not be considered a fragment)
- If so, verify whether they are near enough any valid locus to be considered as a fragment (in general, class codes which constitute the “Intronic”, “Fragmentary” and “No overlap” categories).
- If these conditions are met, tag the locus as a fragment. If requested, Mikado will just discard these transcripts (advised).
These steps help Mikado identify and solve fusions, detect correctly the gene loci, and define valid alternative splicing events.
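The select-and-discard loop used during the sublocus stage (score the transcripts, pick the winner, remove everything overlapping it, repeat) can be sketched as follows. Here score and overlaps are hypothetical stand-ins for Mikado's actual scoring function and overlap test:

```python
def pick_monosubloci(transcripts, score, overlaps):
    """Iteratively select the best-scoring transcript, discard every
    transcript overlapping it, and repeat until none remain.
    Returns the winners, i.e. the seeds of the monosubloci."""
    remaining = list(transcripts)
    winners = []
    while remaining:
        best = max(remaining, key=score)
        winners.append(best)
        # everything overlapping the winner is removed from contention
        remaining = [t for t in remaining
                     if t is not best and not overlaps(best, t)]
    return winners
```

With interval-like transcripts, two non-overlapping candidates each win their own monosublocus, while anything overlapping a higher-scoring transcript is discarded.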
Transcript measurements and scoring¶
In order to determine the best transcript for each locus, Mikado measures each available candidate according to various metrics and assigns a specific score for each of them. Similarly to RAMPART [Rampart], Mikado will assign a score to each transcript for each metric by assessing it relative to the other transcripts in the locus. The particular rescaling equation used for a given metric depends on the type of feature it represents:
- metrics where higher values represent better transcript assemblies (“maximum”).
- metrics where lower values represent better transcript assemblies (“minimum”)
- metrics where values closer to a defined value represent better assemblies (“target”)
To allow for this tripartite scoring system with disparate input values, we employ rescaling equations so that each metric for each transcript will be assigned a score between 0 and 1. Optionally, each metric may be assigned a greater weight so that its maximum possible value will be greater or smaller than 1. Formally, let metric \(m\) be one of the available metrics \(M\), \(t\) a transcript in locus \(L\), \(w_{m}\) the weight assigned to metric \(m\), and \(r_{mt}\) the raw value of metric \(m\) for \(t\). Then the score for metric \(m\) for transcript \(t\), \(s_{mt}\), will be derived using one of the following three rescaling equations:
- If higher values are best:
- \(s_{mt} = w_{m} * (\frac{r_{mt} - min(r_m)}{max(r_m)-min(r_m)})\)
- If lower values are best:
- \(s_{mt} = w_{m} * (1 - \frac{r_{mt} - min(r_m)}{max(r_m)-min(r_m)})\)
- If values closer to a target \(v_{m}\) are best:
- \(s_{mt} = w_{m} * (1 - \frac{|r_{mt} - v_{m}|}{max(|r_{m} - v_{m}|)})\)
- Finally, the scores for each metric will be summed up to produce a final score for the transcript:
- \(s_{t} = \sum_{m \in M} s_{mt}\).
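The three rescaling equations translate directly into code. The following is a literal transcription of the formulas above (not Mikado's internal implementation), with an extra guard for the degenerate case where all transcripts in the locus share the same value:

```python
def score_metric(raw, values, weight=1.0, rescaling="max", target=None):
    """Rescale a raw metric value against the values of all transcripts
    in the locus, returning a score between 0 and weight.
    rescaling: "max" (higher is better), "min" (lower is better),
    or "target" (closer to target is better)."""
    lowest, highest = min(values), max(values)
    if rescaling == "target":
        denominator = max(abs(v - target) for v in values)
        if denominator == 0:  # every transcript hits the target exactly
            return weight
        return weight * (1 - abs(raw - target) / denominator)
    if highest == lowest:  # all transcripts identical for this metric
        return weight
    fraction = (raw - lowest) / (highest - lowest)
    if rescaling == "min":
        fraction = 1 - fraction
    return weight * fraction
```

The per-transcript score \(s_{t}\) is then simply the sum of these per-metric scores.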
Not all the available metrics will be necessarily used for scoring; the choice of which to employ and how to score and weight each of them is left to the experimenter, although Mikado provides some pre-configured scoring files.
Important
The scoring algorithm is dependent on the other transcripts in the locus, so each score should not be taken as an absolute measure of the reliability of a transcript, but rather as a measure of its relative goodness compared with the alternatives. Shifting a transcript from one locus to another can have dramatic effects on the scoring of a transcript, even while the underlying metric values remain unchanged. This is why the score assigned to each transcript changes throughout the Mikado run, as transcripts are moved to subloci, monoloci and finally loci.
Note
Starting from beta8, Mikado allows for metrics whose value range is between 0 and 1 to be used directly as scores.
Scoring files¶
Mikado employs user-defined configuration files to define the desirable features in genes. These files are in either YAML or JSON format (default YAML) and are composed of four sections:
- a requirements section, specifying the minimum requirements that a transcript must satisfy to be considered as valid. Any transcript failing these requirements will be scored at 0 and purged.
- a not_fragmentary section, specifying the minimum requirements that the primary transcript of a locus has to satisfy in order for the locus not to be considered as a putative fragment.
- an as_requirements section, which specifies the minimum requirements for transcripts for them to be considered as possible valid alternative splicing events.
- a scoring section, specifying which features Mikado should look for in transcripts, and how each of them will be weighted.
Conditions are specified using a strict set of available operators and the values they have to consider.
Important
Although at the moment Mikado does not offer any method to establish machine-learning based scoring configurations, it is a topic we plan to investigate in the future. Mikado already supports Random Forest Regressors as scorers through Scikit-learn, but we have yet to devise a proper way to create such regressors.
Operators¶
Mikado allows the following operators to express a relationship inside the scoring files:
- eq: equal to (\(=\)). Valid for comparisons with numbers, boolean values, and strings.
- ne: different from (\(\neq\)). Valid for comparisons with numbers, boolean values, and strings.
- lt: less than (\(<\)). Valid for comparisons with numbers.
- gt: greater than (\(>\)). Valid for comparisons with numbers.
- le: less or equal than (\(\le\)). Valid for comparisons with numbers.
- ge: greater or equal than (\(\ge\)). Valid for comparisons with numbers.
- in: member of (\(\in\)). Valid for comparisons with arrays or sets.
- not in: not member of (\(\notin\)). Valid for comparisons with arrays or sets.
- within: value comprised in the range of the two values, inclusive.
- not within: value not comprised in the range of the two values, inclusive.
Mikado will fail if an operator not present on this list is specified, or if the operator is assigned to compare against the wrong data type (e.g. eq with an array).
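These operators map onto simple Python predicates. A sketch of how such a table might look (illustrative, not Mikado's actual implementation) follows; note that within/not within compare against a two-value inclusive range:

```python
import operator

OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "gt": operator.gt,
    "le": operator.le,
    "ge": operator.ge,
    "in": lambda value, container: value in container,
    "not in": lambda value, container: value not in container,
    # within/not within: value inside (or outside) the inclusive range
    "within": lambda value, bounds: min(bounds) <= value <= max(bounds),
    "not within": lambda value, bounds: not (min(bounds) <= value <= max(bounds)),
}

def evaluate(value, op_name, operand):
    """Apply one of the allowed operators; an unknown name raises KeyError,
    mirroring Mikado's refusal of unlisted operators."""
    return OPERATORS[op_name](value, operand)
```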
The “requirements”, “as_requirements” and “not_fragmentary” sections¶
These sections specify the minimum requirements for a transcript at various stages.
- A transcript failing to pass the requirements check will be discarded outright (if “purge” is selected) or given a score of 0 otherwise.
- If a transcript has not been selected as the primary transcript of a locus, it has to pass the as_requirements check to be considered as a valid alternative splicing event.
- Finally, after loci have been defined, the primary transcripts of loci that do not pass the not_fragmentary section mark their loci to be compared against neighbouring loci which have passed this same check.
It is strongly advised to use lenient parameters in the requirements section, as failing to do so might result in whole loci being discarded. Typically, transcripts filtered at this step should be obvious fragments, e.g. monoexonic transcripts produced by RNA-Seq with a total length lower than the library fragment length. This section is composed of two parts:
- parameters: a list of the metrics to be considered. Each metric can be considered multiple times by suffixing it with a ".<id>" construct (e.g. cdna_length.mono vs. cdna_length.multi to distinguish two uses of the cdna_length metric, once for monoexonic and once for multiexonic transcripts). Any parameter which is not a valid metric name, after removal of the suffix, will cause an error. Each parameter has to specify the following:
- a value that the metric has to be compared against
- an operator that specifies the target operation. See the operators section.
- expression: a string array that will be compiled into a valid boolean expression. All the metrics present in the expression string must be present in the parameters section. If an unrecognized metric is present, Mikado will crash immediately, complaining that the scoring file is invalid. Apart from brackets, Mikado accepts only the following boolean operators to chain the metrics:
- and
- or
- not
- xor
Hint
if no expression is specified, Mikado will construct one by chaining all the provided parameters with the and operator. Most of the time, this results in unexpected behaviour, i.e. Mikado assigning a score of 0 to most transcripts. It is strongly advised to explicitly provide a valid expression.
As an example, the following snippet replicates a typical requirements section found in a scoring file:
requirements:
expression: [((exon_num.multi and cdna_length.multi and max_intron_length and min_intron_length), or,
(exon_num.mono and cdna_length.mono))]
parameters:
cdna_length.mono: {operator: gt, value: 50}
cdna_length.multi: {operator: ge, value: 100}
exon_num.mono: {operator: eq, value: 1}
exon_num.multi: {operator: gt, value: 1}
max_intron_length: {operator: le, value: 200000}
min_intron_length: {operator: ge, value: 5}
In the parameters section, we ask for the following:
- exon_num.mono: monoexonic transcripts must have one exon (“eq”)
- exon_num.multi: multiexonic transcripts must have more than one exon (“gt”)
- cdna_length.mono: monoexonic transcripts must have a length greater than 50 bps (the “.mono” suffix is arbitrary, as long as it is unique for all calls of cdna_length)
- cdna_length.multi: multiexonic transcripts must have a length greater than or equal to 100 bps (the “.multi” suffix is arbitrary, as long as it is unique for all calls of cdna_length)
- max_intron_length: multiexonic transcripts should not have any intron longer than 200,000 bps.
- min_intron_length: multiexonic transcripts should not have any intron smaller than 5 bps.
The expression field will be compiled into the following expression:
(exon_num > 1 and cdna_length >= 100 and max_intron_length <= 200000 and min_intron_length >= 5) or (exon_num == 1 and cdna_length > 50)
Any transcript for which the expression evaluates to \(False\) will be assigned a score of 0 outright and discarded, unless the user has chosen to disable the purging of such transcripts.
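To make the evaluation concrete, here is a hedged sketch of how the compiled expression above could be applied to a transcript's metric values. The passes_requirements function and the plain-dictionary representation of a transcript are inventions of this example, not Mikado's API:

```python
def passes_requirements(metrics):
    """Evaluate the compiled requirements expression shown above against
    a dictionary of metric values. Illustrative sketch only."""
    multiexonic_ok = (metrics["exon_num"] > 1
                      and metrics["cdna_length"] >= 100
                      and metrics["max_intron_length"] <= 200000
                      and metrics["min_intron_length"] >= 5)
    monoexonic_ok = (metrics["exon_num"] == 1
                     and metrics["cdna_length"] > 50)
    return multiexonic_ok or monoexonic_ok

# A three-exon transcript with sane introns passes the check
transcript = {"exon_num": 3, "cdna_length": 500,
              "max_intron_length": 1200, "min_intron_length": 80}
```

A monoexonic transcript of 40 bps, by contrast, would fail both branches and be purged or zero-scored.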
The scoring section¶
This section specifies which metrics will be used by Mikado to score the transcripts. Each metric to be used is specified as a subsection of the configuration, and will have the following attributes:
- rescaling: the only compulsory attribute. It specifies the kind of scoring that will be applied to the metric, and it has to be one of “max”, “min”, or “target”. See the explanation on the scoring algorithm for details.
- value: compulsory if the chosen rescaling algorithm is “target”. This should be either a number or a boolean value.
- multiplier: the weight assigned to the metric in terms of scoring. This parameter is optional; if absent, as in the majority of cases, Mikado will consider the multiplier to be equal to 1. This is the \(w_{m}\) element in the equations above.
- filter: it is possible to specify a filter which the metric has to fulfill to be considered for scoring, e.g. "cdna_length >= 200". If the transcript fails to pass this filter, the score for this metric only will be set to 0. A "filter" subsection has to specify the following:
- operator: the operator to apply for the boolean expression. See the relative section.
- value: value that will be used to assess the metric.
Hint
the purpose of the filter section is to allow fine-tuning of the scoring mechanism; i.e. it allows Mikado to penalise transcripts with undesirable qualities (e.g. a possible retained intron) without discarding them outright. As such, it is a less harsh version of the requirements section and it is the preferred way of specifying which transcript features Mikado should be wary of.
For example, this is a snippet of a scoring section:
scoring:
blast_score: {rescaling: max}
cds_not_maximal: {rescaling: min}
combined_cds_fraction: {rescaling: target, value: 0.8, multiplier: 2}
five_utr_length:
filter: {operator: le, value: 2500}
rescaling: target
value: 100
end_distance_from_junction:
filter: {operator: lt, value: 55}
rescaling: min
Using this snippet as a guide, Mikado will score transcripts in each locus as follows:
- Assign a full score (one point, as no multiplier is specified) to transcripts which have the greatest blast_score
- Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest amount of CDS bases in secondary ORFs (cds_not_maximal)
- Assign a full score (two points, as a multiplier of 2 is specified) to transcripts that have a total amount of CDS bps approximating 80% of the transcript cDNA length (combined_cds_fraction)
- Assign a full score (one point, as no multiplier is specified) to transcripts that have a 5’ UTR whose length is nearest to 100 bps (five_utr_length); if the 5’ UTR is longer than 2,500 bps, this score will be 0 (see the filter section)
- Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest distance between the CDS end and the most downstream exon-exon junction (end_distance_from_junction). If such a distance is greater than 55 bps, assign a score of 0, as it is a probable target for NMD (see the filter section).
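The rescaling behaviour described above can be sketched as follows. This follows the documented min/max/target semantics (the best transcript for a metric gets the full multiplier, the worst gets 0), but it is an illustrative sketch, not Mikado's actual code:

```python
def rescale(values, rescaling, target=None, multiplier=1.0):
    """Rescale one metric across the transcripts of a locus.

    values: the metric value for each transcript in the locus.
    Returns one score per transcript, in [0, multiplier].
    Illustrative sketch of the documented behaviour only.
    """
    if rescaling == "target":
        distances = [abs(v - target) for v in values]
        worst = max(distances)
        if worst == 0:  # every transcript hits the target exactly
            return [multiplier for _ in values]
        return [multiplier * (1 - d / worst) for d in distances]
    lowest, highest = min(values), max(values)
    if highest == lowest:  # no spread: every transcript gets the full score
        return [multiplier for _ in values]
    if rescaling == "max":
        return [multiplier * (v - lowest) / (highest - lowest) for v in values]
    # rescaling == "min": lower values score higher
    return [multiplier * (highest - v) / (highest - lowest) for v in values]
```

For instance, with combined_cds_fraction values of 0.5 and 0.8 and a target of 0.8 with multiplier 2, the second transcript would receive the full 2 points.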
Metrics¶
These are all the metrics available to quantify transcripts. The documentation for this section has been generated using the metrics utility.
Metrics belong to one of the following categories:
- Descriptive: these metrics merely provide a description of the transcript (eg its ID) and are not used for scoring.
- cDNA: these metrics refer to basic features of any transcript such as its number of exons, its cDNA length, etc.
- Intron: these metrics refer to features related to the number of introns and their lengths.
- CDS: these metrics refer to features related to the CDS assigned to the transcript.
- UTR: these metrics refer to features related to the UTR of the transcript. In the case in which a transcript has been assigned multiple ORFs, unless otherwise stated the UTR metrics will be derived only considering the selected ORF, not the combination of all of them.
- Locus: these metrics refer to features of the transcript in relationship to all other transcripts in its locus, eg how many of the introns present in the locus are present in the transcript. These metrics are calculated by Mikado during the picking phase, and as such their value can vary during the different stages as the transcripts are shifted to different groups.
- External: these metrics are derived from accessory data that is recovered for the transcript during the run. Examples include the number of introns confirmed by external programs such as Portcullis, or the BLAST score of the best hits.
Hint
Starting from version 1 beta8, Mikado allows the use of externally defined metrics for the transcripts. These can be accessed using the keyword "external.<name of the metric>" within the configuration file. See the relevant section for details.
Important
Starting from Mikado 1 beta 8, it is possible to use metrics with values between 0 and 1 directly as scores, without rescaling. This feature is available only for metrics whose values naturally lie between 0 and 1, or that are boolean in nature.
Metric name | Description | Category | Data type | Usable raw |
---|---|---|---|---|
tid | ID of the transcript - cannot be an undefined value. Alias of id. | Descriptive | str | False |
parent | Name of the parent feature of the transcript. | Descriptive | str | False |
score | Numerical value which summarizes the reliability of the transcript. | Descriptive | str | False |
external_scores | SPECIAL: this Namespace contains all the information regarding external scores for the transcript. If a requested property is not defined in the Namespace, Mikado will set a default value of 0 in the Namespace and return it. | External | Namespace | True |
best_bits | Metric that returns the best BitS associated with the transcript. | External | float | False |
blast_identity | This metric will return the alignment identity for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. | External | float | True |
blast_query_coverage | This metric will return the query coverage for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. | External | float | True |
blast_score | Interchangeable alias for testing different blast-related scores. Current: best bit score. | External | float | False |
blast_target_coverage | This metric will return the target coverage for the best BLAST hit according to the evalue. If no BLAST hits are available for the sequence, it will return 0. | External | float | True |
canonical_intron_proportion | This metric returns the proportion of canonical introns of the transcript on its total number of introns. | Intron | float | True |
cdna_length | This property returns the length of the transcript. | cDNA | int | False |
cds_not_maximal | This property returns the length of the CDS excluded from the selected ORF. | CDS | int | False |
cds_not_maximal_fraction | This property returns the fraction of bases not in the selected ORF compared to the total number of CDS bases in the cDNA. | CDS | float | True |
combined_cds_fraction | This property returns the percentage of the CDS part of the transcript vs. the cDNA length | CDS | float | True |
combined_cds_intron_fraction | This property returns the fraction of CDS introns of the transcript vs. the total number of CDS introns in the Locus. If the transcript is by itself, it returns 1. | Locus | float | True |
combined_cds_length | This property returns the length of the CDS part of the transcript. | CDS | int | False |
combined_cds_locus_fraction | This metric returns the fraction of CDS bases of the transcript vs. the total of CDS bases in the locus. | Locus | float | True |
combined_cds_num | This property returns the number of non-overlapping CDS segments in the transcript. | CDS | int | False |
combined_cds_num_fraction | This property returns the fraction of non-overlapping CDS segments in the transcript vs. the total number of exons | CDS | float | True |
combined_utr_fraction | This property returns the fraction of the cDNA which is not coding according to any ORF. Complement of combined_cds_fraction | UTR | float | True |
combined_utr_length | This property returns the length of the UTR part of the transcript. | UTR | int | False |
end_distance_from_junction | This metric returns the cDNA distance between the stop codon and the last junction in the transcript. In many eukaryotes, this distance cannot exceed 50-55 bps otherwise the transcript becomes a target of NMD. If the transcript is not coding or there is no junction downstream of the stop codon, the metric returns 0. This metric considers the combined CDS end. | CDS | int | False |
end_distance_from_tes | This property returns the distance of the end of the combined CDS from the transcript end site. If no CDS is defined, it defaults to 0. | CDS | int | False |
exon_fraction | This property returns the fraction of exons of the transcript which are contained in the sublocus. If the transcript is by itself, it returns 1. Set from outside. | Locus | float | True |
exon_num | This property returns the number of exons of the transcript. | cDNA | int | False |
five_utr_length | Returns the length of the 5’ UTR of the selected ORF. | UTR | int | False |
five_utr_num | This property returns the number of 5’ UTR segments for the selected ORF. | UTR | int | False |
five_utr_num_complete | This property returns the number of 5’ UTR segments for the selected ORF, considering only those which are complete exons. | UTR | int | False |
has_start_codon | Boolean. True if the selected ORF has a start codon. | CDS | bool | False |
has_stop_codon | Boolean. True if the selected ORF has a stop codon. | CDS | bool | False |
highest_cds_exon_number | This property returns the maximum number of CDS segments among the ORFs; this number can refer to an ORF DIFFERENT from the maximal ORF. | CDS | int | False |
highest_cds_exons_num | Returns the number of CDS segments in the selected ORF (irrespective of the number of exons involved) | CDS | int | False |
intron_fraction | This property returns the fraction of introns of the transcript vs. the total number of introns in the Locus. If the transcript is by itself, it returns 1. Set from outside. | Locus | float | True |
is_complete | Boolean. True if the selected ORF has both start and end. | CDS | bool | False |
max_exon_length | This metric will return the length of the biggest exon in the transcript. | cDNA | int | False |
max_intron_length | This property returns the greatest intron length for the transcript. | Intron | int | False |
min_exon_length | This metric will return the length of the smallest exon in the transcript. | cDNA | int | False |
min_intron_length | This property returns the smallest intron length for the transcript. | Intron | int | False |
non_verified_introns_num | This metric returns the number of introns of the transcript which are not validated by external data. | External | int | False |
num_introns_greater_than_max | This metric returns the number of introns greater than the maximum acceptable intron size indicated in the constructor. | Intron | int | False |
num_introns_smaller_than_min | This metric returns the number of introns smaller than the mininum acceptable intron size indicated in the constructor. | Intron | int | False |
number_internal_orfs | This property returns the number of ORFs inside a transcript. | CDS | int | False |
only_non_canonical_splicing | This metric will return True if the canonical_number is 0. | Intron | bool | False |
proportion_verified_introns | This metric returns, as a fraction, how many of the transcript introns are validated by external data. Monoexonic transcripts are set to 1. | External | float | True |
proportion_verified_introns_inlocus | This metric returns, as a fraction, how many of the verified introns inside the Locus are contained inside the transcript. | Locus | float | True |
retained_fraction | This property returns the fraction of the cDNA which is contained in retained introns. | Locus | float | True |
retained_intron_num | This property records the number of introns in the transcripts which are marked as being retained. See the corresponding method in the sublocus class. | Locus | int | False |
selected_cds_exons_fraction | Returns the fraction of CDS segments in the selected ORF (irrespective of the number of exons involved) | CDS | float | True |
selected_cds_fraction | This property calculates the fraction of the selected CDS vs. the cDNA length. | CDS | float | True |
selected_cds_intron_fraction | This property returns the fraction of CDS introns of the selected ORF of the transcript vs. the total number of CDS introns in the Locus (considering only the selected ORF). If the transcript is by itself, it should return 1. | CDS | float | True |
selected_cds_length | This property calculates the length of the CDS selected as best inside the cDNA. | CDS | int | False |
selected_cds_locus_fraction | This metric returns the fraction of CDS bases of the transcript vs. the total of CDS bases in the locus. | Locus | float | True |
selected_cds_num | This property calculates the number of CDS exons for the selected ORF | CDS | int | False |
selected_cds_number_fraction | This property returns the proportion of best possible CDS segments vs. the number of exons. See selected_cds_number. | CDS | float | False |
selected_end_distance_from_junction | This metric returns the distance between the stop codon and the last junction of the transcript. In many eukaryotes, this distance cannot exceed 50-55 bps, otherwise the transcript becomes a target of NMD. If the transcript is not coding or there is no junction downstream of the stop codon, the metric returns 0. | CDS | int | False |
selected_end_distance_from_tes | This property returns the distance of the end of the best CDS from the transcript end site. If no CDS is defined, it defaults to 0. | CDS | int | False |
selected_start_distance_from_tss | This property returns the distance of the start of the best CDS from the transcript start site. If no CDS is defined, it defaults to 0. | CDS | int | False |
snowy_blast_score | Metric that indicates how good a hit is compared to the competition, in terms of BLAST similarities. As in SnowyOwl, the score for each hit is calculated by taking the coverage of the target and dividing it by (2 * len(self.blast_hits)). IMPORTANT: when splitting transcripts by ORF, a BLAST hit is added to the new transcript only if it is contained within the new transcript. This will somewhat skew the homology score. | External | float | False |
source_score | This metric returns a score that is assigned to the transcript in virtue of its origin. | External | int | False |
start_distance_from_tss | This property returns the distance of the start of the combined CDS from the transcript start site. If no CDS is defined, it defaults to 0. | CDS | int | False |
suspicious_splicing | This metric will return True if the transcript either has canonical introns on both strands (probably a chimeric artifact between two neighbouring loci), or has no canonical splicing event but would have if it were assigned to the opposite strand (probably a strand misassignment on the part of the assembler/predictor). | Intron | bool | False |
three_utr_length | Returns the length of the 3’ UTR of the selected ORF. | UTR | int | False |
three_utr_num | This property returns the number of 3’ UTR segments (referred to the selected ORF). | UTR | int | False |
three_utr_num_complete | This property returns the number of 3’ UTR segments for the selected ORF, considering only those which are complete exons. | UTR | int | False |
utr_fraction | This property calculates the length of the UTR of the selected ORF vs. the cDNA length. | UTR | float | True |
utr_length | Returns the sum of the 5’+3’ UTR lengths | UTR | int | False |
utr_num | Returns the number of UTR segments (referred to the selected ORF). | UTR | int | False |
utr_num_complete | Returns the number of UTR segments which are complete exons (referred to the selected ORF). | UTR | int | False |
verified_introns_num | This metric returns the number of introns of the transcript which are validated by external data. | External | int | False |
External metrics¶
Starting from version 1 beta 8, Mikado allows external metrics to be loaded into the database and used for evaluating transcripts. Metrics of this kind must have a value comprised between 0 and 1.
Technical details¶
Most of the selection (i.e. "pick") stage of the pipeline relies on the implementation of the objects in the loci submodule. In particular, the library defines an abstract class, "Abstractlocus", which requires all its children to implement a version of the "is_intersecting" method. Each implementation of the method is specific to the stage: the superlocus class requires only overlap between the transcripts, optionally with a flank and optionally restricting groups to transcripts that share the same strand; the sublocus class implements a different algorithm; and so on. Scoring is performed by first recalculating the metrics (.calculate_metrics) and subsequently calculating the scores (.calculate_scores). Mikado tries to cache metrics and scores and avoid recalculating them as much as possible, to make the program faster.
Metrics are an extension of the property construct in Python3. Compared to normal properties, they are distinguished only by three optional descriptive attributes: category, usable_raw, and rtype. The main reason to subclass property is to allow Mikado to be aware of which properties will be used for scoring transcripts, and which will not. So, for example, in the following snippet from the Transcript class definition:
@property
def combined_cds(self):
    """This is a list which contains all the non-overlapping CDS
    segments inside the cDNA. The list comprises the segments
    as duples (start, end)."""
    return self.__combined_cds

@combined_cds.setter
def combined_cds(self, combined):
    """
    Setter for combined_cds. It performs some basic checks,
    e.g. that all the members of the list are integer duples.

    :param combined: list
    :type combined: list[(int, int)]
    """
    if ((not isinstance(combined, list)) or
            any(self.__wrong_combined_entry(comb) for comb in combined)):
        raise TypeError("Invalid value for combined CDS: {0}".format(combined))
    self.__combined_cds = combined

@Metric
def combined_cds_length(self):
    """This property returns the length of the CDS part of the transcript."""
    c_length = sum(c[1] - c[0] + 1 for c in self.combined_cds)
    if len(self.combined_cds) > 0:
        assert c_length > 0
    return c_length

combined_cds_length.category = "CDS"

@Metric
def combined_cds_num(self):
    """This property returns the number of non-overlapping CDS segments
    in the transcript."""
    return len(self.combined_cds)

combined_cds_num.category = "CDS"

@Metric
def has_start_codon(self):
    """Boolean. True if the selected ORF has a start codon.

    :rtype: bool"""
    return self.__has_start_codon

@has_start_codon.setter
def has_start_codon(self, value):
    """Setter. Checks that the argument is boolean.

    :param value: boolean flag
    :type value: bool
    """
    if value not in (None, False, True):
        raise TypeError(
            "Invalid value for has_start_codon: {0}".format(type(value)))
    self.__has_start_codon = value

has_start_codon.category = "CDS"
Mikado will recognize that "combined_cds" is a normal property, while "combined_cds_length", "combined_cds_num" and "has_start_codon" are Metrics (and as such, we assign each of them a "category"; by default, that attribute will be None). Please note that Metrics behave and are coded like normal properties in every other regard, including docstrings and setters/deleters.
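A minimal sketch of how such a Metric class can be built by subclassing property follows; this is illustrative only, and Mikado's real class may carry additional logic. The toy Transcript class and its hard-coded CDS segments are inventions of this example:

```python
class Metric(property):
    """A property subclass carrying the three optional descriptive
    attributes discussed above. Sketch only."""
    category = None
    usable_raw = False
    rtype = None

class Transcript:
    @property
    def combined_cds(self):
        # Hard-coded (start, end) duples for illustration
        return [(101, 300), (401, 500)]

    @Metric
    def combined_cds_length(self):
        """Length of the CDS part of the transcript."""
        return sum(end - start + 1 for start, end in self.combined_cds)

# Annotate the metric, as in the snippet above
Transcript.combined_cds_length.category = "CDS"

# Because Metric subclasses property, code can discover which class
# attributes are metrics (vs. plain properties) via isinstance:
metric_names = sorted(name for name, attr in vars(Transcript).items()
                      if isinstance(attr, Metric))
```

Here combined_cds is an ordinary property and is excluded, while combined_cds_length is discovered as a metric.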
The requirements expression is evaluated using eval.
Warning
While we took pains to ensure that the expression is properly sanitised and inspected before being passed to eval, Mikado might still prove permeable to clever code injection attacks. Do not execute Mikado with super-user privileges if you do not want to risk such attacks, and always inspect third-party YAML scoring files before execution!
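As an illustration of the kind of evaluation the warning refers to, a requirements expression can be compiled and evaluated against a namespace of metric values; stripping __builtins__ (as below) narrows, but does not eliminate, the attack surface. This is a generic sketch, not Mikado's actual sanitisation code:

```python
# A generic sketch of evaluating a boolean requirements expression.
expression = ("(exon_num > 1 and cdna_length >= 100) or "
              "(exon_num == 1 and cdna_length > 50)")
compiled = compile(expression, "<requirements>", "eval")

def check(metrics):
    # Removing builtins reduces (but does not remove) the injection risk
    namespace = {"__builtins__": {}}
    namespace.update(metrics)
    return eval(compiled, namespace)
```

This is precisely why untrusted scoring files are dangerous: whatever the expression string contains ends up compiled and executed.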
References¶
[Cufflinks] | Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Cole Trapnell, Brian Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Jeltje van Baren, Steven Salzberg, Barbara Wold, Lior Pachter. Nature Biotechnology, 2010, doi:10.1038/nbt.1621 |
[ParsEval] | ParsEval: parallel comparison and analysis of gene structure annotations Daniel S Standage and Volker P Brendel. BMC Bioinformatics, 2012, doi:10.1186/1471-2105-13-187 |
[RGASP] | Assessment of transcript reconstruction methods for RNA-seq Tamara Steijger, Josep F Abril, Pär G Engström, Felix Kokocinski, The RGASP Consortium, Tim J Hubbard, Roderic Guigó, Jennifer Harrow & Paul Bertone. Nature Methods, 2013, doi:10.1038/nmeth.2714 |
[SnowyOwl] | SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models Ian Reid, Nicholas O’Toole, Omar Zabaneh, Reza Nourzadeh, Mahmoud Dahdouli, Mostafa Abdellateef, Paul MK Gordon, Jung Soh, Gregory Butler, Christoph W Sensen and Adrian Tsang. BMC Bioinformatics, 2014, doi:10.1186/1471-2105-15-229 |
[CuffMerge] | Identification of novel transcripts in annotated genomes using RNA-Seq Adam Roberts, Harold Pimentel, Cole Trapnell and Lior Pachter. Bioinformatics, 2011, doi:10.1093/bioinformatics/btr355 |
[Class2] | CLASS2: accurate and efficient splice variant annotation from RNA-seq reads Li Song, Sarven Sabunciyan and Liliana Florea. Nucleic Acids Research, 2016, doi:10.1093/nar/gkw158 |
[PyFaidx] | Efficient “pythonic” access to FASTA files using pyfaidx Matthew D Shirley, Zhaorong Ma, Brent S Pedersen and Sarah J Wheelan. PeerJ PrePrints 3:e1196, 2015. doi:10.7287/peerj.preprints.970v1 |
[Snake] | Snakemake—a scalable bioinformatics workflow engine Johannes Köster and Sven Rahmann. Bioinformatics, 2012, doi:10.1093/bioinformatics/bts480 |
[Trinity] | De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity Brian J Haas, et al. Nature Protocols, 2013. doi:10.1038/nprot.2013.084 |
[Blastplus] | BLAST+: architecture and applications. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BMC Bioinformatics, 2009. doi:10.1186/1471-2105-10-421 |
[STAR] | STAR: ultrafast universal RNA-seq aligner Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson and Thomas R. Gingeras. Bioinformatics, 2012. doi:10.1093/bioinformatics/bts635 |
[Hisat] | HISAT: a fast spliced aligner with low memory requirements Daehwan Kim, Ben Langmead and Stevan L Salzberg. Nature Methods, 2015. doi:10.1038/nmeth.3317 |
[TopHat2] | TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley and Steven L Salzberg. Genome Biology, 2013. doi:10.1186/gb-2013-14-4-r36 |
[StringTie] | StringTie enables improved reconstruction of a transcriptome from RNA-seq reads Mihaela Pertea, Geo M Pertea, Corina M Antonescu, Tsung-Cheng Chang, Joshua T Mendell and Steven L Salzberg. Nature Biotechnology, 2015. doi:10.1038/nbt.3122 |
[GMAP] | GMAP: a genomic mapping and alignment program for mRNA and EST sequences Thomas D. Wu and Colin K. Watanabe. Bioinformatics 2005. doi:10.1093/bioinformatics/bti310 |
[uORFs] | uAUG and uORFs in human and rodent 5′untranslated mRNAs. Michele Iacono, Flavio Mignone and Graziano Pesole. Gene, 2005. doi:10.1016/j.gene.2004.11.041 |
[PyYaml] | Pyyaml library K Simonov. http://pyyaml.org/, 2006. |
[Cython] | Cython: the best of both worlds. Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn and Kurt Smith. AIP - Computing in Science & Engineering, 2011. doi:10.1109/MCSE.2010.118 |
[Numpy] | The NumPy array: A structure for efficient numerical computation. Stefan van der Walt, S. Chris Colbert and Gael Varoquaux. Computing in Science & Engineering, 2011. doi:10.1109/MCSE.2011.37 |
[Scipy] | SciPy: Open Source Scientific Tools for Python. Eric Jones and Travis Oliphant and Pearu Peterson et al. http://www.scipy.org/*, 2001. |
[NetworkX] | Exploring network structure, dynamics, and function using NetworkX Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart. Proceedings of the 7th Python in Science Conference (SciPy2008), 2008. |
[BioPython] | Biopython: freely available Python tools for computational molecular biology and bioinformatics. PJA Cock, T Antao, JT Chang, BA Chapman, CJ Cox, A Dalke, I Friedberg, T Hamelryck, F Kauff, B Wilczynski and MJL de Hoon. Bioinformatics, 2009. doi:10.1093/bioinformatics/btp163 |
[SciKit] | Scikit-learn: Machine Learning in Python. Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot and Édouard Duchesnay. The Journal of Machine Learning Research, 2011. |
[DRMAA] | Distributed resource management application API Version 2 (DRMAA). P Tröger, R Brobst, D Gruber, M Mamonski and D Templeton. Open Grid Forum, 2012. |
[Apollo] | Web Apollo: a web-based genomic annotation editing platform Eduardo Lee, Gregg A Helt, Justin T Reese, Monica C Munoz-Torres, Chris P Childers, Robert M Buels, Lincoln Stein, Ian H Holmes, Christine G Elsik and Suzanna E Lewis. Genome Biology, 2013, doi:10.1186/gb-2013-14-8-r93 |
[Rampart] | RAMPART: a workflow management system for de novo genome assembly, Daniel Mapleson, Nizar Drou and David Swarbreck. Bioinformatics, 2015. doi:10.1093/bioinformatics/btv056 |
[Uniprot] | UniProt: a hub for protein information The UniProt Consortium. Nucleic Acid Research, 2014. doi:10.1093/nar/gku989 |
[Portcullis] | http://www.github.com/maplesod/portcullis/ |
[Oases] | Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels Marcel H Schulz, Daniel R Zerbino, Martin Vingron and Ewan Birney. Bioinformatics, 2012. doi: 10.1093/bioinformatics/bts094 |
[Bridger] | Bridger: a new framework for de novo transcriptome assembly using RNA-seq data Zheng Chang, Guojun Li, Juntao Liu, Cody Ashby, Deli Liu, Carole L Cramer and Xiuzhen Huang. Genome Biology, 2015. doi:10.1186/s13059-015-0596-2 |
[EviGeneTobacco] | Combining Transcriptome Assemblies from Multiple De Novo Assemblers in the Allo-Tetraploid Plant Nicotiana benthamiana Kenlee Nakasugi, Ross Crowhurst, Julia Bally, Peter Waterhouse. PLoS ONE, 2014. doi:10.1371/journal.pone.0091776 |
[TransAbyss] | De novo assembly and analysis of RNA-seq data Gordon Robertson et al., Nature Methods, 2010. doi:10.1038/nmeth.1517 |
[Transrate] | TransRate: reference-free quality assessment of de novo transcriptome assemblies Richard Smith-Unna, Chris Boursnell, Rob Patro, Julian M. Hibberd and Steven Kelly. Genome Research, 2016. doi:10.1101/gr.196469.115 |
[RSEMeval] | Evaluation of de novo transcriptome assemblies from RNA-Seq data Bo Li, Nathanael Fillmore, Yongsheng Bai, Mike Collins, James A Thomson, Ron Stewart and Colin N Dewey. Genome Biology, 2014. doi:10.1186/s13059-014-0553-5 |
[Maker] | MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects Carson Holt and Mark Yandell. BMC Bioinformatics, 2011. doi:10.1186/1471-2105-12-491 |
[EVM] | Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments Brian J Haas, Steven L Salzberg, Wei Zhu, Mihaela Pertea, Jonathan E Allen, Joshua Orvis, Owen White, C Robin Buell and Jennifer R Wortman. Genome Biology, 2008. doi:10.1186/gb-2008-9-1-r7 |
[Augustus] | WebAUGUSTUS—a web service for training AUGUSTUS and predicting genes in eukaryotes Katharina J. Hoff and Mario Stanke. Nucleic Acid Research, 2013. doi:10.1093/nar/gkt418 |
[Maker2] | MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects Carson Holt and Mark Yandell. BMC Bioinformatics, 2011. doi:10.1186/1471-2105-12-491 |
[PYinterval] | https://github.com/chaimleib/intervaltree |
[BXPython] | https://bitbucket.org/james_taylor/bx-python/overview |
[Snakeviz] | https://jiffyclub.github.io/snakeviz/ |