.. _F1: https://en.wikipedia.org/wiki/F1_score .. _Cuffcompare: http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/index.html .. _ParsEval: https://aegean.readthedocs.io/en/v0.16.0/parseval.html .. _Compare: Compare ======= Overview ~~~~~~~~ This Mikado utility allows the user to compare the transcripts from any two annotations. Its output allows: - To understand which reference transcript each prediction is most similar to - To understand which prediction transcript best represent each reference model - To have a summary information about the similarity between the two annotations. Mikado compare has been directly inspired by the popular `Cuffcompare`_ [Cufflinks]_ utility and by `ParsEval`_ [ParsEval]_. Please note that while superficially similar to Cuffcompare in the style of the output files, Mikado compare is more philosophically similar to ParsEval, as it will not try to aggregate transcripts in loci but will perform a pure comparison between the two annotation files. Mikado compare can accept BAM, BED12, GTF and GFF files as input, both for the reference or for the prediction. Usage ~~~~~ Mikado compare is invoked by specifying the *reference* annotation and the desired mode of analysis. There are three possible options: #. In its default mode, compare will ask for a *prediction* annotation to compare the reference against. #. In the *"self"* mode, compare will do a self-comparison of the reference against itself, excluding as possible results the matches between a transcript and itself. It can be useful to glean the relationships between transcripts and genes in an annotation. #. In the *"internal"* mode of operations, compare will again perform a self-comparison, focussed on multi-isoform genes. For those, compare will perform and report all possible comparisons. It is useful to understand the relationships between the transcripts in a single locus. Mikado stores the information of the reference in a specialised SQLite index, with a ".midx" suffix, which will be created by the program upon its first execution with a new reference. If the index file is already present, Mikado will try to use it rather than read again the annotation. .. note:: Starting from version 1.5, Mikado compare supports multiprocessing. Please note that memory usage scales approximately **linearly** with the amount of processes requested. Command line ------------ .. code-block:: bash usage: Mikado compare [-h] -r REFERENCE (-p PREDICTION | --self | --internal | --index) [--distance DISTANCE] [-pc] [-o OUT] [--lenient] [-eu] [-n] [-erm] [-upa] [-l LOG] [-v] [-z] [--processes PROCESSES] optional arguments: -h, --help show this help message and exit --distance DISTANCE Maximum distance for a transcript to be considered a polymerase run-on. Default: 2000 -pc, --protein-coding Flag. If set, only transcripts with a CDS (both in reference and prediction) will be considered. -o OUT, --out OUT Prefix for the output files. Default: mikado_compare --lenient If set, exonic statistics will be calculated leniently in the TMAP as well - ie they will consider an exon as match even if only the internal junction has been recovered. -eu, --exclude-utr Flag. If set, reference and prediction transcripts will be stripped of their UTRs (if they are coding). -n, --no-index, --no-save-index Unless this flag is set, compare will save an index of the reference to quicken multiple calls. -erm, --extended-refmap Flag. If set, the RefMap will also contain recall and precision statistics - not just the F1. -upa, --use-prediction-alias Flag. If set, Mikado Compare will use the alias - rather than the transcript ID - to report the results for prediction transcripts in the TMAP and REFMAP files. -l LOG, --log LOG -v, --verbose -z, --gzip Flag. If set, TMAP and REFMAP files will be GZipped. --processes PROCESSES Prediction and annotation files.: -r REFERENCE, --reference REFERENCE Reference annotation file. By default, an index will be crated and saved with the suffix ".midx". -p PREDICTION, --prediction PREDICTION Prediction annotation file. --self Flag. If set, the reference will be compared with itself. Useful for understanding how the reference transcripts interact with each other. --internal Flag. If set, for each gene with more than one transcript isoform each will be compared to the others. Useful for understanding the structural relationships between the transcripts in each gene. --index Flag. If set, compare will stop after having generated the GFF index for the reference. Output files ~~~~~~~~~~~~ Mikado compare produces two tabular files, tmap_ and refmap_, and one :ref:`statistics ` file. .. _tmap: TMAP files ---------- TMAP are tabular files that store the information regarding the best match for each prediction in the reference. The columns are as follows: #. **ref_id**: Transcript ID of the matched reference model(s). #. **ref_gene**: Gene ID of the matched reference model(s). #. **ccode**: class code of the match. See :ref:`the relevant section on Class codes `. #. **tid**: Transcript ID of the prediction model. #. **gid**: Gene ID of the prediction model. #. **tid_num_exons**: Number of exons of the prediction model. #. **ref_num_exons**: Number of exons of the reference model. #. **n_prec**: Nucleotide precision of the prediction ( TP / (length of the prediction)) #. **n_recall**: Nucleotide recall of the reference (TP / (length of the reference)) #. **n_f1**: `F1`_ of recall and precision at the nucleotide level. #. **j_prec**: Splice junction precision of the prediction model ( TP / (number of splice sites in the prediction)) #. **j_recall**: Splice junction recall of the reference model ( TP / (number of splice sites in the reference)) #. **j_f1**: `F1`_ of recall and precision at the splice junction level. #. **e_prec**: Exon precision of the prediction model ( TP / (number of exons in the prediction)). **NB**: this value is calculated "leniently", ie terminal exons count as a match if the *internal* border is called correctly and the exon is terminal in both prediction and reference. #. **e_recall**: Exon recall of the reference model ( TP / (number of exons in the reference)) #. **e_f1**: `F1`_ of recall and precision at the exon level. #. **distance**: Distance of the model from its putative match. #. **location**: location of the match, with the format :.. An example of TMAP file is as follows:: ref_id ref_gene ccode tid gid tid_num_exons ref_num_exons n_prec n_recall n_f1 j_prec j_recall j_f1 e_prec e_recall e_f1 distance location AT5G66600.2 AT5G66600 = cuff_cufflinks_star_at.23553.1 cuff_cufflinks_star_at.23553.1.gene 9 9 91.30 81.31 86.02 100.00 100.00 100.00 77.78 77.78 77.78 0 Chr5:26575000..26578163 AT5G66600.2 AT5G66600 C cl_Chr5.6272 cl_Chr5.6272.gene 7 9 94.95 72.43 82.18 100.00 75.00 85.71 85.71 66.67 75.00 0 Chr5:26575000..26578087 AT5G66620.1,AT5G66630.1,AT5G66631.1 AT5G66620,AT5G66630,AT5G66631 f,j,j,G st_Stringtie_STAR.21710.15 st_Stringtie_STAR.21710.15.gene 8 11,10,1 19.13,19.95,35.98 54.57,45.65,100.00 28.33,27.76,52.92 28.57,64.29,0.00 20.00,50.00,0.00 23.53,56.25,0.00 12.50,37.50,0.00 9.09,30.00,0.00 10.53,33.33,0.00 0 Chr5:26588402..26598231 You can notice that the third example is particular as the prediction transcript matches not one but multiple reference transcripts. This is a fusion_ event. .. _refmap: RefMap files ------------ RefMap files are tabular files which store the information regarding the best match for each reference transcript, among all possible prediction models. The columns of the file are as follows: #. **ref_id**: Transcript ID of the reference model. #. **ccode**: class code of the match. See :ref:`the relevant section on Class codes `. #. **tid**: Transcript ID of the prediction model. #. **gid**: Gene ID of the prediction model. #. **nF1**: `F1`_ of recall and precision at the nucleotide level. #. **jF1**: `F1`_ of recall and precision at the splice junction level. #. **eF1**: `F1`_ of recall and precision at the exon level. **NB**: this value is calculated "leniently", ie terminal exons count as a match if the *internal* border is called correctly and the exon is terminal in both prediction and reference. #. **ref_gene**: Gene ID of the reference model. #. **best_ccode**: Best possible class code found for any of the transcripts of the gene. #. **best_tid**: Transcript ID of the prediction model which fit best one of the transcript models of the reference gene. #. **best_gid**: Gene ID of the prediction model which fit best one of the transcript models of the reference gene. #. **best_nF1**: `F1`_ of recall and precision at the nucleotide level, for the best possible comparison. #. **best_jF1**: `F1`_ of recall and precision at the splice junction level, for the best possible comparison. #. **best_eF1**: `F1`_ of recall and precision at the exon level, for the best possible comparison. #. **location**: location of the match, with the format :.. An example of a RefMap file is as follows:: ref_id ccode tid gid nF1 jF1 eF1 ref_gene best_ccode best_tid best_gid best_nF1 best_jF1 best_eF1 location AT5G66610.1 = mikado.Chr5G4.2 mikado.Chr5G4 98.46 100.0 81.82 AT5G66610 = mikado.Chr5G4.2 mikado.Chr5G4 98.46 100.0 81.82 Chr5:26584780..26587912 AT5G66610.2 J mikado.Chr5G4.2 mikado.Chr5G4 93.91 94.74 76.19 AT5G66610 = mikado.Chr5G4.2 mikado.Chr5G4 98.46 100.0 81.82 Chr5:26584774..26587912 AT5G66620.1 j mikado.Chr5G6.1 mikado.Chr5G6 85.51 95.0 72.73 AT5G66620 j mikado.Chr5G6.1 mikado.Chr5G6 85.51 95.0 72.73 Chr5:26588402..26592423 AT5G66630.1 n mikado.Chr5G8.2 mikado.Chr5G8 93.27 94.74 76.19 AT5G66630 n mikado.Chr5G8.2 mikado.Chr5G8 93.27 94.74 76.19 Chr5:26591981..26595922 Please note that the third example (AT5G66630.1) has as best possible match a fusion_ event. Extended RefMap files ^^^^^^^^^^^^^^^^^^^^^ Mikado can optionally produce more detailed RefMap files, listing the details on recall and precision for each match (rather than just the F1 for each level). If the flag ``-erm`` is present on the command line, the following extra-fields will be present in the file: #. **nRecall**: recall at the nucleotide level. #. **nPrecision**: precision at the nucleotide level. #. **jRecall**: recall at the junction level. #. **jPrecision**: precision at the junction level. #. **eRecall**: recall at the exon level. #. **ePrecision**: precision at the exon level. #. **best_nRecall**: recall at the nucleotide level, for the best possible comparison. #. **best_nPrecision**: precision at the nucleotide level, for the best possible comparison. #. **best_jRecall**: recall at the junction level, for the best possible comparison. #. **best_jPrecision**: precision at the junction level, for the best possible comparison. #. **best_eRecall**: recall at the exon level, for the best possible comparison. #. **best_ePrecision**: precision at the exon level, for the best possible comparison. .. _stats: Stats files ----------- These files provide a summary of the comparison between the reference and the annotation. An example is as follows:: Command line: /home/lucve/miniconda3/envs/mikado2/bin/mikado compare -r reference.gff3 -p Daijin/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o compare -l compare.log 18 reference RNAs in 12 genes 22 predicted RNAs in 15 genes --------------------------------- | Sn | Pr | F1 | Base level: 94.90 83.22 88.68 Exon level (stringent): 80.56 71.60 75.82 Exon level (lenient): 91.18 76.54 83.22 Splice site level: 95.19 81.15 87.61 Intron level: 96.84 88.19 92.31 Intron level (NR): 94.34 79.37 86.21 Intron chain level: 69.23 50.00 58.06 Intron chain level (NR): 69.23 50.00 58.06 Transcript level (stringent): 55.56 45.45 50.00 Transcript level (>=95% base F1): 72.22 59.09 65.00 Transcript level (>=80% base F1): 72.22 59.09 65.00 Gene level (100% base F1): 75.00 60.00 66.67 Gene level (>=95% base F1): 83.33 66.67 74.07 Gene level (>=80% base F1): 83.33 66.67 74.07 # Matching: in prediction; matched: in reference. Matching intron chains: 9 Matched intron chains: 9 Matching monoexonic transcripts: 4 Matched monoexonic transcripts: 4 Total matching transcripts: 13 Total matched transcripts: 13 Missed exons (stringent): 14/72 (19.44%) Novel exons (stringent): 23/81 (28.40%) Missed exons (lenient): 6/68 (8.82%) Novel exons (lenient): 19/81 (23.46%) Missed introns: 3/53 (5.66%) Novel introns: 13/63 (20.63%) Missed transcripts (0% nF1): 0/18 (0.00%) Novel transcripts (0% nF1): 3/22 (13.64%) Missed genes (0% nF1): 0/12 (0.00%) Novel genes (0% nF1): 3/15 (20.00%) The first section of the file describes: #. Concordance of the two annotations at the base level (recall, precision, and F1) #. Concordance of the two annotation at the exonic level (recall, precision, and F1), in two ways: * *"stringent"*: only perfect exonic matches are considered. * *"lenient"*: in this mode, terminal exons are counted as a match if the **internal** border is matched. See the RGASP paper [RGASP]_ for details on the rationale. #. Concordance of the two annotations in regards with the splice junctions, analysed independently of one another. #. Concordance of the two annotations at the intron level. #. Concordance of the two annotations at the intron chain level - how many intron chains of the reference are found identical in the prediction. Only multiexonic models are considered for this level. #. Concordance of the two annotations at the transcript level, in three different modes: * *"stringent"*: in this mode, only perfect matches are considered. * *"95% base F1"*: in this mode, we only count instances where the nucleotide F1 is greater than *95%* and, for multiexonic transcripts, the intron chain is reconstructed perfectly. * *"80% base F1"*: in this mode, we only count instances where the nucleotide F1 is greater than *80%* and, for multiexonic transcripts, the intron chain is reconstructed perfectly. #. Concordance of the two annotations at the gene level, in three different modes: * *"stringent"*: in this mode, we consider reference genes for which it was possible to find at least one perfect match for one of its transcripts. * *"95% base F1"*: in this mode, we only count instances where the nucleotide F1 is greater than *95%* and, for multiexonic transcripts, the intron chain is reconstructed perfectly. The best possible match is considered for this statistic. * *"80% base F1"*: in this mode, we only count instances where the nucleotide F1 is greater than *80%* and, for multiexonic transcripts, the intron chain is reconstructed perfectly. The best possible match is considered for this statistic. In the second section, the file reports how many of the intron chains, monoexonic transcripts and total transcripts in the **reference** were *matched* by at least one *matching* **prediction** transcript. Finally, in the third section the file reports the number of missed (present in the reference but not in the prediction) or novel (viceversa - present in the prediction but not in the reference) features. .. note:: Please note that a gene might be considered as "found" even if its best match is intronic, on the opposite strand, or not directly overlapping it, or is in the opposite strand (see :ref:`next section `, in particular the *Intronic*, *Fragment* and *No overlap* categories). .. _ccodes: Class codes ~~~~~~~~~~~ In addition to recall, precision and F1 values, Mikado assign each comparison between two transcripts a *class code*, which summarises the relationship between the two transcripts. The idea is lifted from the popular tool `Cuffcompare`_, although Mikado greatly extends the catalogue of possible class codes. All class codes fall within one of the following categories: - **Match**: class codes of this type indicate concordance between the two transcript models. - **Extension**: class codes of this type indicate that one of the two models extends the intron chain of the other, without internal interruptions. The extension can be from either perspective - either the prediction extends the reference, or it is instead *contained* within the reference (so that switching perspectives, the reference would "extend" the prediction). - **Alternative splicing**: the two exon chains overlap but differ in significant ways. - **Intronic**: either the prediction is completely contained within the introns of the reference, or viceversa. - **Overlap**: the two transcript models generically overlap on their exonic sequence. - **Fragment**: the prediction is a fragment of the reference, in most cases because they are on opposite strands. - **No overlap**: the prediction and the reference are near but do not directly overlap. .. _fusion: - **Fusion**: this special class code is a qualifier and it never appears on its own. When a transcript is defined as a fusion between two or more reference transcript, it will appear on *multiple lines*, one for each of the matches. In each line, the class code will be a "f," followed by the class code assigned to that particular comparison. So e.g. a prediction which matches two reference models, one with a "j" and another with a "o", will be present in two different lines; the first one with a class code of **f,j** and the second with a class code of **f,o**. In the *refmap* file, if the fusion is the best match, the class code will be "f" followed by the class code for the individual reference transcript; e.g., **"f,j"** .. topic:: Available class codes +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | Class code | Definition | Reference multiexonic? | Prediction multiexonic? | Nucleotide: RC, PC, F1 | Junction: RC, PC, F1 | Reverse | Category | +==============+================================+==========================+===========================+============================+========================+===========+=============+ +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | = | Complete intron chain match. | True | True | NA | 100%, 100%, 100% | = | Match | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | _ | Complete match between two | False | False | NA, NA, >=80% | NA | _ | Match | | | monoexonic transcripts. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | n | Intron chain extension, ie. | True | True | 100%, < 100%, <100% | 100%, < 100%, <100% | c | Extension | | | both transcripts are | | | | | | | | | multiexonic and the | | | | | | | | | prediction has novel splice | | | | | | | | | sites outside of the reference | | | | | | | | | transcript boundaries. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | J | Intron chain extension, ie. | True | True | 100%, <= 100%, <100% | 100%, < 100%, <100% | C | Extension | | | both transcripts are | | | | | | | | | multiexonic and the | | | | | | | | | prediction has novel splice | | | | | | | | | sites inside of the reference | | | | | | | | | transcript boundaries. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | c | The prediction is either | NA | NA | < 100%, 100%, NA | < 100%, 100%, NA | n | Extension | | | multiexonic and with its | | | | | | | | | intron chain completely | | | | | | | | | contained within that of | | | | | | | | | the reference, or monoexonic | | | | | | | | | and contained within one of | | | | | | | | | the reference exons. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | C | The prediction intron chain is | True | True | <= 100%, < 100%, < 100% | < 100%, 100%, < 100% | J or j | Extension | | | completely contained within | | | | | | | | | that of the reference | | | | | | | | | transcript, but it partially | | | | | | | | | debords either into its | | | | | | | | | introns or outside of the | | | | | | | | | reference boundaries. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | j | Alternative splicing event. | True | True | NA | <= 100%, 100%, < 100% | j or C | Alternative | | | | | | | | | splicing | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | h | Structural match between two | True | True | > 0%, > 0%, > 0% | 0%, 0%, 0% | h | Alternative | | | models where where no splice | | | | | | splicing | | | site is conserved but at least | | | | | | | | | one intron of the reference | | | | | | | | | and one intron of the | | | | | | | | | prediction partially overlap. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | g | The monoexonic prediction | True | False | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | G | Alternative | | | overlaps one or more exons of | | | | | | splicing | | | the reference transcript; | | | | | | | | | the borders of the prediction | | | | | | | | | cannot fall inside the introns | | | | | | | | | of the reference. The | | | | | | | | | prediction transcript can | | | | | | | | | bridge multiple exons of the | | | | | | | | | reference model. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | G | Generic match of a multiexonic | False | True | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | g | Alternative | | | prediction transcript versus a | | | | | | splicing | | | monoexonic reference. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | o | Generic overlap between two | True | True | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | o | Overlap | | | multiexonic transcripts, | | | | | | | | | which do not share any overlap | | | | | | | | | among their introns. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | e | Single exon transcript | True | False | > 0%, > 0%, 0% < F1 < 100% | 0%, 0%, 0% | G | Overlap | | | overlapping one reference exon | | | | | | | | | and at least 10 bps of a | | | | | | | | | reference intron, indicating a | | | | | | | | | possible pre-mRNA fragment. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | m | Generic match between two | False | False | NA, NA, < 80% | NA | m | Overlap | | | monoexonic transcripts. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | i | Monoexonic prediction | True | False | 0%, 0%, 0% | 0%, 0%, 0% | ri | Intronic | | | completely contained within | | | | | | | | | one intron of the reference | | | | | | | | | transcript. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | I | Prediction completely | True | True | 0%, 0%, 0% | 0%, 0%, 0% | rI | Intronic | | | contained within the introns | | | | | | | | | of the reference transcript. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | ri | Reverse intron transcript - | False | True | 0%, 0%, 0% | 0%, 0%, 0% | i | Intronic | | | the monoexonic reference is | | | | | | | | | completely contained | | | | | | | | | within one intron of the | | | | | | | | | prediction transcript. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | rI | Multiexonic reference | True | True | 0%, 0%, 0% | 0%, 0%, 0% | I | Intronic | | | completely contained within | | | | | | | | | the introns of the prediction | | | | | | | | | transcript. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | f | Fusion - this special code is | NA | NA | > 10%, 0%, 0% | > 0%, 0%, 0% | NA | Fusion | | | applied when a prediction | | | | | | | | | intersects more than one | | | | | | | | | reference transcript. To be | | | | | | | | | considered for fusions, | | | | | | | | | candidate references must | | | | | | | | | **either** share at least one | | | | | | | | | splice junction with the | | | | | | | | | prediction, **or** have at | | | | | | | | | least 10% of its bases | | | | | | | | | recalled. If two or more | | | | | | | | | reference transcripts fit | | | | | | | | | these constraints, then the | | | | | | | | | prediction model is classified | | | | | | | | | as a fusion. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | x | Monoexonic match on the | NA | False | >0%, >0%, >0% | 0%, 0%, 0% | x or X | Fragment | | | **opposite** strand. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | X | Multiexonic match on the | NA | True | >0%, >0%, >0% | NA | x or X | Fragment | | | **opposite** strand. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | p | The prediction is on the same | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | p | Fragment | | | strand of a neighbouring but | | | | | | | | | non-overlapping transcript. | | | | | | | | | Probable polymerase run-on | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | P | The prediction is on the | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | P | Fragment | | | opposite strand of a | | | | | | | | | neighbouring but non- | | | | | | | | | overlapping transcript. | | | | | | | | | Probable polymerase run-on. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ | u | Unknown - no suitable model | NA | NA | 0%, 0%, 0% | 0%, 0%, 0% | NA | Unknown | | | has been found near enough the | | | | | | | | | prediction to perform a | | | | | | | | | comparison. | | | | | | | +--------------+--------------------------------+--------------------------+---------------------------+----------------------------+------------------------+-----------+-------------+ Technical details ~~~~~~~~~~~~~~~~~ Mikado compare conceptualizes the reference annotation as a collection of interval trees, one per chromosome or scaffold, where each node corresponds to an array of genes at the location. The gene and transcript objects are stored separately. The location of each transcript model in the prediction is queried against the tree, with a padding (default 2kbps) to allow for neighouring but non-overlapping genes, and the transcript itself is subsequently compared with each reference transcript contained in the hits. Each comparison will yield precision, recall and F1 values for the nucleotide, splice junction and exonic levels, together with an associated class code. The best match for the prediction is selected for by choosing the comparison yielding the best splice junction F1 and the best nucleotide F1, in this order. If the prediction transcript overlaps two or more genes on the same strand, and for at least two it has one match each with either 10% nucleotide recall or junction recall over 0%, it is deemed as a fusion_ event, and its line in the tmap_ file will report the best match against each of the fused genes, separated by comma. Each calculated match against a reference transcript is stored as a potential *best match* for the reference transcript. At the end of the run, the hits for each reference transcript will be ordered using the following function: .. code-block:: python :linenos: @staticmethod def result_sorter(result): """ Method to sort the results for the refmap. Order: - CCode does not contain "x", "P", "p" (i.e. fragments on opposite strand or polymerase run-on fragments) - Exonic F1 (e_f1) - Junction F1 (j_f1) - "f" in ccode (i.e. transcript is a fusion) - Nucleotide F1 (n_f1) :param result: a resultStorer object :type result: ResultStorer :return: (int, float, float, float) """ bad_ccodes = ["x", "X", "P", "p"] bad_ccodes = set(bad_ccodes) orderer = (len(set.intersection(bad_ccodes, set(result.ccode))) == 0, result.j_f1, result.e_f1, result.n_f1, "f" in result.ccode) return orderer This function is used to select both for the best match *for the transcript*, as well as to select among these matches for the best match *for the gene*. The interval tree data structure is created using Cython code originally part of the `bx-python `_, kindly provided by `Dr. Taylor `_ for modification and inclusion in Mikado. We subsequently integrated the improvements made on the code by Dr. Brent Pedersen in his :ref:`quicksect fork `. We further modified the original code for allowing "fuzzy matches", as in, allowing some leeway on how far the edges of the query interval are from the target intervals present in the tree. The .midx files storing the annotation for repeated compare runs are SQLite files. In them, Mikado will store for each gene its coordinates, its transcripts, and the location of exons and CDS features. MIDX files make repeated runs quite faster, as the program will not have to re-parse the GFF file. .. note:: Before version 1.1, Mikado MIDX files were GZip-compressed files. If you try to use an old index, Mikado will complain about it and recreate it from scratch. The comparison code is written in Cython and is crucial during the :ref:`picking phase of Mikado `, not just for the functioning of the comparison utility.