7.2.1. Compare

7.2.1.1. Overview

This Mikado utility allows the user to compare the transcripts from any two annotations. Its output allows:

  • To understand which reference transcript each prediction is most similar to
  • To understand which prediction transcript best represent each reference model
  • To have a summary information about the similarity between the two annotations.

Mikado compare has been directly inspired by the popular Cuffcompare [Cufflinks] utility and by ParsEval [ParsEval]. Please note that while superficially similar to Cuffcompare in the style of the output files, Mikado compare is more philosophically similar to ParsEval, as it will not try to aggregate transcripts in loci but will perform a pure comparison between the two annotation files. Mikado compare can accept BAM, BED12, GTF and GFF files as input, both for the reference or for the prediction.

7.2.1.2. Usage

Mikado compare is invoked by specifying the reference annotation and the desired mode of analysis. There are three possible options:

  1. In its default mode, compare will ask for a prediction annotation to compare the reference against.
  2. In the “self” mode, compare will do a self-comparison of the reference against itself, excluding as possible results the matches between a transcript and itself. It can be useful to glean the relationships between transcripts and genes in an annotation.
  3. In the “internal” mode of operations, compare will again perform a self-comparison, focussed on multi-isoform genes. For those, compare will perform and report all possible comparisons. It is useful to understand the relationships between the transcripts in a single locus.

Mikado stores the information of the reference in a specialised SQLite index, with a “.midx” suffix, which will be created by the program upon its first execution with a new reference. If the index file is already present, Mikado will try to use it rather than read again the annotation.

Note

Starting from version 1.5, Mikado compare supports multiprocessing. Please note that memory usage scales approximately linearly with the amount of processes requested.

7.2.1.2.1. Command line

    usage: Mikado compare [-h] -r REFERENCE
                      (-p PREDICTION | --self | --internal | --index)
                      [--distance DISTANCE] [-pc] [-o OUT] [--lenient] [-eu]
                      [-n] [-erm] [-upa] [-l LOG] [-v] [-z]
                      [--processes PROCESSES]

optional arguments:
  -h, --help            show this help message and exit
  --distance DISTANCE   Maximum distance for a transcript to be considered a
                        polymerase run-on. Default: 2000
  -pc, --protein-coding
                        Flag. If set, only transcripts with a CDS (both in
                        reference and prediction) will be considered.
  -o OUT, --out OUT     Prefix for the output files. Default: mikado_compare
  --lenient             If set, exonic statistics will be calculated leniently
                        in the TMAP as well - ie they will consider an exon as
                        match even if only the internal junction has been
                        recovered.
  -eu, --exclude-utr    Flag. If set, reference and prediction transcripts
                        will be stripped of their UTRs (if they are coding).
  -n, --no-index, --no-save-index
                        Unless this flag is set, compare will save an index of
                        the reference to quicken multiple calls.
  -erm, --extended-refmap
                        Flag. If set, the RefMap will also contain recall and
                        precision statistics - not just the F1.
  -upa, --use-prediction-alias
                        Flag. If set, Mikado Compare will use the alias -
                        rather than the transcript ID - to report the results
                        for prediction transcripts in the TMAP and REFMAP
                        files.
  -l LOG, --log LOG
  -v, --verbose
  -z, --gzip            Flag. If set, TMAP and REFMAP files will be GZipped.
  --processes PROCESSES

Prediction and annotation files.:
  -r REFERENCE, --reference REFERENCE
                        Reference annotation file. By default, an index will
                        be crated and saved with the suffix ".midx".
  -p PREDICTION, --prediction PREDICTION
                        Prediction annotation file.
  --self                Flag. If set, the reference will be compared with
                        itself. Useful for understanding how the reference
                        transcripts interact with each other.
  --internal            Flag. If set, for each gene with more than one
                        transcript isoform each will be compared to the
                        others. Useful for understanding the structural
                        relationships between the transcripts in each gene.
  --index               Flag. If set, compare will stop after having generated
                        the GFF index for the reference.

7.2.1.3. Output files

Mikado compare produces two tabular files, tmap and refmap, and one statistics file.

7.2.1.3.1. TMAP files

TMAP are tabular files that store the information regarding the best match for each prediction in the reference. The columns are as follows:

  1. ref_id: Transcript ID of the matched reference model(s).
  2. ref_gene: Gene ID of the matched reference model(s).
  3. ccode: class code of the match. See the relevant section on Class codes.
  4. tid: Transcript ID of the prediction model.
  5. gid: Gene ID of the prediction model.
  6. tid_num_exons: Number of exons of the prediction model.
  7. ref_num_exons: Number of exons of the reference model.
  8. n_prec: Nucleotide precision of the prediction ( TP / (length of the prediction))
  9. n_recall: Nucleotide recall of the reference (TP / (length of the reference))
  10. n_f1: F1 of recall and precision at the nucleotide level.
  11. j_prec: Splice junction precision of the prediction model ( TP / (number of splice sites in the prediction))
  12. j_recall: Splice junction recall of the reference model ( TP / (number of splice sites in the reference))
  13. j_f1: F1 of recall and precision at the splice junction level.
  14. e_prec: Exon precision of the prediction model ( TP / (number of exons in the prediction)). NB: this value is calculated “leniently”, ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
  15. e_recall: Exon recall of the reference model ( TP / (number of exons in the reference))
  16. e_f1: F1 of recall and precision at the exon level.
  17. distance: Distance of the model from its putative match.
  18. location: location of the match, with the format <chromosome>:<start>..<end>

An example of TMAP file is as follows:

ref_id      ref_gene        ccode   tid     gid     tid_num_exons   ref_num_exons   n_prec  n_recall        n_f1    j_prec  j_recall        j_f1    e_prec  e_recall        e_f1    distance        location
AT5G66600.2 AT5G66600       =       cuff_cufflinks_star_at.23553.1  cuff_cufflinks_star_at.23553.1.gene     9       9       91.30   81.31   86.02   100.00  100.00  100.00  77.78   77.78   77.78   0       Chr5:26575000..26578163
AT5G66600.2 AT5G66600       C       cl_Chr5.6272    cl_Chr5.6272.gene       7       9       94.95   72.43   82.18   100.00  75.00   85.71   85.71   66.67   75.00   0       Chr5:26575000..26578087
AT5G66620.1,AT5G66630.1,AT5G66631.1 AT5G66620,AT5G66630,AT5G66631   f,j,j,G st_Stringtie_STAR.21710.15      st_Stringtie_STAR.21710.15.gene 8       11,10,1 19.13,19.95,35.98       54.57,45.65,100.00      28.33,27.76,52.92       28.57,64.29,0.00        20.00,50.00,0.00        23.53,56.25,0.00        12.50,37.50,0.00        9.09,30.00,0.00 10.53,33.33,0.00        0       Chr5:26588402..26598231

You can notice that the third example is particular as the prediction transcript matches not one but multiple reference transcripts. This is a fusion event.

7.2.1.3.2. RefMap files

RefMap files are tabular files which store the information regarding the best match for each reference transcript, among all possible prediction models. The columns of the file are as follows:

  1. ref_id: Transcript ID of the reference model.
  2. ccode: class code of the match. See the relevant section on Class codes.
  3. tid: Transcript ID of the prediction model.
  4. gid: Gene ID of the prediction model.
  5. nF1: F1 of recall and precision at the nucleotide level.
  6. jF1: F1 of recall and precision at the splice junction level.
  7. eF1: F1 of recall and precision at the exon level. NB: this value is calculated “leniently”, ie terminal exons count as a match if the internal border is called correctly and the exon is terminal in both prediction and reference.
  8. ref_gene: Gene ID of the reference model.
  9. best_ccode: Best possible class code found for any of the transcripts of the gene.
  10. best_tid: Transcript ID of the prediction model which fit best one of the transcript models of the reference gene.
  11. best_gid: Gene ID of the prediction model which fit best one of the transcript models of the reference gene.
  12. best_nF1: F1 of recall and precision at the nucleotide level, for the best possible comparison.
  13. best_jF1: F1 of recall and precision at the splice junction level, for the best possible comparison.
  14. best_eF1: F1 of recall and precision at the exon level, for the best possible comparison.
  15. location: location of the match, with the format <chromosome>:<start>..<end>

An example of a RefMap file is as follows:

ref_id      ccode   tid     gid     nF1     jF1     eF1     ref_gene        best_ccode      best_tid        best_gid        best_nF1        best_jF1        best_eF1    location
AT5G66610.1 =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   AT5G66610       =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   Chr5:26584780..26587912
AT5G66610.2 J       mikado.Chr5G4.2 mikado.Chr5G4   93.91   94.74   76.19   AT5G66610       =       mikado.Chr5G4.2 mikado.Chr5G4   98.46   100.0   81.82   Chr5:26584774..26587912
AT5G66620.1 j       mikado.Chr5G6.1 mikado.Chr5G6   85.51   95.0    72.73   AT5G66620       j       mikado.Chr5G6.1 mikado.Chr5G6   85.51   95.0    72.73   Chr5:26588402..26592423
AT5G66630.1 n       mikado.Chr5G8.2 mikado.Chr5G8   93.27   94.74   76.19   AT5G66630       n       mikado.Chr5G8.2 mikado.Chr5G8   93.27   94.74   76.19   Chr5:26591981..26595922

Please note that the third example (AT5G66630.1) has as best possible match a fusion event.

7.2.1.3.2.1. Extended RefMap files

Mikado can optionally produce more detailed RefMap files, listing the details on recall and precision for each match (rather than just the F1 for each level). If the flag -erm is present on the command line, the following extra-fields will be present in the file:

  1. nRecall: recall at the nucleotide level.
  2. nPrecision: precision at the nucleotide level.
  3. jRecall: recall at the junction level.
  4. jPrecision: precision at the junction level.
  5. eRecall: recall at the exon level.
  6. ePrecision: precision at the exon level.
  7. best_nRecall: recall at the nucleotide level, for the best possible comparison.
  8. best_nPrecision: precision at the nucleotide level, for the best possible comparison.
  9. best_jRecall: recall at the junction level, for the best possible comparison.
  10. best_jPrecision: precision at the junction level, for the best possible comparison.
  11. best_eRecall: recall at the exon level, for the best possible comparison.
  12. best_ePrecision: precision at the exon level, for the best possible comparison.

7.2.1.3.3. Stats files

These files provide a summary of the comparison between the reference and the annotation. An example is as follows:

Command line:
/home/lucve/miniconda3/envs/mikado2/bin/mikado compare -r reference.gff3 -p Daijin/5-mikado/pick/permissive/mikado-permissive.loci.gff3 -o compare -l compare.log
18 reference RNAs in 12 genes
22 predicted RNAs in  15 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 94.90  83.22  88.68
            Exon level (stringent): 80.56  71.60  75.82
              Exon level (lenient): 91.18  76.54  83.22
                 Splice site level: 95.19  81.15  87.61
                      Intron level: 96.84  88.19  92.31
                 Intron level (NR): 94.34  79.37  86.21
                Intron chain level: 69.23  50.00  58.06
           Intron chain level (NR): 69.23  50.00  58.06
      Transcript level (stringent): 55.56  45.45  50.00
  Transcript level (>=95% base F1): 72.22  59.09  65.00
  Transcript level (>=80% base F1): 72.22  59.09  65.00
         Gene level (100% base F1): 75.00  60.00  66.67
        Gene level (>=95% base F1): 83.33  66.67  74.07
        Gene level (>=80% base F1): 83.33  66.67  74.07

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 9
             Matched intron chains: 9
   Matching monoexonic transcripts: 4
    Matched monoexonic transcripts: 4
        Total matching transcripts: 13
         Total matched transcripts: 13

          Missed exons (stringent): 14/72  (19.44%)
           Novel exons (stringent): 23/81  (28.40%)
            Missed exons (lenient): 6/68  (8.82%)
             Novel exons (lenient): 19/81  (23.46%)
                    Missed introns: 3/53  (5.66%)
                     Novel introns: 13/63  (20.63%)

       Missed transcripts (0% nF1): 0/18  (0.00%)
        Novel transcripts (0% nF1): 3/22  (13.64%)
             Missed genes (0% nF1): 0/12  (0.00%)
              Novel genes (0% nF1): 3/15  (20.00%)

The first section of the file describes:

  1. Concordance of the two annotations at the base level (recall, precision, and F1)
  2. Concordance of the two annotation at the exonic level (recall, precision, and F1), in two ways:
    • “stringent”: only perfect exonic matches are considered.
    • “lenient”: in this mode, terminal exons are counted as a match if the internal border is matched. See the RGASP paper [RGASP] for details on the rationale.
  3. Concordance of the two annotations in regards with the splice junctions, analysed independently of one another.
  4. Concordance of the two annotations at the intron level.
  5. Concordance of the two annotations at the intron chain level - how many intron chains of the reference are found identical in the prediction. Only multiexonic models are considered for this level.
  6. Concordance of the two annotations at the transcript level, in three different modes:
    • “stringent”: in this mode, only perfect matches are considered.
    • “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
    • “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly.
  7. Concordance of the two annotations at the gene level, in three different modes:
    • “stringent”: in this mode, we consider reference genes for which it was possible to find at least one perfect match for one of its transcripts.
    • “95% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 95% and, for multiexonic transcripts, the intron chain is reconstructed perfectly. The best possible match is considered for this statistic.
    • “80% base F1”: in this mode, we only count instances where the nucleotide F1 is greater than 80% and, for multiexonic transcripts, the intron chain is reconstructed perfectly. The best possible match is considered for this statistic.

In the second section, the file reports how many of the intron chains, monoexonic transcripts and total transcripts in the reference were matched by at least one matching prediction transcript. Finally, in the third section the file reports the number of missed (present in the reference but not in the prediction) or novel (viceversa - present in the prediction but not in the reference) features.

Note

Please note that a gene might be considered as “found” even if its best match is intronic, on the opposite strand, or not directly overlapping it, or is in the opposite strand (see next section, in particular the Intronic, Fragment and No overlap categories).

7.2.1.4. Class codes

In addition to recall, precision and F1 values, Mikado assign each comparison between two transcripts a class code, which summarises the relationship between the two transcripts. The idea is lifted from the popular tool Cuffcompare, although Mikado greatly extends the catalogue of possible class codes. All class codes fall within one of the following categories:

  • Match: class codes of this type indicate concordance between the two transcript models.
  • Extension: class codes of this type indicate that one of the two models extends the intron chain of the other, without internal interruptions. The extension can be from either perspective - either the prediction extends the reference, or it is instead contained within the reference (so that switching perspectives, the reference would “extend” the prediction).
  • Alternative splicing: the two exon chains overlap but differ in significant ways.
  • Intronic: either the prediction is completely contained within the introns of the reference, or viceversa.
  • Overlap: the two transcript models generically overlap on their exonic sequence.
  • Fragment: the prediction is a fragment of the reference, in most cases because they are on opposite strands.
  • No overlap: the prediction and the reference are near but do not directly overlap.
  • Fusion: this special class code is a qualifier and it never appears on its own. When a transcript is defined as a fusion between two or more reference transcript, it will appear on multiple lines, one for each of the matches. In each line, the class code will be a “f,” followed by the class code assigned to that particular comparison. So e.g. a prediction which matches two reference models, one with a “j” and another with a “o”, will be present in two different lines; the first one with a class code of f,j and the second with a class code of f,o. In the refmap file, if the fusion is the best match, the class code will be “f” followed by the class code for the individual reference transcript; e.g., “f,j”

Available class codes

Class code Definition Reference multiexonic? Prediction multiexonic? Nucleotide: RC, PC, F1 Junction: RC, PC, F1 Reverse Category
               
= Complete intron chain match. True True NA 100%, 100%, 100% = Match
_ Complete match between two monoexonic transcripts. False False NA, NA, >=80% NA _ Match
n Intron chain extension, ie. both transcripts are multiexonic and the prediction has novel splice sites outside of the reference transcript boundaries. True True 100%, < 100%, <100% 100%, < 100%, <100% c Extension
J Intron chain extension, ie. both transcripts are multiexonic and the prediction has novel splice sites inside of the reference transcript boundaries. True True 100%, <= 100%, <100% 100%, < 100%, <100% C Extension
c The prediction is either multiexonic and with its intron chain completely contained within that of the reference, or monoexonic and contained within one of the reference exons. NA NA < 100%, 100%, NA < 100%, 100%, NA n Extension
C The prediction intron chain is completely contained within that of the reference transcript, but it partially debords either into its introns or outside of the reference boundaries. True True <= 100%, < 100%, < 100% < 100%, 100%, < 100% J or j Extension
j Alternative splicing event. True True NA <= 100%, 100%, < 100% j or C Alternative splicing
h Structural match between two models where where no splice site is conserved but at least one intron of the reference and one intron of the prediction partially overlap. True True > 0%, > 0%, > 0% 0%, 0%, 0% h Alternative splicing
g The monoexonic prediction overlaps one or more exons of the reference transcript; the borders of the prediction cannot fall inside the introns of the reference. The prediction transcript can bridge multiple exons of the reference model. True False > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% G Alternative splicing
G Generic match of a multiexonic prediction transcript versus a monoexonic reference. False True > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% g Alternative splicing
o Generic overlap between two multiexonic transcripts, which do not share any overlap among their introns. True True > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% o Overlap
e Single exon transcript overlapping one reference exon and at least 10 bps of a reference intron, indicating a possible pre-mRNA fragment. True False > 0%, > 0%, 0% < F1 < 100% 0%, 0%, 0% G Overlap
m Generic match between two monoexonic transcripts. False False NA, NA, < 80% NA m Overlap
i Monoexonic prediction completely contained within one intron of the reference transcript. True False 0%, 0%, 0% 0%, 0%, 0% ri Intronic
I Prediction completely contained within the introns of the reference transcript. True True 0%, 0%, 0% 0%, 0%, 0% rI Intronic
ri Reverse intron transcript - the monoexonic reference is completely contained within one intron of the prediction transcript. False True 0%, 0%, 0% 0%, 0%, 0% i Intronic
rI Multiexonic reference completely contained within the introns of the prediction transcript. True True 0%, 0%, 0% 0%, 0%, 0% I Intronic
f Fusion - this special code is applied when a prediction intersects more than one reference transcript. To be considered for fusions, candidate references must either share at least one splice junction with the prediction, or have at least 10% of its bases recalled. If two or more reference transcripts fit these constraints, then the prediction model is classified as a fusion. NA NA > 10%, 0%, 0% > 0%, 0%, 0% NA Fusion
x Monoexonic match on the opposite strand. NA False >0%, >0%, >0% 0%, 0%, 0% x or X Fragment
X Multiexonic match on the opposite strand. NA True >0%, >0%, >0% NA x or X Fragment
p The prediction is on the same strand of a neighbouring but non-overlapping transcript. Probable polymerase run-on NA NA 0%, 0%, 0% 0%, 0%, 0% p Fragment
P The prediction is on the opposite strand of a neighbouring but non- overlapping transcript. Probable polymerase run-on. NA NA 0%, 0%, 0% 0%, 0%, 0% P Fragment
u Unknown - no suitable model has been found near enough the prediction to perform a comparison. NA NA 0%, 0%, 0% 0%, 0%, 0% NA Unknown

7.2.1.5. Technical details

Mikado compare conceptualizes the reference annotation as a collection of interval trees, one per chromosome or scaffold, where each node corresponds to an array of genes at the location. The gene and transcript objects are stored separately. The location of each transcript model in the prediction is queried against the tree, with a padding (default 2kbps) to allow for neighouring but non-overlapping genes, and the transcript itself is subsequently compared with each reference transcript contained in the hits. Each comparison will yield precision, recall and F1 values for the nucleotide, splice junction and exonic levels, together with an associated class code. The best match for the prediction is selected for by choosing the comparison yielding the best splice junction F1 and the best nucleotide F1, in this order. If the prediction transcript overlaps two or more genes on the same strand, and for at least two it has one match each with either 10% nucleotide recall or junction recall over 0%, it is deemed as a fusion event, and its line in the tmap file will report the best match against each of the fused genes, separated by comma.

Each calculated match against a reference transcript is stored as a potential best match for the reference transcript. At the end of the run, the hits for each reference transcript will be ordered using the following function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
@staticmethod
def result_sorter(result):

    """
    Method to sort the results for the refmap. Order:
    - CCode does not contain "x", "P", "p" (i.e. fragments on opposite strand or
    polymerase run-on fragments)
    - Exonic F1 (e_f1)
    - Junction F1 (j_f1)
    - "f" in ccode (i.e. transcript is a fusion)
    - Nucleotide F1 (n_f1)

    :param result: a resultStorer object
    :type result: ResultStorer
    :return: (int, float, float, float)
    """

    bad_ccodes = ["x", "X", "P", "p"]
    bad_ccodes = set(bad_ccodes)

    orderer = (len(set.intersection(bad_ccodes, set(result.ccode))) == 0,
               result.j_f1, result.e_f1,
               result.n_f1,
               "f" in result.ccode)

    return orderer

This function is used to select both for the best match for the transcript, as well as to select among these matches for the best match for the gene.

The interval tree data structure is created using Cython code originally part of the bx-python, kindly provided by Dr. Taylor for modification and inclusion in Mikado. We subsequently integrated the improvements made on the code by Dr. Brent Pedersen in his quicksect fork.

We further modified the original code for allowing “fuzzy matches”, as in, allowing some leeway on how far the edges of the query interval are from the target intervals present in the tree.

The .midx files storing the annotation for repeated compare runs are SQLite files. In them, Mikado will store for each gene its coordinates, its transcripts, and the location of exons and CDS features. MIDX files make repeated runs quite faster, as the program will not have to re-parse the GFF file.

Note

Before version 1.1, Mikado MIDX files were GZip-compressed files. If you try to use an old index, Mikado will complain about it and recreate it from scratch.

The comparison code is written in Cython and is crucial during the picking phase of Mikado, not just for the functioning of the comparison utility.