3. Tutorial

This tutorial will guide you through a simple analysis of Mikado, using a small amount of data coming from an experiment on Arabidopsis thaliana. RNA-Seq data was obtained from study PRJEB7093 on ENA, aligned with STAR [STAR] against the TAIR10 reference genome, and assembled with four different programs. For this small example, we are going to focus on a small genomic region: Chr5, from 26,575,364 to 26,614,205.

During this analysis, you will require the following files:

All of this data can also be found in the sample_data directory of the Mikado source.

You will also require the following software:

  • a functioning installation of SQLite.
  • a functioning version of BLAST+ [Blastplus].
  • a functioning version of Prodigal [Prodigal].

3.1. Creating the configuration file for Mikado

In the first step, we need to create a configuration file to drive Mikado. To do so, we will first create a tab-delimited file describing our assemblies (class.gtf, cufflinks.gtf, stringtie.gtf, trinity.gff3):

class.gtf       cl      True            False   False   True
cufflinks.gtf   cuff    True            False   False   True
stringtie.gtf   st      True    1       False   True    True
trinity.gff3    tr      False   -0.5    False   False   True
reference.gff3  at      True    5       True    False   False
pacbio.bam      pb      True    1       False   False   False

In this file, the first three fields define the following:

#. The file location and name (if no folder is specified, Mikado will look for each file in the current working directory) #. An alias associated with the file, which has to be unique #. A binary flag (True / False) indicating whether the assembly is strand-specific or not

These fields are then followed by a series of optional fields:

  1. A score associated with that sample. All transcripts associated with the label will have their score corrected by the value on this field. So eg. in this example all Stringtie models will receive an additional point, and all Trinity models will be penalised by half a point. Class and Cufflinks have no special bonus or malus associated.
  2. A binary flag (True / False) defining whether the sample is a reference or not.
  3. A binary flag (True / False) defining whether to exclude redundant models or not.
  4. A binary flag (True / False) indicating whether Mikado prepare should strip the CDS of faulty models, but otherwise keep their cDNA structure in the final output (True) or whether instead it should completely discard such models (False).
  5. A binary flag (True / False) instructing Mikado about whether the chimera split routine should be skipped for these models (True) or if instead it should proceed normally (False).

Finally, we will create the configuration file itself using mikado configure:

mikado configure --list list.txt --reference chr5.fas.gz --mode permissive --scoring plants.yaml  --copy-scoring

plants.yaml –junctions junctions.bed -bt uniprot_sprot_plants.fasta configuration.yaml

This will create a configuration.yaml file with the parameters that were specified on the command line. This is simplified configuration file, containing all the necessary parameters for the Mikado run. It will also copy the plants.yaml file from the Mikado installation to your current working directory.

Hint

Mikado can accept compressed genome FASTA files, like in this example, as long as they have been compressed with BGZip rather than the vanilla UNIX GZip.

  • –list list.txt: this part of the command line instructs Mikado to read the file we just created to understand where the input files are and how to treat them.
  • –scoring: the scoring file to use. Mikado ships with two pre-calculated scoring files, plant.yaml and

mammalian.yaml * –copy-scoring: instruct Mikado to copy the scoring file from the installation directory to the current directory, so that the experimenter can modify it as needed. * –reference chr5.fas: this part of the command line instructs Mikado on the location of the genome file. * –mode permissive: the mode in which Mikado will treat cases of chimeras. See the :ref:`documentation

<chimera_splitting_algorithm>` for details.
  • –junctions junctions.bed: this part of the command line instructs Mikado to consider this file as the source of reliable splicing junctions.
  • -bt uniprot_sprot_plants.fasta: this part of the command line instructs Mikado to consider this file as the BLAST database which will be used for deriving homology information.

Hint

The –copy-scoring argument is usually not necessary, however, it allows you to easily inspect the

scoring file we are going to use during this run.

Hint

Mikado provides a handful of pre-configured scoring files for different species. However, we do recommend

inspecting and tweaking your scoring file to cater to your species. We provide a guide on how to create your own configuration files here.

3.2. Mikado prepare

The subsequent step involves running mikado prepare to create a sorted, non-redundant GTF with all the input assemblies. As we have already created a configuration file with all the details regarding the input files, this will require us only to issue the command:

mikado prepare --json-conf configuration.yaml

This command will create three files:

  1. mikado_prepared.gtf: one of the two main output files. This is a sorted, non-redundant GTF containing the transcripts from the four input GTFs

  2. mikado_prepared.fasta: a FASTA file of the transcripts present in mikado_prepared.gtf.

  3. prepare.log: the log of this step. This should look like the following, minus the timestamps:

    2016-08-10 13:53:58,443 - prepare - prepare.py:67 - INFO - setup - MainProcess - Command line: /usr/users/ga002/venturil/py351/bin/mikado prepare --json-conf configuration.yaml
    2016-08-10 13:53:58,967 - prepare - prepare.py:206 - INFO - perform_check - MainProcess - Finished to analyse 95 transcripts (93 retained)
    2016-08-10 13:53:58,967 - prepare - prepare.py:405 - INFO - prepare - MainProcess - Finished
    

At the end of this phase, you should have 93 candidate transcripts, as 2 were redundant.

3.3. BLAST of the candidate transcripts

Although it is not strictly necessary, Mikado benefits from integrating homology data from BLAST. Mikado requires this data to be provided either in XML or ASN format (in the latter case, blast_formatter will be used to convert it in-memory to XML).

To create this file, we will proceed as follows:

  1. Uncompress the SwissProt database:

    gzip -dc uniprot_sprot_plants.fasta.gz > uniprot_sprot_plants.fasta
    
  2. Prepare the database for the BLAST:

    makeblastdb -in uniprot_sprot_plants.fasta -dbtype prot -parse_seqids > blast_prepare.log
    
  3. Execute the BLAST, asking for XML output, and compress it to limit space usage.

    blastx -max_target_seqs 5 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    
evalue bitscore ppos btop”
-num_threads 10 -query mikado_prepared.fasta -db uniprot_sprot_plants.fasta -out mikado_prepared.blast.tsv

This will produce the mikado_prepared.blast.tsv file, which contains the homology information for the run.

Warning

Mikado requires a custom tabular file from BLAST, as we rely on the information on extra fields such as e.g. btop. Therefore the custom fields following -outfmt 6 are not optional.

3.4. ORF calculation for the transcripts

Many of the metrics used by Mikado to evaluate and rank transcripts rely on the definition or their coding regions (CDS). It is therefore highly recommended to use an ORF predictor to define the coding regions for each transcript identified by mikado prepare. We directly support two different products:

  • Prodigal, a fast ORF predictor, capable of calculating thousands of ORFs in seconds. However, as it was originally developed for ORF calling in bacterial genomes, it may occasionally not provide the best possible answer.
  • TransDecoder, a slower ORF predictor that is however more specialised for eukaryotes.

For this tutorial we are going to use Prodigal. Using it is very straighforward:

prodigal -i mikado_prepared.fasta -g 1 -o mikado.orfs.gff3 -f gff

Warning

Prodigal by default uses the ‘Bacterial’ codon translation table, which is of course not appropriate at all for our eukariote genome. Therefore, it is essential to set -g 1 on the command line. By the same token, as prodigal normally would output the CDS prediction in GenBank format (currently not supported by Mikado), we have to instruct Prodigal to emit its CDS predictions in GFF format.

3.5. Mikado serialise

This step involves running mikado serialise to create a SQLite database with all the information that mikado needs to perform its analysis. As most of the parameters are already specified inside the configuration file, the command line is quite simple:

mikado serialise --json-conf configuration.yaml --xml mikado_prepared.blast.tsv --orfs mikado.orfs.gff3

–blast_targets uniprot_sprot_plants.fasta –junctions junctions.bed

After mikado serialise has run, it will have created two files:

  1. mikado.db, the SQLite database that will be used by pick to perform its analysis.
  2. serialise.log, the log of the run.

If you inspect the SQLite database mikado.db, you will see it contains nine different tables:

$ sqlite3 mikado.db ".tables"
chrom             hit               orf
external          hsp               query
external_sources  junctions         target

These tables contain the information coming from the genome FAI, the BLAST XML, the junctions BED file, the ORFs BED file, and finally the input transcripts and the proteins. There are two additional tables (external and external_sources) which in other runs would contain information on additional data, provided as tabular files.

For more details on the database structure, please refer to the section on this step in this documentation.

3.6. Mikado pick

Finally, during this step mikado pick will integrate the data present in the database with the positional and structural data present in the GTF file to select the best transcript models. The command line to be issued is the following:

mikado pick --configuration configuration.yaml --subloci_out mikado.subloci.gff3

At this step, we have to specify only some parameters for pick to function:

  • –configuration: the configuration file. This is the only compulsory option.
  • –subloci_out: the partial results concerning the subloci step during the selection process will be written to

mikado.subloci.gff3.

mikado pick will produce the following output files:

  • mikado.loci.gff3, mikado.loci.metrics.tsv, mikado.loci.scores.tsv: the proper output files. These contain the location of the selected transcripts, their metrics, and their scores. Please see this section for details.
  • mikado.subloci.gff3, mikado.subloci.metrics.tsv, mikado.subloci.scores.tsv: these files contain the same type of information as those above, but for the subloci stage. As such, all the transcripts in the input files are represented, not just those that are going to be selected as the best.
  • mikado_pick.log: the log file for this operation.

3.7. Comparing files with the reference

Finally, we can compare our files to the original reference annotation, and see how our results are compared to those. To do so, we will use Mikado compare. The first step is to index the reference annotation to make the comparisons faster:

mikado compare -r reference.gff3 --index

This will create a new file, reference.gff3.midx. If you inspect with eg. zless, you will notice it is a SQLite database, describing the locations and components of each gene on the annotation. Now that we have indexed the reference, we can perform the comparisons we are interested in:

  1. Reference vs. the input transcripts:
mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log;
  1. Reference vs. the subloci stage:
mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log;
  1. Reference vs the final output:
mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log;

Each of these comparisons will produce three files:

  • a tmap file, detailing the best match in the reference for each of the query transcripts;
  • a refmap file, detailing the best match among the query transcripts for each of the reference transcripts;
  • a stats file, summarising the comparisons.

The stats file for the input GTF should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log
7 reference RNAs in 5 genes
93 predicted RNAs in  64 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 95.97  29.39  45.00
            Exon level (stringent): 68.09  18.60  29.22
              Exon level (lenient): 90.91  31.25  46.51
                      Intron level: 94.74  45.57  61.54
                Intron chain level: 16.67  1.59  2.90
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 14.29  1.08  2.00
  Transcript level (>=80% base F1): 14.29  1.08  2.00
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 20.00  1.56  2.90
        Gene level (>=80% base F1): 20.00  1.56  2.90

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 1
             Matched intron chains: 1
   Matching monoexonic transcripts: 0
    Matched monoexonic transcripts: 0
        Total matching transcripts: 1
         Total matched transcripts: 1

          Missed exons (stringent): 15/47  (31.91%)
           Novel exons (stringent): 140/172  (81.40%)
            Missed exons (lenient): 4/44  (9.09%)
             Novel exons (lenient): 88/128  (68.75%)
                    Missed introns: 2/38  (5.26%)
                     Novel introns: 43/79  (54.43%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 24/93  (25.81%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 21/64  (32.81%)

For the subloci file, where we still have all the transcripts but we have split obvious chimeras, it should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log
7 reference RNAs in 5 genes
105 predicted RNAs in  26 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 95.96  29.24  44.83
            Exon level (stringent): 70.21  19.08  30.00
              Exon level (lenient): 88.89  32.00  47.06
                      Intron level: 94.74  46.75  62.61
                Intron chain level: 33.33  3.17  5.80
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 28.57  9.52  14.29
  Transcript level (>=80% base F1): 42.86  11.43  18.05
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 40.00  7.69  12.90
        Gene level (>=80% base F1): 60.00  11.54  19.35

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 3
             Matched intron chains: 2
   Matching monoexonic transcripts: 9
    Matched monoexonic transcripts: 1
        Total matching transcripts: 12
         Total matched transcripts: 3

          Missed exons (stringent): 14/47  (29.79%)
           Novel exons (stringent): 140/173  (80.92%)
            Missed exons (lenient): 5/45  (11.11%)
             Novel exons (lenient): 85/125  (68.00%)
                    Missed introns: 2/38  (5.26%)
                     Novel introns: 41/77  (53.25%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 24/105  (22.86%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 13/26  (50.00%)

A marked improvement can already be seen - we have now 105 transcripts instead of 93, and the total number of matching transcripts has increased from 1 to 3. Precision is still poor, however, as we have not discarded any transcript yet. Moreover, we have redundancy - 9 transcripts match the same monoexonic gene, and 3 transcripts match 2 intron chains in the reference. Finally, the comparison against the proper output (mikado.loci.gff3) should look like this:

Command line:
/usr/local/bin/mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log
7 reference RNAs in 5 genes
15 predicted RNAs in  8 genes
--------------------------------- |   Sn |   Pr |   F1 |
                        Base level: 85.74  64.73  73.77
            Exon level (stringent): 63.83  42.86  51.28
              Exon level (lenient): 80.00  52.94  63.72
                      Intron level: 89.47  59.65  71.58
                Intron chain level: 33.33  14.29  20.00
      Transcript level (stringent): 0.00  0.00  0.00
  Transcript level (>=95% base F1): 28.57  13.33  18.18
  Transcript level (>=80% base F1): 42.86  20.00  27.27
         Gene level (100% base F1): 0.00  0.00  0.00
        Gene level (>=95% base F1): 40.00  25.00  30.77
        Gene level (>=80% base F1): 60.00  37.50  46.15

#   Matching: in prediction; matched: in reference.

            Matching intron chains: 2
             Matched intron chains: 2
   Matching monoexonic transcripts: 1
    Matched monoexonic transcripts: 1
        Total matching transcripts: 3
         Total matched transcripts: 3

          Missed exons (stringent): 17/47  (36.17%)
           Novel exons (stringent): 40/70  (57.14%)
            Missed exons (lenient): 9/45  (20.00%)
             Novel exons (lenient): 32/68  (47.06%)
                    Missed introns: 4/38  (10.53%)
                     Novel introns: 23/57  (40.35%)

                Missed transcripts: 0/7  (0.00%)
                 Novel transcripts: 6/15  (40.00%)
                      Missed genes: 0/5  (0.00%)
                       Novel genes: 2/8  (25.00%)

After selecting the best transcripts in each locus, Mikado has discarded most of the incorrect transcripts while retaining most of the correct information; this can be seen in the increase in precision at eg. the nucleotide level (from 30% to 65%). The number of genes has also decreased, as Mikado has discarded many loci whose transcripts are just UTR fragments of neighbouring correct genes.

3.8. Analysing the tutorial data with Snakemake

The workflow described in this tutorial can be executed automatically using Snakemake [Snake] with this Snakefile. Just execute:

snakemake

in the directory where you have downloaded all of the tutorial files. In graph representation, this is how the pipeline looks like:

../_images/snakemake_dag.svg