.. _Tutorial: Tutorial ======== This tutorial will guide you through a simple analysis of Mikado, using a small amount of data coming from an experiment on *Arabidopsis thaliana*. RNA-Seq data was obtained from `study PRJEB7093 on ENA `_, aligned with STAR [STAR]_ against the `TAIR10 `_ reference genome, and assembled with four different programs. For this small example, we are going to focus on a small genomic region: Chr5, from 26,575,364 to 26,614,205. During this analysis, you will require the following files: * :download:`chr5.fas.gz `: a FASTA file containing the Chr5 of *A. thaliana*. * :download:`reference.gff3`: a GFF3 file with the annotation of the genomic slice we are interested in, for comparison purposes. * :download:`junctions.bed `: a BED12 file of reliable splicing junctions in the region, identified using Portcullis [Portcullis]_ * :download:`class.gtf`: a GTF file of transcripts assembled using CLASS [Class2]_ * :download:`cufflinks.gtf`: a GTF file of transcripts assembled using Cufflinks [Cufflinks]_ * :download:`stringtie.gtf`: a GTF file of transcripts assembled using Stringtie [StringTie]_ * :download:`trinity.gff3`: a GFF3 file of transcripts assembled using Trinity [Trinity]_ and aligned using GMAP [GMAP]_ * :download:`orfs.bed`: a BED12 file containing the ORFs of the above transcripts, derived using TransDecoder [Trinity]_ * :download:`uniprot_sprot_plants.fasta.gz`: a FASTA file containing the plant proteins released with SwissProt [Uniprot]_ All of this data can also be found in the ``sample_data`` directory of the `Mikado source `_. You will also require the following software: * a functioning installation of SQLite. * a functioning version of BLAST+ [Blastplus]_. * a functioning version of Prodigal [Prodigal]_. Creating the configuration file for Mikado ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In the first step, we need to create a configuration file to drive Mikado. To do so, we will first create a tab-delimited file describing our assemblies (class.gtf, cufflinks.gtf, stringtie.gtf, trinity.gff3):: class.gtf cl True False False True cufflinks.gtf cuff True False False True stringtie.gtf st True 1 False True True trinity.gff3 tr False -0.5 False False True reference.gff3 at True 5 True False False pacbio.bam pb True 1 False False False In this file, the first three fields define the following: #. The file location and name (if no folder is specified, Mikado will look for each file in the current working directory) #. An alias associated with the file, which has to be unique #. A binary flag (``True`` / ``False``) indicating whether the assembly is strand-specific or not These fields are then followed by a series of **optional** fields: #. A score associated with that sample. All transcripts associated with the label will have their score corrected by the value on this field. So eg. in this example all Stringtie models will receive an additional point, and all Trinity models will be penalised by half a point. Class and Cufflinks have no special bonus or malus associated. #. A binary flag (``True`` / ``False``) defining whether the sample is a reference or not. #. A binary flag (``True`` / ``False``) defining whether to exclude redundant models or not. #. A binary flag (``True`` / ``False``) indicating whether Mikado prepare should strip the CDS of faulty models, but otherwise keep their cDNA structure in the final output (``True``) or whether instead it should completely discard such models (``False``). #. A binary flag (``True`` / ``False``) instructing Mikado about whether the chimera split routine should be skipped for these models (``True``) or if instead it should proceed normally (``False``). Finally, we will create the configuration file itself using ``mikado configure``: .. code-block:: bash mikado configure --list list.txt --reference chr5.fas.gz --mode permissive --scoring plants.yaml --copy-scoring plants.yaml --junctions junctions.bed -bt uniprot_sprot_plants.fasta configuration.yaml This will create a configuration.yaml file with the parameters that were specified on the command line. This is :ref:`simplified configuration file `, containing all the necessary parameters for the Mikado run. It will also copy the ``plants.yaml`` file from the Mikado installation to your current working directory. .. hint:: Mikado can accept compressed genome FASTA files, like in this example, as long as they have been compressed with BGZip rather than the vanilla UNIX GZip. * *--list list.txt*: this part of the command line instructs Mikado to read the file we just created to understand where the input files are and how to treat them. * *--scoring*: the scoring file to use. Mikado ships with two pre-calculated scoring files, `plant.yaml` and `mammalian.yaml` * *--copy-scoring*: instruct Mikado to copy the scoring file from the installation directory to the current directory, so that the experimenter can modify it as needed. * *--reference chr5.fas*: this part of the command line instructs Mikado on the location of the genome file. * *--mode permissive*: the mode in which Mikado will treat cases of chimeras. See the :ref:`documentation ` for details. * *--junctions junctions.bed*: this part of the command line instructs Mikado to consider this file as the source of reliable splicing junctions. * *-bt uniprot_sprot_plants.fasta*: this part of the command line instructs Mikado to consider this file as the BLAST database which will be used for deriving homology information. .. hint:: The *--copy-scoring* argument is usually not necessary, however, it allows you to easily inspect the :ref:`scoring file ` we are going to use during this run. .. hint:: Mikado provides a handful of pre-configured scoring files for different species. However, we do recommend inspecting and tweaking your scoring file to cater to your species. We provide a guide on how to create your own configuration files :ref:`here `. Mikado prepare ~~~~~~~~~~~~~~ The subsequent step involves running ``mikado prepare`` to create a :ref:`sorted, non-redundant GTF with all the input assemblies `. As we have already created a configuration file with all the details regarding the input files, this will require us only to issue the command: .. code-block:: bash mikado prepare --json-conf configuration.yaml This command will create three files: #. *mikado_prepared.gtf*: one of the two main output files. This is a sorted, non-redundant GTF containing the transcripts from the four input GTFs #. *mikado_prepared.fasta*: a FASTA file of the transcripts present in *mikado_prepared.gtf*. #. *prepare.log*: the log of this step. This should look like the following, minus the timestamps:: 2016-08-10 13:53:58,443 - prepare - prepare.py:67 - INFO - setup - MainProcess - Command line: /usr/users/ga002/venturil/py351/bin/mikado prepare --json-conf configuration.yaml 2016-08-10 13:53:58,967 - prepare - prepare.py:206 - INFO - perform_check - MainProcess - Finished to analyse 95 transcripts (93 retained) 2016-08-10 13:53:58,967 - prepare - prepare.py:405 - INFO - prepare - MainProcess - Finished At the end of this phase, you should have 93 candidate transcripts, as 2 were redundant. BLAST of the candidate transcripts ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Although it is not strictly necessary, Mikado benefits from integrating homology data from BLAST. Mikado requires this data to be provided either in XML or ASN format (in the latter case, ``blast_formatter`` will be used to convert it in-memory to XML). To create this file, we will proceed as follows: #. Uncompress the SwissProt database: .. code-block:: bash gzip -dc uniprot_sprot_plants.fasta.gz > uniprot_sprot_plants.fasta #. Prepare the database for the BLAST: .. code-block:: bash makeblastdb -in uniprot_sprot_plants.fasta -dbtype prot -parse_seqids > blast_prepare.log #. Execute the BLAST, asking for XML output, and compress it to limit space usage. .. code-block:: bash blastx -max_target_seqs 5 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos btop" -num_threads 10 -query mikado_prepared.fasta -db uniprot_sprot_plants.fasta -out mikado_prepared.blast.tsv This will produce the ``mikado_prepared.blast.tsv`` file, which contains the homology information for the run. .. warning:: Mikado requires a **custom** tabular file from BLAST, as we rely on the information on extra fields such as e.g. ``btop``. Therefore the custom fields following ``-outfmt 6`` are **not** optional. ORF calculation for the transcripts ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Many of the metrics used by Mikado to evaluate and rank transcripts rely on the definition or their coding regions (CDS). It is therefore *highly recommended* to use an ORF predictor to define the coding regions for each transcript identified by `mikado prepare`. We directly support two different products: * :ref:`Prodigal `, a fast ORF predictor, capable of calculating thousands of ORFs in seconds. However, as it was originally developed for ORF calling in bacterial genomes, it may occasionally not provide the best possible answer. * :ref:`TransDecoder `, a slower ORF predictor that is however more specialised for eukaryotes. For this tutorial we are going to use Prodigal. Using it is very straighforward: .. code-block:: bash prodigal -i mikado_prepared.fasta -g 1 -o mikado.orfs.gff3 -f gff .. warning:: Prodigal by default uses the 'Bacterial' codon translation table, which is of course not appropriate at all for our eukariote genome. Therefore, it is essential to set ``-g 1`` on the command line. By the same token, as prodigal normally would output the CDS prediction in GenBank format (currently not supported by Mikado), we have to instruct Prodigal to emit its CDS predictions in GFF format. Mikado serialise ~~~~~~~~~~~~~~~~ This step involves running ``mikado serialise`` to create a SQLite database with all the information that mikado needs to perform its analysis. As most of the parameters are already specified inside the configuration file, the command line is quite simple: .. code-block:: bash mikado serialise --json-conf configuration.yaml --xml mikado_prepared.blast.tsv --orfs mikado.orfs.gff3 --blast_targets uniprot_sprot_plants.fasta --junctions junctions.bed After mikado serialise has run, it will have created two files: #. ``mikado.db``, the SQLite database that will be used by ``pick`` to perform its analysis. #. ``serialise.log``, the log of the run. If you inspect the SQLite database ``mikado.db``, you will see it contains nine different tables:: $ sqlite3 mikado.db ".tables" chrom hit orf external hsp query external_sources junctions target These tables contain the information coming from the genome FAI, the BLAST XML, the junctions BED file, the ORFs BED file, and finally the input transcripts and the proteins. There are two additional tables (``external`` and ``external_sources``) which in other runs would contain information on additional data, provided as tabular files. For more details on the database structure, please refer to the section on :ref:`this step ` in this documentation. Mikado pick ~~~~~~~~~~~ Finally, during this step ``mikado pick`` will integrate the data present in the database with the positional and structural data present in the GTF file :ref:`to select the best transcript models `. The command line to be issued is the following: .. code-block:: bash mikado pick --configuration configuration.yaml --subloci_out mikado.subloci.gff3 At this step, we have to specify only some parameters for ``pick`` to function: * *--configuration*: the configuration file. This is the only compulsory option. * *--subloci_out*: the partial results concerning the *subloci* step during the selection process will be written to ``mikado.subloci.gff3``. ``mikado pick`` will produce the following output files: * ``mikado.loci.gff3``, ``mikado.loci.metrics.tsv``, ``mikado.loci.scores.tsv``: the proper output files. These contain the location of the selected transcripts, their metrics, and their scores. Please see :ref:`this section for details `. * ``mikado.subloci.gff3``, ``mikado.subloci.metrics.tsv``, ``mikado.subloci.scores.tsv``: these files contain the same type of information as those above, but for the *subloci* stage. As such, all the transcripts in the input files are represented, not just those that are going to be selected as the best. * *mikado_pick.log*: the log file for this operation. Comparing files with the reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Finally, we can compare our files to the original reference annotation, and see how our results are compared to those. To do so, we will use :ref:`Mikado compare `. The first step is to index the reference annotation to make the comparisons faster: .. code-block:: bash mikado compare -r reference.gff3 --index This will create a new file, ``reference.gff3.midx``. If you inspect with eg. ``zless``, you will notice it is a SQLite database, describing the locations and components of each gene on the annotation. Now that we have indexed the reference, we can perform the comparisons we are interested in: #. Reference vs. the input transcripts: .. code-block:: bash mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log; #. Reference vs. the subloci stage: .. code-block:: bash mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log; #. Reference vs the final output: .. code-block:: bash mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log; Each of these comparisons will produce three files: * a *tmap* file, detailing the best match in the reference for each of the query transcripts; * a *refmap* file, detailing the best match among the query transcripts for each of the reference transcripts; * a *stats* file, summarising the comparisons. The *stats* file for the input GTF should look like this:: Command line: /usr/local/bin/mikado compare -r reference.gff3 -p mikado_prepared.gtf -o compare_input -l compare_input.log 7 reference RNAs in 5 genes 93 predicted RNAs in 64 genes --------------------------------- | Sn | Pr | F1 | Base level: 95.97 29.39 45.00 Exon level (stringent): 68.09 18.60 29.22 Exon level (lenient): 90.91 31.25 46.51 Intron level: 94.74 45.57 61.54 Intron chain level: 16.67 1.59 2.90 Transcript level (stringent): 0.00 0.00 0.00 Transcript level (>=95% base F1): 14.29 1.08 2.00 Transcript level (>=80% base F1): 14.29 1.08 2.00 Gene level (100% base F1): 0.00 0.00 0.00 Gene level (>=95% base F1): 20.00 1.56 2.90 Gene level (>=80% base F1): 20.00 1.56 2.90 # Matching: in prediction; matched: in reference. Matching intron chains: 1 Matched intron chains: 1 Matching monoexonic transcripts: 0 Matched monoexonic transcripts: 0 Total matching transcripts: 1 Total matched transcripts: 1 Missed exons (stringent): 15/47 (31.91%) Novel exons (stringent): 140/172 (81.40%) Missed exons (lenient): 4/44 (9.09%) Novel exons (lenient): 88/128 (68.75%) Missed introns: 2/38 (5.26%) Novel introns: 43/79 (54.43%) Missed transcripts: 0/7 (0.00%) Novel transcripts: 24/93 (25.81%) Missed genes: 0/5 (0.00%) Novel genes: 21/64 (32.81%) For the *subloci* file, where we still have all the transcripts but we have split obvious chimeras, it should look like this:: Command line: /usr/local/bin/mikado compare -r reference.gff3 -p mikado.subloci.gff3 -o compare_subloci -l compare_subloci.log 7 reference RNAs in 5 genes 105 predicted RNAs in 26 genes --------------------------------- | Sn | Pr | F1 | Base level: 95.96 29.24 44.83 Exon level (stringent): 70.21 19.08 30.00 Exon level (lenient): 88.89 32.00 47.06 Intron level: 94.74 46.75 62.61 Intron chain level: 33.33 3.17 5.80 Transcript level (stringent): 0.00 0.00 0.00 Transcript level (>=95% base F1): 28.57 9.52 14.29 Transcript level (>=80% base F1): 42.86 11.43 18.05 Gene level (100% base F1): 0.00 0.00 0.00 Gene level (>=95% base F1): 40.00 7.69 12.90 Gene level (>=80% base F1): 60.00 11.54 19.35 # Matching: in prediction; matched: in reference. Matching intron chains: 3 Matched intron chains: 2 Matching monoexonic transcripts: 9 Matched monoexonic transcripts: 1 Total matching transcripts: 12 Total matched transcripts: 3 Missed exons (stringent): 14/47 (29.79%) Novel exons (stringent): 140/173 (80.92%) Missed exons (lenient): 5/45 (11.11%) Novel exons (lenient): 85/125 (68.00%) Missed introns: 2/38 (5.26%) Novel introns: 41/77 (53.25%) Missed transcripts: 0/7 (0.00%) Novel transcripts: 24/105 (22.86%) Missed genes: 0/5 (0.00%) Novel genes: 13/26 (50.00%) A marked improvement can already be seen - we have now 105 transcripts instead of 93, and the total number of matching transcripts has increased from 1 to 3. Precision is still poor, however, as we have not discarded any transcript yet. Moreover, we have redundancy - 9 transcripts match the same monoexonic gene, and 3 transcripts match 2 intron chains in the reference. Finally, the comparison against the proper output (``mikado.loci.gff3``) should look like this:: Command line: /usr/local/bin/mikado compare -r reference.gff3 -p mikado.loci.gff3 -o compare -l compare.log 7 reference RNAs in 5 genes 15 predicted RNAs in 8 genes --------------------------------- | Sn | Pr | F1 | Base level: 85.74 64.73 73.77 Exon level (stringent): 63.83 42.86 51.28 Exon level (lenient): 80.00 52.94 63.72 Intron level: 89.47 59.65 71.58 Intron chain level: 33.33 14.29 20.00 Transcript level (stringent): 0.00 0.00 0.00 Transcript level (>=95% base F1): 28.57 13.33 18.18 Transcript level (>=80% base F1): 42.86 20.00 27.27 Gene level (100% base F1): 0.00 0.00 0.00 Gene level (>=95% base F1): 40.00 25.00 30.77 Gene level (>=80% base F1): 60.00 37.50 46.15 # Matching: in prediction; matched: in reference. Matching intron chains: 2 Matched intron chains: 2 Matching monoexonic transcripts: 1 Matched monoexonic transcripts: 1 Total matching transcripts: 3 Total matched transcripts: 3 Missed exons (stringent): 17/47 (36.17%) Novel exons (stringent): 40/70 (57.14%) Missed exons (lenient): 9/45 (20.00%) Novel exons (lenient): 32/68 (47.06%) Missed introns: 4/38 (10.53%) Novel introns: 23/57 (40.35%) Missed transcripts: 0/7 (0.00%) Novel transcripts: 6/15 (40.00%) Missed genes: 0/5 (0.00%) Novel genes: 2/8 (25.00%) After selecting the best transcripts in each locus, Mikado has discarded most of the incorrect transcripts while retaining most of the correct information; this can be seen in the increase in precision at eg. the nucleotide level (from 30% to 65%). The number of genes has also decreased, as Mikado has discarded many loci whose transcripts are just UTR fragments of neighbouring correct genes. Analysing the tutorial data with Snakemake ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The workflow described in this tutorial can be executed automatically using Snakemake [Snake]_ with :download:`this Snakefile `. Just execute: .. code-block:: bash snakemake in the directory where you have downloaded all of the tutorial files. In graph representation, this is how the pipeline looks like: .. figure:: snakemake_dag.svg :align: center :scale: 50% :figwidth: 100%