7.4.1. Mikado miscellaneous scripts

All these utilities can be accessed with the mikado util CLI. They perform relatively minor tasks.

7.4.1.1. awk_gtf

This utility is used to retrieve specific regions from a GTF file, without breaking any transcript feature. Any transcript falling even partly within the specified coordinates will be retained in its entirety.

Usage:

$ mikado util awk_gtf --help
usage: mikado.py util awk_gtf [-h] (-r REGION | --chrom CHROM) [-as]
                              [--start START] [--end END]
                              gtf [out]


positional arguments:
  gtf
  out

optional arguments:
  -h, --help            show this help message and exit
  -r REGION, --region REGION
                        Region defined as a string like <chrom>:<start>..<end>
  --chrom CHROM
  -as, --assume-sorted
  --start START
  --end END
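
The retention rule described above can be sketched in Python. This is only an illustration under stated assumptions, not Mikado's implementation: the function names `parse_region` and `keep_transcript` are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in GTF.

```python
import re


def parse_region(region):
    """Parse a '<chrom>:<start>..<end>' string into (chrom, start, end)."""
    match = re.fullmatch(r"(.+):(\d+)\.\.(\d+)", region)
    if match is None:
        raise ValueError(f"Invalid region string: {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"Start is after end in region: {region!r}")
    return chrom, start, end


def keep_transcript(tr_chrom, tr_start, tr_end, region):
    """Return True if the transcript overlaps the region, even partly.

    A transcript that passes this check would be kept in its entirety,
    including any exons that lie outside the requested coordinates.
    """
    chrom, start, end = region
    return tr_chrom == chrom and tr_start <= end and tr_end >= start
```

For example, with the region `Chr1:100..200`, a transcript spanning 150 to 500 on Chr1 passes the check and would be retained whole, while one starting at 201 would not.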

7.4.1.2. class_codes

This utility is used to obtain information about any class code or category thereof.

Usage:

$ mikado util class_codes --help
usage: mikado util class_codes [-h]
                               [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                               [-c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]]
                               [-o OUT]
                               [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} ...]]

Script to print out the class codes.

positional arguments:
  {[],=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u}
                        Codes to query.

optional arguments:
  -h, --help            show this help message and exit
  -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
  -c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...], --category {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]
  -o OUT, --out OUT

7.4.1.3. convert

This utility converts between GTF and GFF3 files, and can also produce BED12 output. It is limited to converting transcript features, and will therefore ignore any other feature present (transposons, loci, etc.). The output of the conversion to GFF3 is completely GFF3 compliant.

Usage:

$ mikado util convert --help
usage: mikado.py util convert [-h] [-of {bed12,gtf,gff3}] gf [out]

positional arguments:
  gf
  out

optional arguments:
  -h, --help            show this help message and exit
  -of {bed12,gtf,gff3}, --out-format {bed12,gtf,gff3}

7.4.1.4. grep

This utility extracts specific transcripts and genes from an input GTF/GFF3 file. As input, it requires a text file either in the format “<transcript id><tab><gene id>”, or with simply one gene ID per line (in which case the “--genes” switch has to be invoked). If only some of the transcripts of a gene are included in the text file, the gene feature will be shrunk accordingly. The name is an obvious homage to the invaluable UNIX command that we all love.

Usage:

$ mikado util grep --help
usage: mikado.py util grep [-h] [-v] [--genes] ids gff [out]

Script to extract specific models from GFF/GTF files.

positional arguments:
  ids         ID file (format: mrna_id, gene_id - tab separated)
  gff         The GFF file to parse.
  out         Optional output file

optional arguments:
  -h, --help  show this help message and exit
  -v          Exclude from the gff all the records in the id file.
  --genes     Flag. If set, the program expects as ids only a list of genes,
              and will exclude/include all the transcripts children of the
              selected genes.
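
The way a gene feature shrinks when only some of its transcripts are selected can be sketched as follows. This is a simplified illustration, not the code used by Mikado: `read_ids` and `shrink_gene` are hypothetical names, and transcripts are represented simply as a mapping from ID to (start, end).

```python
def read_ids(lines):
    """Collect the first tab-separated column of each non-empty line.

    For a "<transcript id><tab><gene id>" file this yields transcript IDs;
    for a one-gene-per-line file it yields gene IDs.
    """
    ids = set()
    for line in lines:
        first = line.rstrip("\n").split("\t")[0].strip()
        if first:
            ids.add(first)
    return ids


def shrink_gene(transcripts, kept_ids):
    """Return the (start, end) of a gene after discarding unselected transcripts.

    `transcripts` maps transcript ID -> (start, end). Returns None when no
    transcript of the gene was selected.
    """
    kept = [span for tid, span in transcripts.items() if tid in kept_ids]
    if not kept:
        return None
    return min(start for start, _ in kept), max(end for _, end in kept)
```

With transcripts `t1` (100 to 500) and `t2` (300 to 900), selecting only `t1` gives a gene spanning 100 to 500 rather than the original 100 to 900.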

7.4.1.5. merge_blast

This script merges together various XML BLAST+ files into a single entity. It might be of use when the input data has been chunked into different FASTA files for submission to a cluster queue. It is also capable of converting from ASN files and of dealing with GZipped files.

Usage:

$ mikado util merge_blast --help
usage: mikado.py util merge_blast [-h] [-v] [-l LOG] [--out [OUT]]
                                  xml [xml ...]

positional arguments:
  xml

optional arguments:
  -h, --help         show this help message and exit
  -v, --verbose
  -l LOG, --log LOG
  --out [OUT]

7.4.1.6. metrics

This command generates the documentation regarding the available transcript metrics. It is generated dynamically by inspecting the code. The documentation in the introduction is generated using this utility.

Usage:

$ mikado util metrics --help
usage: mikado util metrics [-h]
                           [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                           [-o OUT]
                           [-c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]]
                           [metric [metric ...]]

Simple script to obtain the documentation on the transcript metrics.

positional arguments:
  metric

optional arguments:
  -h, --help            show this help message and exit
  -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
                        Format of the table to be printed out.
  -o OUT, --out OUT     Optional output file
  -c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...], --category {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]
                        Available categories to select from.

7.4.1.7. stats

This command generates a statistics file for GFF3/GTF files. The output is a table including Average, Mode, and various quantiles for different features present in a typical GFF file (genes, introns, exons, cDNAs, etc.). The operation can be quite time-consuming for large files, in which case it is advisable to request multiple processors with the -p option.

Usage:

$ mikado util stats --help
usage: mikado.py util stats [-h] [--only-coding] [-p PROCS] gff [out]

GFF/GTF statistics script. It will compute median/average length of RNAs,
exons, CDS features, etc.

positional arguments:
  gff                   GFF file to parse.
  out

optional arguments:
  -h, --help            show this help message and exit
  --only-coding
  -p PROCS, --processors PROCS

A typical example statistics file can be found here, for the TAIR10 annotation.
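
The kind of per-feature summary described above (average, mode, quantiles) can be sketched with the Python standard library. This is a minimal sketch assuming a plain list of feature lengths; the function name `summarise` and the dictionary keys are hypothetical and do not reproduce the columns or quantile method of the real output.

```python
import statistics


def summarise(lengths):
    """Compute a few summary statistics for a list of feature lengths."""
    # Quartiles computed with the "inclusive" method, which treats the
    # data as the whole population rather than a sample.
    quartiles = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "average": statistics.mean(lengths),
        "mode": statistics.mode(lengths),
        "min": min(lengths),
        "25%": quartiles[0],
        "median": quartiles[1],
        "75%": quartiles[2],
        "max": max(lengths),
    }
```

Calling `summarise` once per feature type (exon lengths, intron lengths, and so on) would give one table row per feature.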

7.4.1.8. trim

This utility trims down the terminal exons of multiexonic transcripts, until either shrinking them to the desired maximum length or meeting the beginning/end of the CDS. It was used to generate the “trimmed” annotations for the analysis of the original Mikado paper.

Usage:

$ mikado util trim --help
usage: mikado.py util trim [-h] [-ml MAX_LENGTH] [--as-gtf] ann [out]

positional arguments:
  ann                   Reference GTF/GFF output file.
  out

optional arguments:
  -h, --help            show this help message and exit
  -ml MAX_LENGTH, --max_length MAX_LENGTH
                        Maximal length of trimmed terminal exons
  --as-gtf              Flag. If set, the output will be in GTF rather than
                        GFF3 format.
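
The trimming rule can be sketched as follows. This is an illustration under assumptions rather than the utility's actual code: `trim_terminal_exons` is a hypothetical name, exons are (start, end) tuples sorted by genomic position with 1-based inclusive coordinates, and `cds_start`/`cds_end` are the genomic boundaries of the CDS (None for non-coding transcripts).

```python
def trim_terminal_exons(exons, max_length, cds_start=None, cds_end=None):
    """Shorten the first and last exon to at most `max_length` bases.

    Trimming never removes coding sequence: if the CDS begins or ends inside
    a terminal exon, that exon is only trimmed up to the CDS boundary.
    Monoexonic transcripts are returned unchanged.
    """
    if len(exons) < 2:
        return list(exons)
    trimmed = list(exons)

    # Leftmost exon: move its start towards the internal splice site.
    first_start, first_end = trimmed[0]
    new_start = first_end - max_length + 1
    if cds_start is not None:
        new_start = min(new_start, cds_start)
    trimmed[0] = (max(first_start, new_start), first_end)

    # Rightmost exon: move its end towards the internal splice site.
    last_start, last_end = trimmed[-1]
    new_end = last_start + max_length - 1
    if cds_end is not None:
        new_end = max(new_end, cds_end)
    trimmed[-1] = (last_start, min(last_end, new_end))
    return trimmed
```

Internal exons and splice sites are left untouched; only the outer boundaries of the transcript move.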

7.4.2. Included scripts

All the following scripts are included in the “util” folder in the source code, and will be included on the PATH after installation. Some of these scripts are used by the Daijin pipeline to produce statistics or perform other intermediate steps.

7.4.2.1. add_transcript_feature_to_gtf.py

This script is needed to add a top-level transcript feature to GTFs that lack it, e.g. those produced by CuffMerge [CuffMerge].

Usage:

$ add_transcript_feature_to_gtf.py --help
usage: Script to add a transcript feature to e.g. Cufflinks GTFs
       [-h] gtf [out]

positional arguments:
  gtf         Input GTF
  out         Output file. Default: stdout.

optional arguments:
  -h, --help  show this help message and exit

7.4.2.2. align_collect.py

This script is used to collect statistics from samtools stats. Usage:

$ align_collect.py  --help
usage: Script to collect info from multiple samtools stats files
       [-h] input [input ...]

positional arguments:
  input       The list of samtools stats file to process

optional arguments:
  -h, --help  show this help message and exit

7.4.2.3. asm_collect.py

This script is used to collect statistics obtained from the mikado util stats utility. Output is printed directly to the screen. Usage:

$ asm_collect.py -h
usage: Script to collect info from multiple mikado util stats files
       [-h] input [input ...]

positional arguments:
  input       The list of mikado util stats file to process

optional arguments:
  -h, --help  show this help message and exit

7.4.2.4. bam2gtf.py

This script will use PySam to convert read alignments into a GTF file. Mostly useful to convert BAM alignments of long reads (e.g. PacBio) into a format which Mikado can interpret and use.

Usage:

$ bam2gtf.py --help
usage: Script to convert from BAM to GTF, for PB alignments [-h] bam [out]

positional arguments:
  bam         Input BAM file
  out         Optional output file

optional arguments:
  -h, --help  show this help message and exit

7.4.2.5. class_run.py

Python3 wrapper for the CLASS [Class2] assembler. It will perform the necessary operations for the assembler (depth and call of the splicing junctions), and launch the program itself. Usage:

$ class_run.py --help
usage: Quick utility to rewrite the wrapper for CLASS. [-h] [--clean]
                                                       [--force]
                                                       [-c CLASS_OPTIONS]
                                                       [-p PROCESSORS]
                                                       [--class_help] [-v]
                                                       [bam] [out]

positional arguments:
  bam                   Input BAM file.
  out                   Optional output file.

optional arguments:
  -h, --help            show this help message and exit
  --clean               Flag. If set, remove tepmorary files.
  --force               Flag. If set, it forces recalculation of all
                        intermediate files.
  -c CLASS_OPTIONS, --class_options CLASS_OPTIONS
                        Additional options to be passed to CLASS. Default: no
                        additional options.
  -p PROCESSORS, --processors PROCESSORS
                        Number of processors to use with class.
  --class_help          If called, the wrapper will ask class to display its
                        help and exit.
  -v, --verbose

7.4.2.6. getFastaFromIds.py

Script to extract a list of sequences from a FASTA file, using the pyfaidx [PyFaidx] module. Usage:

$ getFastaFromIds.py -h
usage: getFastaFromIds.py [-h] [-v] list fasta [out]

A simple script that retrieves the FASTA sequences from a file given a list of
ids.

positional arguments:
  list           File with the list of the ids to recover, one by line.
                 Alternatively, names separated by commas.
  fasta          FASTA file.
  out            Optional output file.

optional arguments:
  -h, --help     show this help message and exit
  -v, --reverse  Retrieve entries which are not in the list, as in grep -v (a
                 homage).

7.4.2.7. gffjunc_to_bed12.py

Script to convert a GFF junction file to a BED12 file. Useful to format the input for Mikado serialise.

Usage:

$ gffjunc_to_bed12.py --help
usage: GFF=>BED12 converter [-h] gff [out]

positional arguments:
  gff
  out

optional arguments:
  -h, --help  show this help message and exit

7.4.2.8. grep.py

A script to extract data from column files, using a list of targets. More efficient than a standard “grep -f” for this niche case.

Usage:

$ util/grep.py -h
usage: grep.py [-h] [-v] [-s SEPARATOR] [-f FIELD] [-q] ids target [out]

This script is basically an efficient version of the GNU "grep -f" utility for
table-like files, and functions with a similar sintax.

positional arguments:
  ids                   The file of patterns to extract
  target                The file to filter
  out                   The output file

optional arguments:
  -h, --help            show this help message and exit
  -v, --reverse         Equivalent to the "-v" grep option
  -s SEPARATOR, --separator SEPARATOR
                        The field separator. Default: consecutive
                        whitespace(s)
  -f FIELD, --field FIELD
                        The field to look in the target file.
  -q, --quiet           No logging.
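
The reason this approach can beat “grep -f” on tabular data is that each row needs only one set lookup on a single field, instead of a scan against every pattern. A minimal sketch of that idea follows; `filter_table` is a hypothetical name, and the 1-based field numbering is an assumption, not a documented behaviour of the script.

```python
def filter_table(ids, rows, field=1, separator=None, reverse=False):
    """Keep the rows whose chosen field is in `ids` (or is not, if reverse).

    `separator=None` splits on runs of consecutive whitespace.
    Rows with too few columns are skipped.
    """
    wanted = set(ids)  # constant-time membership tests
    kept = []
    for row in rows:
        columns = row.rstrip("\n").split(separator)
        if len(columns) < field:
            continue
        found = columns[field - 1] in wanted
        if found != reverse:
            kept.append(row)
    return kept
```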

7.4.2.9. merge_junction_bed12.py

This script will merge [Portcullis]-like junctions into a single BED12, using the thick start/ends as unique keys.

Usage:

$ merge_junction_bed12.py --help
usage: Script to merge BED12 files *based on the thickStart/End features*.
    Necessary for merging junction files such as those produced by TopHat
       [-h] [--delim DELIM] [-t THREADS] [--tophat] [-o OUTPUT] bed [bed ...]

positional arguments:
  bed                   Input BED files. Use "-" for stdin.

optional arguments:
  -h, --help            show this help message and exit
  --delim DELIM         Delimiter for merged names. Default: ;
  -t THREADS, --threads THREADS
                        Number of threads to use for multiprocessing. Default:
                        1
  --tophat              Flag. If set, tophat-like junction style is assumed.
                        This means that junctions are defined using the
                        blockSizes rather than thickStart/End. The script will
                        convert the lines to this latter format. By default,
                        the script assumes that the intron start/end are
                        defined using thickStart/End like in portcullis.
                        Mixed-type input files are not supported.
  -o OUTPUT, --output OUTPUT
                        Output file. Default: stdout
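
Merging on the thick start/end can be sketched as follows. This is only an illustration: `merge_junction_names` is a hypothetical name, the inclusion of chromosome and strand in the key is an assumption, and the sketch only joins the names of identical junctions, without reproducing how the real script handles scores or the other BED12 columns.

```python
from collections import defaultdict


def merge_junction_names(bed_lines, delim=";"):
    """Group BED12 junctions by (chrom, thickStart, thickEnd, strand).

    Returns a dictionary mapping each key to the joined names of all the
    junctions that share it.
    """
    merged = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        # BED12 columns: 0 chrom, 3 name, 5 strand, 6 thickStart, 7 thickEnd
        key = (fields[0], int(fields[6]), int(fields[7]), fields[5])
        merged[key].append(fields[3])
    return {key: delim.join(names) for key, names in merged.items()}
```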

7.4.2.10. remove_from_embl.py

Quick script to remove sequences from a given organism from SwissProt files, and print them out in FASTA format. Used to produce the BLAST datasets for the Mikado paper. Usage:

$ remove_from_embl.py -h
usage: Script to remove sequences specific of a given organism from a SwissProt file.
       [-h] -o ORGANISM [--format {fasta}] input [out]

positional arguments:
  input
  out

optional arguments:
  -h, --help            show this help message and exit
  -o ORGANISM, --organism ORGANISM
                        Organism to be excluded
  --format {fasta}      Output format. Choices: fasta. Default: fasta.

7.4.2.11. sanitize_blast_db.py

Simple script to clean the header of FASTA files, so as to avoid runtime errors and incongruencies with BLAST and other tools which might be sensitive to long descriptions or the presence of special characters.

Usage:

$ sanitize_blast_db.py --help
usage: sanitize_blast_db.py [-h] [-o OUT] fasta [fasta ...]

positional arguments:
  fasta

optional arguments:
  -h, --help         show this help message and exit
  -o OUT, --out OUT

7.4.2.12. split_fasta.py

This script is used to split a FASTA file into a fixed number of files, with an approximately equal number of sequences in each. If the number of sequences in the input file is lower than the number of requested splits, the script will create the necessary number of empty files. It is used in the Daijin pipeline to prepare the input data for the BLAST analysis. Usage:

$ split_fasta.py --help
usage: Script to split FASTA sequences in a fixed number of multiple files.
       [-h] [-m NUM_FILES] fasta [out]

positional arguments:
  fasta                 Input FASTA file.
  out                   Output prefix. Default: filename+split

optional arguments:
  -h, --help            show this help message and exit
  -m NUM_FILES, --num-files NUM_FILES
                        Number of files to create. Default: 1000
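
The splitting behaviour, including the empty chunks produced when there are fewer sequences than requested files, can be sketched as follows. This is a minimal sketch, not the script's code: `read_fasta` and `split_records` are hypothetical names, and the round-robin assignment is an assumption (the real script may assign contiguous blocks instead).

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from the lines of a FASTA file."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records, num_files):
    """Distribute records over exactly `num_files` chunks, round-robin.

    Chunk sizes differ by at most one; surplus chunks are left empty.
    """
    chunks = [[] for _ in range(num_files)]
    for index, record in enumerate(records):
        chunks[index % num_files].append(record)
    return chunks
```

Each chunk would then be written to its own output file, so that downstream jobs can always expect the same number of files.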

7.4.2.13. trim_long_introns.py

This script parses an annotation file and truncates any transcript which has UTR introns over the provided threshold. In such cases, the UTR section after the long intron is simply removed. Usage:

$ trim_long_introns.py --help
usage: This script truncates transcript with UTR exons separated by long introns.
       [-h] [-mi MAX_INTRON] gff [out]

positional arguments:
  gff
  out

optional arguments:
  -h, --help            show this help message and exit
  -mi MAX_INTRON, --max-intron MAX_INTRON
                        Maximum intron length for UTR introns.
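
The truncation rule can be sketched as follows. This is a simplified illustration for coding transcripts only, not the script's implementation: `trim_long_utr_introns` is a hypothetical name, exons are (start, end) tuples with 1-based inclusive coordinates, and `cds_start`/`cds_end` are the genomic boundaries of the CDS.

```python
def trim_long_utr_introns(exons, cds_start, cds_end, max_intron):
    """Drop UTR exons that lie beyond an over-long UTR intron.

    An intron counts as a UTR intron when it lies entirely before the CDS
    start or entirely after the CDS end in genomic coordinates.
    """
    exons = sorted(exons)
    first, last = 0, len(exons) - 1
    for i in range(len(exons) - 1):
        intron_start = exons[i][1] + 1
        intron_end = exons[i + 1][0] - 1
        if intron_end - intron_start + 1 <= max_intron:
            continue
        if intron_end < cds_start:
            # Long intron upstream of the CDS: keep only what follows it.
            # Later iterations overwrite this, so the long intron closest
            # to the CDS wins.
            first = i + 1
        elif intron_start > cds_end:
            # First long intron downstream of the CDS: stop here.
            last = i
            break
    return exons[first:last + 1]
```

Long introns inside the CDS are never touched by this sketch, since removing them would alter the coding sequence.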