.. _utils:

Mikado miscellaneous scripts
============================

All these utilities can be accessed with the ``mikado util`` CLI. They perform relatively minor tasks.

awk_gtf
~~~~~~~

This utility retrieves specific regions from a GTF file without breaking any transcript feature. Any transcript falling even partly within the specified coordinates is retained in its entirety.

Usage::

    $ mikado util awk_gtf --help
    usage: mikado.py util awk_gtf [-h] (-r REGION | --chrom CHROM) [-as]
                                  [--start START] [--end END]
                                  gtf [out]

    positional arguments:
      gtf
      out

    optional arguments:
      -h, --help            show this help message and exit
      -r REGION, --region REGION
                            Region defined as a string like <chrom>:<start>..<end>
      --chrom CHROM
      -as, --assume-sorted
      --start START
      --end END

.. _class-codes-command:

class_codes
~~~~~~~~~~~

This utility is used to obtain information about any class code or category thereof.

Usage::

    $ mikado util class_codes --help
    usage: mikado util class_codes [-h]
                                   [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                                   [-c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]]
                                   [-o OUT]
                                   [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} [{,=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u} ...]]

    Script to print out the class codes.

    positional arguments:
      {[],=,_,n,J,c,C,j,h,g,G,o,e,m,i,I,ri,rI,f,x,X,p,P,u}
                            Codes to query.
    optional arguments:
      -h, --help            show this help message and exit
      -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
      -c {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...], --category {Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} [{Intronic,Match,Alternative splicing,Unknown,Fragment,Overlap,Extension,Fusion} ...]
      -o OUT, --out OUT

convert
~~~~~~~

This utility converts between GTF and GFF3 files, and can also write BED12 output. It is limited to converting transcript features, and will therefore ignore any other feature present (transposons, loci, etc.). The output of the conversion to GFF3 is completely GFF3 compliant.

Usage::

    $ mikado util convert --help
    usage: mikado.py util convert [-h] [-of {bed12,gtf,gff3}] gf [out]

    positional arguments:
      gf
      out

    optional arguments:
      -h, --help            show this help message and exit
      -of {bed12,gtf,gff3}, --out-format {bed12,gtf,gff3}

.. _grep-command:

grep
~~~~

This utility extracts specific transcripts and genes from an input GTF/GFF3 file. As input, it requires a text file either in the format "<transcript id><tab><gene id>", or with simply one gene ID per line (in which case the ``--genes`` switch has to be invoked). If only some of the transcripts of a gene are included in the text file, the gene feature will be shrunk accordingly. The name is an obvious homage to the invaluable UNIX command that we all love.

Usage::

    $ mikado util grep --help
    usage: mikado.py util grep [-h] [-v] [--genes] ids gff [out]

    Script to extract specific models from GFF/GTF files.

    positional arguments:
      ids         ID file (format: mrna_id, gene_id - tab separated)
      gff         The GFF file to parse.
      out         Optional output file

    optional arguments:
      -h, --help  show this help message and exit
      -v          Exclude from the gff all the records in the id file.
      --genes     Flag. If set, the program expects as ids only a list of genes,
                  and will exclude/include all the transcripts children of the
                  selected genes.

.. _merge-blast-command:

merge_blast
~~~~~~~~~~~

This script merges multiple XML BLAST+ files into a single file. It is useful when the input data has been chunked into different FASTA files for submission to a cluster queue. It can also convert from ASN files and handle GZipped files.

Usage::

    $ mikado util merge_blast --help
    usage: mikado.py util merge_blast [-h] [-v] [-l LOG] [--out [OUT]]
                                      xml [xml ...]

    positional arguments:
      xml

    optional arguments:
      -h, --help         show this help message and exit
      -v, --verbose
      -l LOG, --log LOG
      --out [OUT]

.. _metrics-command:

metrics
~~~~~~~

This command generates the documentation regarding the available transcript metrics. It is generated dynamically by inspecting the code. The documentation in the :ref:`introduction ` is generated using this utility.

Usage::

    $ mikado util metrics --help
    usage: mikado util metrics [-h]
                               [-f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}]
                               [-o OUT]
                               [-c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]]
                               [metric [metric ...]]

    Simple script to obtain the documentation on the transcript metrics.

    positional arguments:
      metric

    optional arguments:
      -h, --help            show this help message and exit
      -f {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}, --format {fancy_grid,grid,html,jira,latex,latex_booktabs,mediawiki,moinmoin,orgtbl,pipe,plain,psql,rst,simple,textile,tsv}
                            Format of the table to be printed out.
      -o OUT, --out OUT     Optional output file
      -c {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...], --category {CDS,Descriptive,External,Intron,Locus,UTR,cDNA} [{CDS,Descriptive,External,Intron,Locus,UTR,cDNA} ...]
                            Available categories to select from.

.. _stat-command:

stats
~~~~~

This command generates a statistics file for GFF3/GTF files. The output is a table including the average, mode, and various quantiles for different features present in a typical GFF file (genes, introns, exons, cDNAs, etc.). The operation can be quite time-consuming for large files, in which case it is advisable to request multiple processors.

Usage::

    $ mikado util stats --help
    usage: mikado.py util stats [-h] [--only-coding] [-p PROCS] gff [out]

    GFF/GTF statistics script. It will compute median/average length of RNAs,
    exons, CDS features, etc.

    positional arguments:
      gff                   GFF file to parse.
      out

    optional arguments:
      -h, --help            show this help message and exit
      --only-coding
      -p PROCS, --processors PROCS

A typical example statistics file can be found :download:`here, for the TAIR10 annotation <./TAIR10.stats>`.

.. _trim-command:

trim
~~~~

This utility trims down the terminal exons of multiexonic transcripts, until either shrinking them to the desired maximum length or meeting the beginning/end of the CDS. It was used to generate the "trimmed" annotations for the analysis in the original Mikado paper.

Usage::

    $ mikado util trim --help
    usage: mikado.py util trim [-h] [-ml MAX_LENGTH] [--as-gtf] ann [out]

    positional arguments:
      ann                   Reference GTF/GFF output file.
      out

    optional arguments:
      -h, --help            show this help message and exit
      -ml MAX_LENGTH, --max_length MAX_LENGTH
                            Maximal length of trimmed terminal exons
      --as-gtf              Flag. If set, the output will be in GTF rather than
                            GFF3 format.

.. _included_scripts:

Included scripts
================

All the following scripts are included in the "util" folder in the source code, and will be available on the PATH after installation. Some of these scripts are used by the :ref:`Daijin` pipeline to produce statistics or perform other intermediate steps.

add_transcript_feature_to_gtf.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This script adds a top-level transcript feature to GTFs that lack it, e.g. those produced by CuffMerge [CuffMerge]_.

Usage::

    $ add_transcript_feature_to_gtf.py --help
    usage: Script to add a transcript feature to e.g. Cufflinks GTFs
           [-h] gtf [out]

    positional arguments:
      gtf         Input GTF
      out         Output file. Default: stdout.

    optional arguments:
      -h, --help  show this help message and exit

align_collect.py
~~~~~~~~~~~~~~~~

This script is used to collect statistics from `samtools stat `_.

Usage::

    $ align_collect.py --help
    usage: Script to collect info from multiple samtools stats files
           [-h] input [input ...]

    positional arguments:
      input       The list of samtools stats file to process

    optional arguments:
      -h, --help  show this help message and exit

asm_collect.py
~~~~~~~~~~~~~~

This script is used to collect statistics obtained from the :ref:`mikado util stats <stat-command>` utility. Output is printed directly to the screen.

Usage::

    $ asm_collect.py -h
    usage: Script to collect info from multiple mikado util stats files
           [-h] input [input ...]

    positional arguments:
      input       The list of mikado util stats file to process

    optional arguments:
      -h, --help  show this help message and exit

bam2gtf.py
~~~~~~~~~~

This script uses PySam to convert read alignments into a GTF file. It is mostly useful for converting BAM alignments of long reads (e.g. PacBio) into a format which Mikado can interpret and use.
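
As a rough illustration of this kind of conversion (a sketch, not the script's actual code), the function below turns one spliced alignment into 1-based, inclusive exon intervals by splitting its CIGAR at ``N`` (skipped region) operations. The name ``cigar_to_exons`` is hypothetical. Its inputs mirror the ``reference_start`` and ``cigartuples`` attributes that PySam exposes for an aligned read, and the numeric operation codes are those of the SAM/BAM specification.

.. code-block:: python

    # Numeric CIGAR operations, as used in BAM and returned by PySam:
    # 0=M, 1=I, 2=D, 3=N, 4=S, 5=H, 6=P, 7="=", 8=X
    REF_CONSUMING = {0, 2, 7, 8}  # operations that advance along the reference
    SKIP = 3                      # N: skipped region, interpreted as an intron


    def cigar_to_exons(reference_start, cigartuples):
        """Return exons as (start, end) tuples, 1-based and inclusive.

        reference_start is the 0-based leftmost aligned position;
        cigartuples is a list of (operation, length) pairs.
        """
        exons = []
        block_start = pos = reference_start
        for op, length in cigartuples:
            if op == SKIP:
                # Close the current exon and jump over the intron.
                exons.append((block_start + 1, pos))
                pos += length
                block_start = pos
            elif op in REF_CONSUMING:
                pos += length
            # Insertions, clips and padding do not move along the reference.
        exons.append((block_start + 1, pos))
        return exons


    # Alignment starting at 0-based position 99, CIGAR 5S50M1000N30M2I20M
    exons = cigar_to_exons(99, [(4, 5), (0, 50), (3, 1000), (0, 30), (1, 2), (0, 20)])
    print(exons)  # [(100, 149), (1150, 1199)]

Each interval could then be written as an ``exon`` line under a single ``transcript`` line spanning the first start and the last end.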
Usage::

    $ bam2gtf.py --help
    usage: Script to convert from BAM to GTF, for PB alignments [-h] bam [out]

    positional arguments:
      bam         Input BAM file
      out         Optional output file

    optional arguments:
      -h, --help  show this help message and exit

class_run.py
~~~~~~~~~~~~

Python3 wrapper for the CLASS [Class2]_ assembler. It will perform the necessary operations for the assembler (depth and call of the splicing junctions), and launch the program itself.

Usage::

    $ class_run.py --help
    usage: Quick utility to rewrite the wrapper for CLASS. [-h] [--clean]
           [--force] [-c CLASS_OPTIONS] [-p PROCESSORS] [--class_help] [-v]
           [bam] [out]

    positional arguments:
      bam                   Input BAM file.
      out                   Optional output file.

    optional arguments:
      -h, --help            show this help message and exit
      --clean               Flag. If set, remove tepmorary files.
      --force               Flag. If set, it forces recalculation of all
                            intermediate files.
      -c CLASS_OPTIONS, --class_options CLASS_OPTIONS
                            Additional options to be passed to CLASS. Default: no
                            additional options.
      -p PROCESSORS, --processors PROCESSORS
                            Number of processors to use with class.
      --class_help          If called, the wrapper will ask class to display its
                            help and exit.
      -v, --verbose

getFastaFromIds.py
~~~~~~~~~~~~~~~~~~

Script to extract a list of sequences from a FASTA file, using the `pyfaidx `_ [PyFaidx]_ module.

Usage::

    $ getFastaFromIds.py -h
    usage: getFastaFromIds.py [-h] [-v] list fasta [out]

    A simple script that retrieves the FASTA sequences from a file given a list of
    ids.

    positional arguments:
      list           File with the list of the ids to recover, one by line.
                     Alternatively, names separated by commas.
      fasta          FASTA file.
      out            Optional output file.

    optional arguments:
      -h, --help     show this help message and exit
      -v, --reverse  Retrieve entries which are not in the list, as in grep -v (a
                     homage).

gffjunc_to_bed12.py
~~~~~~~~~~~~~~~~~~~

Script to convert a GFF junction file to a BED12 file. Useful to format the input for Mikado serialise.
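
To make the target layout concrete, here is a minimal sketch of how a single junction could be written as a BED12 line. It is not the script's actual code. The function name ``junction_to_bed12`` is hypothetical, as is the assumption that each junction is described by the two exonic anchors flanking the intron, given in 1-based inclusive GFF coordinates. The sketch also assumes that ``thickStart``/``thickEnd`` delimit the intron and that the two blocks are the anchors.

.. code-block:: python

    def junction_to_bed12(chrom, left_exon, right_exon, strand, name, score=0):
        """Format one junction as a BED12 line.

        left_exon and right_exon are (start, end) tuples, 1-based and
        inclusive. BED coordinates are 0-based and half-open.
        """
        left_start, left_end = left_exon
        right_start, right_end = right_exon

        chrom_start = left_start - 1   # 0-based start of the left anchor
        chrom_end = right_end          # half-open end of the right anchor
        thick_start = left_end         # 0-based start of the intron
        thick_end = right_start - 1    # half-open end of the intron

        block_sizes = (left_end - left_start + 1, right_end - right_start + 1)
        block_starts = (0, right_start - 1 - chrom_start)

        fields = [
            chrom, chrom_start, chrom_end, name, score, strand,
            thick_start, thick_end,
            "0",  # itemRgb, unused here
            2,    # blockCount: one block per anchor
            ",".join(str(size) for size in block_sizes),
            ",".join(str(start) for start in block_starts),
        ]
        return "\t".join(str(field) for field in fields)


    print(junction_to_bed12("chr1", (100, 150), (400, 460), "+", "junc1"))

In this example the intron spans positions 151 to 399 in 1-based coordinates, which becomes ``thickStart`` 150 and ``thickEnd`` 399 in the BED line.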
Usage::

    $ gffjunc_to_bed12.py --help
    usage: GFF=>BED12 converter [-h] gff [out]

    positional arguments:
      gff
      out

    optional arguments:
      -h, --help  show this help message and exit

grep.py
~~~~~~~

A script to extract data from *column* files, using a list of targets. More efficient than a standard "grep -f" for this niche case.

Usage::

    $ util/grep.py -h
    usage: grep.py [-h] [-v] [-s SEPARATOR] [-f FIELD] [-q] ids target [out]

    This script is basically an efficient version of the GNU "grep -f" utility for
    table-like files, and functions with a similar sintax.

    positional arguments:
      ids                   The file of patterns to extract
      target                The file to filter
      out                   The output file

    optional arguments:
      -h, --help            show this help message and exit
      -v, --reverse         Equivalent to the "-v" grep option
      -s SEPARATOR, --separator SEPARATOR
                            The field separator. Default: consecutive
                            whitespace(s)
      -f FIELD, --field FIELD
                            The field to look in the target file.
      -q, --quiet           No logging.

merge_junction_bed12.py
~~~~~~~~~~~~~~~~~~~~~~~

This script will merge [Portcullis]_-like junctions into a single BED12, using the thick start/ends as unique keys.

Usage::

    $ merge_junction_bed12.py --help
    usage: Script to merge BED12 files *based on the thickStart/End features*.
        Necessary for merging junction files such as those produced by TopHat
           [-h] [--delim DELIM] [-t THREADS] [--tophat] [-o OUTPUT] bed [bed ...]

    positional arguments:
      bed                   Input BED files. Use "-" for stdin.

    optional arguments:
      -h, --help            show this help message and exit
      --delim DELIM         Delimiter for merged names. Default: ;
      -t THREADS, --threads THREADS
                            Number of threads to use for multiprocessing. Default:
                            1
      --tophat              Flag. If set, tophat-like junction style is assumed.
                            This means that junctions are defined using the
                            blockSizes rather than thickStart/End. The script will
                            convert the lines to this latter format. By default,
                            the script assumes that the intron start/end are
                            defined using thickStart/End like in portcullis.
                            Mixed-type input files are not supported.
      -o OUTPUT, --output OUTPUT
                            Output file. Default: stdout

remove_from_embl.py
~~~~~~~~~~~~~~~~~~~

Quick script to remove sequences from a given organism from SwissProt files, and print them out in FASTA format. Used to produce the BLAST datasets for the Mikado paper.

Usage::

    $ remove_from_embl.py -h
    usage: Script to remove sequences specific of a given organism from a SwissProt file.
           [-h] -o ORGANISM [--format {fasta}] input [out]

    positional arguments:
      input
      out

    optional arguments:
      -h, --help            show this help message and exit
      -o ORGANISM, --organism ORGANISM
                            Organism to be excluded
      --format {fasta}      Output format. Choices: fasta. Default: fasta.

sanitize_blast_db.py
~~~~~~~~~~~~~~~~~~~~

Simple script to clean the headers of FASTA files, so as to avoid runtime errors and incongruences with BLAST and other tools which might be sensitive to long descriptions or the presence of special characters.

Usage::

    $ sanitize_blast_db.py --help
    usage: sanitize_blast_db.py [-h] [-o OUT] fasta [fasta ...]

    positional arguments:
      fasta

    optional arguments:
      -h, --help         show this help message and exit
      -o OUT, --out OUT

split_fasta.py
~~~~~~~~~~~~~~

This script splits a FASTA file into a fixed number of files, with an approximately equal number of sequences in each. If the number of sequences in the input file is lower than the number of requested splits, the script will create the necessary number of empty files. Used in :ref:`Daijin` for preparing the input data for the BLAST analysis.

Usage::

    $ split_fasta.py --help
    usage: Script to split FASTA sequences in a fixed number of multiple files.
           [-h] [-m NUM_FILES] fasta [out]

    positional arguments:
      fasta                 Input FASTA file.
      out                   Output prefix. Default: filename+split

    optional arguments:
      -h, --help            show this help message and exit
      -m NUM_FILES, --num-files NUM_FILES
                            Number of files to create.
                            Default: 1000

trim_long_introns.py
~~~~~~~~~~~~~~~~~~~~

This script parses an annotation file and truncates any transcript which has *UTR* introns over the provided threshold. In such cases, the UTR section after the long intron is simply removed.

Usage::

    $ trim_long_introns.py --help
    usage: This script truncates transcript with UTR exons separated by long introns.
           [-h] [-mi MAX_INTRON] gff [out]

    positional arguments:
      gff
      out

    optional arguments:
      -h, --help            show this help message and exit
      -mi MAX_INTRON, --max-intron MAX_INTRON
                            Maximum intron length for UTR introns.
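
The truncation rule described for ``trim_long_introns.py`` can be sketched as follows. This is an illustrative reading of the description, not the script's actual code. The function name ``trim_utr_introns`` and its arguments are hypothetical, coordinates are assumed to be 1-based and inclusive as in GFF/GTF, and "after the long intron" is taken to mean "further away from the CDS".

.. code-block:: python

    def trim_utr_introns(exons, cds_start, cds_end, max_intron):
        """Drop UTR exons lying beyond the first over-long UTR intron.

        exons is a list of (start, end) tuples; cds_start and cds_end are the
        genomic boundaries of the CDS, each falling inside an exon.
        """
        exons = sorted(exons)
        # Exons holding the two CDS boundaries: introns outside them are UTR introns.
        first = next(i for i, (s, e) in enumerate(exons) if s <= cds_start <= e)
        last = next(i for i, (s, e) in enumerate(exons) if s <= cds_end <= e)

        # Walk leftwards from the CDS and stop at the first long intron.
        left = 0
        for i in range(first, 0, -1):
            if exons[i][0] - exons[i - 1][1] - 1 > max_intron:
                left = i
                break

        # Walk rightwards from the CDS and stop at the first long intron.
        right = len(exons) - 1
        for i in range(last, len(exons) - 1):
            if exons[i + 1][0] - exons[i][1] - 1 > max_intron:
                right = i
                break

        return exons[left:right + 1]


    exons = [(100, 200), (5000, 5100), (5300, 5400), (5600, 5700), (20000, 20100)]
    print(trim_utr_introns(exons, 5050, 5650, max_intron=1000))
    # [(5000, 5100), (5300, 5400), (5600, 5700)]

Introns inside the CDS are never inspected, so only the untranslated ends of the transcript can be shortened.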