.. _prepare:

Mikado prepare
==============

This is the first executive step of the Mikado pipeline. It will accomplish the following goals:

#. Collect annotations from disparate annotation files.
#. Remove redundant assemblies, i.e. assemblies that are *identical* across the various input files.
#. Determine the strand of the transcript junctions.
#. Ensure uniqueness of the transcript names.
#. Order the transcripts by locus.
#. Extract the transcript sequences.

Usage
~~~~~

``Mikado prepare`` allows some of the parameters present in the configuration file, e.g. the input files, to be overridden through command line options. Nevertheless, in the interest of reproducibility we advise configuring everything through the configuration file and supplying it to Mikado prepare without further modifications.

Available parameters:

* *configuration*: the most important parameter. This is the configuration file created through :ref:`Mikado configure `.
* *fasta*: reference genome. Required, either through the command line or through the configuration file.
* *out*: output GTF file, with the collapsed transcripts.
* *out_fasta*: output FASTA file of the collapsed transcripts.
* *start-method*: :ref:`multiprocessing start method `.
* *verbose*, *quiet*: flags to set the verbosity of Mikado prepare. Because the verbose output is very large, turning verbose mode on is not advised unless there is a problem to debug.
* *strand-specific*: if set, all assemblies will be treated as strand-specific.
* *strand-specific-assemblies*: comma-separated list of strand-specific assemblies.
* *strip-cds*: some aligners (e.g. GMAP) will try to calculate a CDS on the fly for alignments. Use this flag to discard all CDS information from all input transcripts.
* *exclude-redundant*: if set, this flag instructs Mikado to look for and simplify redundant intron chains. By default, this option is disabled, or enabled on a per-sample basis. See :ref:`this section for an explanation of redundancy removal in Mikado `, and the section on the list of input files below for an explanation of how to set this value on a per-sample basis (recommended).
* *codon-table*: Mikado prepare will check the validity of the ORFs of input models. This value indicates which codon table Mikado should use for this purpose. See the section on :ref:`the checks on CDSs `.
* *strip-faulty-cds*: when encountering a transcript with an invalid ORF due to e.g. in-frame stop codons, Mikado will usually discard the whole transcript. If this flag is set, Mikado will instead remove the CDS information and leave the transcript in place.
* *list*: as an alternative to specifying all the information on the command line, it is possible to give Mikado a *tab-separated* file with details of the files to use. See :ref:`this section for details `.
* *log*: log file. Optional; by default Mikado will print to standard error.
* *lenient*: flag. If set, multiexonic transcripts without any canonical splice site will be output as well. By default, they are discarded.
* *minimum-cdna-length*: minimum length of the transcripts to be kept, default 200 bp.
* *max-intron-size*: maximum length of introns for non-reference transcripts, default 1,000,000 bp. Transcripts with introns longer than this will be split into multiple pieces. **Note**: transcripts marked as "reference" will never be split in this way; Mikado will simply emit a warning in the log.
* *seed*: integer seed to use for reproducibility. By default, Mikado will use the seed set in the configuration file.
* *random-seed*: boolean switch. If selected, Mikado will use a random seed selected at runtime (and reported in the log).
* *single*: flag that disables multiprocessing. Mostly useful for debugging purposes.

Command line usage:

.. code-block:: bash

    $ mikado prepare --help
    usage: Mikado prepare [-h] [--fasta REFERENCE]
                          [--verbose | --quiet | -lv {DEBUG,INFO,WARN,ERROR}]
                          [--start-method {fork,spawn,forkserver}]
                          [-s | -sa STRAND_SPECIFIC_ASSEMBLIES] [--list LIST]
                          [-l LOG] [--lenient] [-m MINIMUM_CDNA_LENGTH]
                          [-MI MAX_INTRON_LENGTH] [-p PROCS] [-scds]
                          [--labels LABELS] [--codon-table CODON_TABLE]
                          [--single] [-od OUTPUT_DIR] [-o OUT] [-of OUT_FASTA]
                          [--configuration CONFIGURATION] [-er]
                          [--strip-faulty-cds] [--seed SEED | --random-seed]
                          [gff [gff ...]]

    positional arguments:
      gff                   Input GFF/GTF file(s).

    optional arguments:
      -h, --help            show this help message and exit
      --fasta REFERENCE, --reference REFERENCE
                            Genome FASTA file. Required.
      --verbose
      --quiet
      -lv {DEBUG,INFO,WARN,ERROR}, --log-level {DEBUG,INFO,WARN,ERROR}
                            Log level. Default: derived from the configuration; if
                            absent, INFO
      --start-method {fork,spawn,forkserver}
                            Multiprocessing start method.
      -s, --strand-specific
                            Flag. If set, monoexonic transcripts will be left on
                            their strand rather than being moved to the unknown
                            strand.
      -sa STRAND_SPECIFIC_ASSEMBLIES, --strand-specific-assemblies STRAND_SPECIFIC_ASSEMBLIES
                            Comma-delimited list of strand specific assemblies.
      --list LIST           Tab-delimited file containing rows with the following format: