Adding a new program
You need to perform a series of tasks to properly add a program to MIP. An overview of the steps can be found here:
- Print program name to MIPLOGG and STDOUT
- Call your custom subroutine (see below) with relevant parameters
- Write SBATCH headers
- Figure out i/o files
- Build the body of the SBATCH script
- Call FIDSubmitJob
More details follow below. Chanjo, a program which is part of the coverage analysis, will be used as an example.
Call DefineParameters
This subroutine takes a number of input parameters. There are three parameter types: “program”, “file”, and “attribute”. Try to group your parameter definitions with related programs.
DefineParameters("pChanjoBuild", "program", 1, "MIP", 0, "nofileEnding", "CoverageReport");
DefineParameters("chanjoBuildDb", "path", "CCDS.current.txt", "pChanjoBuild", "file");
DefineParameters("pChanjoCalculate", "program", 0, "MIP", 0, "nofileEnding", "MAIN");
DefineParameters("chanjoCalculateCutoff", "program", 10, "pChanjoCalculate", 0);
Parameter | Example | Description |
---|---|---|
Name | pChanjoBuild | Program names start with ‘p’ by convention, otherwise it’s up to you. |
Type | program | Can be either program or path. |
Default | 1 | Program: 1/0 as on/off, file: <path to file> or ‘nodefault’, attribute: e.g. 10 or ‘nodefault’ |
Associated program | MIP | Typically the program that calls this program. program: usually MIP, file/attribute: <Name>. |
Exists check | 0 | Perform a check that a file is in the reference directory. Either: 0, ‘file’, ‘directory’. |
File ending | nofileEnding | File ending when module is finished. MIP uses this to determine input files downstream in the Chain. file/attribute: skip. |
Chain | MAIN | The chain to which the program belongs. file/attribute: skip. |
Check install | chanjo | The program handle used to check whether the program is in the $PATH. file/attribute: skip. |
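To make the argument order in the table concrete, here is a minimal sketch of how such definitions could be stored. It is not MIP's actual implementation: the subroutine name `DefineParametersSketch` and every hash key except `'chain'` (which appears later in this guide) are assumptions made for illustration.

```perl
use strict;
use warnings;

my %parameter;    # Stand-in for MIP's global parameter hash

# Hypothetical sketch: store each definition under its parameter name.
# Argument order follows the table above.
sub DefineParametersSketch {
    my ($name, $type, $default, $associatedProgram, $existsCheck,
        $fileEnding, $chain, $programCheck) = @_;

    $parameter{$name} = {
        'type'              => $type,
        'default'           => $default,
        'associatedProgram' => $associatedProgram,
        'existsCheck'       => $existsCheck,
    };

    # File ending, chain and install check are skipped for files/attributes
    if ($type eq 'program' && defined $chain) {
        $parameter{$name}{'fileEnding'} = $fileEnding;
        $parameter{$name}{'chain'}      = $chain;
        $parameter{$name}{'programCheck'} = $programCheck if defined $programCheck;
    }
    return;
}

DefineParametersSketch("pChanjoBuild", "program", 1, "MIP", 0, "nofileEnding", "CoverageReport", "chanjo");
DefineParametersSketch("chanjoBuildDb", "path", "CCDS.current.txt", "pChanjoBuild", "file");

print "Chain for pChanjoBuild: $parameter{'pChanjoBuild'}{'chain'}\n";
```

Keeping everything in one hash keyed by parameter name is what makes later lookups such as `$parameter{'pChanjoBuild'}{'chain'}` possible.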
Command line arguments in GetOptions
This is the method that parses the command line input and stores the options. To add your own defined parameters you need to add lines like this:
'<short_option>|<long_option>:<s(tring)/n(umber)>' => \$parameter{'<long_option>'}{'value'},
You should replace anything that looks like <placeholder>:
'pCh|pChanjoBuild:n' => \$parameter{'pChanjoBuild'}{'value'}, # ChanjoBuild coverage analysis
'chbdb|chanjoBuildDb:s' => \$parameter{'chanjoBuildDb'}{'value'}, # Central SQLite database path
'pCh_C|pChanjoCalculate:n' => \$parameter{'pChanjoCalculate'}{'value'}, # Chanjo coverage analysis
'chccut|chanjoCalculateCutoff:n' => \$parameter{'chanjoCalculateCutoff'}{'value'}, # Cutoff used for completeness
Again, program options begin with a leading “p” by convention. Make sure you don’t cause any naming conflicts.
Lists can also be specified with a special syntax. Basically you need to assign the option to an array instead of $scriptParameters.
'ifd|inFilesDirs:s' => \@inFilesDirs, #Comma separated list
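Later in your code, when you would like to access those values, you join on “,” and split the result again. This way both repeated options and comma separated values end up as separate list elements.
@inFilesDirs = split(/,/, join(',', @inFilesDirs));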
Note
MIP doesn’t use True/False flags; all options take at least one argument. For program options it’s possible to turn on (1), turn off (0), and run programs in dry mode (2). All program options should specify “n(umber)” as argument type.
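The option syntax above can be tried in isolation. The following is a minimal sketch, not MIP code: it uses `GetOptionsFromArray` from the core `Getopt::Long` module to parse a simulated command line, and the option values and directory names are made up. Splitting the joined list is the idiom shown in the `Getopt::Long` documentation for accepting both repeated options and comma separated values.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Stand-ins for MIP's parameter hash and list option (hypothetical values)
my %parameter;
my @inFilesDirs;

# Simulated command line; a real script would parse @ARGV instead
my @args = (
    '--pChanjoBuild',          2,
    '--chanjoCalculateCutoff', 10,
    '--inFilesDirs',           'dirA,dirB',
    '--inFilesDirs',           'dirC',
);

GetOptionsFromArray(
    \@args,
    'pCh|pChanjoBuild:n'             => \$parameter{'pChanjoBuild'}{'value'},
    'chccut|chanjoCalculateCutoff:n' => \$parameter{'chanjoCalculateCutoff'}{'value'},
    'ifd|inFilesDirs:s'              => \@inFilesDirs,
) or die "Could not parse options\n";

# Flatten repeated and comma separated values into one list
@inFilesDirs = split(/,/, join(',', @inFilesDirs));

print "pChanjoBuild run mode: $parameter{'pChanjoBuild'}{'value'}\n";
print "Directories: @inFilesDirs\n";
```

Here the program option is given run mode 2 (dry run), and the list option ends up with three separate entries.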
if-block run checker in MAIN
The if-block checks whether the program is set to run but it also has a number of additional responsibilities.
Perhaps the most important is to define dependencies. This is done by placing your if-statement after the closest upstream process to yours. ChanjoBuild, for example, needs to wait until PicardToolsMarkDuplicates has finished processing the BAM files before running.
# Closest upstream dependency for Chanjo
if ($scriptParameter{'pPicardToolsMarkduplicates'} > 0) {
# Body...
}
# This is where Chanjo fits!
if ($scriptParameter{'pChanjoBuild'} > 0) {
# Body...
}
Next (inside the if-block) it should print an announcement to two file handles:
for my $fh (*STDOUT, *MIPLOGG) { print $fh "\nChanjoBuild\n"; }
Lastly, it should call a custom subroutine, e.g. once per sample or once per family. The subroutine writes one or more SBATCH scripts and submits them to SLURM, which executes the module.
Note
$sampleInfo is a hash table storing sample information, for example filename endings from different stages of the pipeline. It’s used to determine input filenames for your program.
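Put together, the if-block described in this section might look like the sketch below. The sample IDs, the aligner name, the keys `'familyID'` and `'aligner'`, and the stubbed `ChanjoBuild` body are assumptions for illustration; the log is written to an in-memory buffer rather than a real MIPLOGG file so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for MIP globals
my %scriptParameter = ('pChanjoBuild' => 1, 'familyID' => '11', 'aligner' => 'mosaik');
my @sampleIDs = ('11-1-1A', '11-2-1U');

# Log to an in-memory buffer instead of a real MIPLOGG file
my $logBuffer = '';
open(MIPLOGG, '>', \$logBuffer) or die "Cannot open log: $!";

my @calls;    # Records the calls made, for illustration only

# Stub: the real subroutine writes and submits an SBATCH script
sub ChanjoBuild {
    my ($sampleID, $familyID, $aligner) = @_;
    push @calls, "$sampleID/$familyID/$aligner";
    return;
}

if ($scriptParameter{'pChanjoBuild'} > 0) {    # 1 = run, 2 = dry run

    # Announce the program on both file handles
    for my $fh (*STDOUT, *MIPLOGG) { print $fh "\nChanjoBuild\n"; }

    # Call the custom subroutine once per sample
    for my $sampleID (@sampleIDs) {
        ChanjoBuild($sampleID, $scriptParameter{'familyID'}, $scriptParameter{'aligner'});
    }
}
close(MIPLOGG);
```

A program that works on the whole family would call its subroutine once instead of looping over samples.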
Custom subroutine
First up, let’s choose a relevant (and conflict-free) name for our subroutine.
sub ChanjoBuild {
# Body...
}
If we pass ALL necessary variables into the subroutine and assign them to locally scoped variables, it’s easy to get an overview of the variables used inside.
my $sampleID = $_[0];
my $familyID = $_[1];
my $aligner = $_[2];
# etc ...
a) SBATCH headers
SBATCH headers are written by the ProgramPreRequisites subroutine. It takes a number of input arguments.
ProgramPreRequisites($sampleID, "ChanjoBuild", "$aligner/coverageReport", 0, *CHANJOBUI, 1, $runtimeEst);
Parameter | Example | Description |
---|---|---|
Directory | 11-1-1A | Either a sample ID (e.g. IDN) or family ID depending on where output is stored. |
Program | chanjo | Used in SBATCH script filename. |
Program directory | $aligner/coverageReport | Defines output directory under Directory. Path should include current aligner by convention. |
Call type | 0 | Options: SNV, INDEL or BOTH. Can be set to: 0 ??? |
File handle | *CHANJO | The program specific file handle which will be written to when generating the SBATCH script. Always prepend: ‘*’. |
Cores | 1 | The number of cores to allocate. |
Process time | 1.5 | An estimate of the runtime for the particular sample in hours. |
b) Figure out i/o files
It’s up to you to figure out where your program should store output files. Basically you need to ask yourself whether putting them in the family or the sample folder makes the most sense.
It’s a good idea to first specify both in- and output directories.
my $baseDir = "$outDataDir/$sampleID/$aligner";
my $inDir = $baseDir;
my $outDir = "$baseDir/coverageReport";
If you depend on earlier scripts to generate infile(s) for the new program it’s up to you to figure out the closest program upstream. After that you can ask for the file ending.
my $infileEnding = $sampleInfo{ $familyID }{ $sampleID }{'pPicardToolsMarkduplicates'}{'fileEnding'};
$sampleInfo is a hash table in global scope.
MIP supports multiple infiles and therefore needs to check whether the files have been merged. This is done with the CheckIfMergedFiles subroutine, which returns the infile together with a merge switch: 1 (files were merged) or 0 (no merge).
my ($infile, $mergeSwitch) = CheckIfMergedFiles($sampleID);
Note
$infilesLaneNoEnding is a global hash table containing information about the filename-bases (compare filename-endings).
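The pieces above combine into full file paths roughly as follows. This is a sketch under stated assumptions: the output root, the infile base name (standing in for what CheckIfMergedFiles returns), the file ending `_sorted_pmd`, and the `.bam` and `.coverage.json` suffixes are all invented for illustration and may not match MIP's real naming.

```perl
use strict;
use warnings;

# Hypothetical values; in MIP these come from globals and earlier modules
my $outDataDir = '/data/exomes';
my ($familyID, $sampleID, $aligner) = ('11', '11-1-1A', 'mosaik');
my %sampleInfo = (
    $familyID => {
        $sampleID => {
            'pPicardToolsMarkduplicates' => { 'fileEnding' => '_sorted_pmd' },
        },
    },
);
my $infile = '11-1-1A_lanes_12';    # Stand-in for the CheckIfMergedFiles result

# Directories, as in the snippet above
my $baseDir = "$outDataDir/$sampleID/$aligner";
my $inDir   = $baseDir;
my $outDir  = "$baseDir/coverageReport";

# File ending produced by the closest upstream program
my $infileEnding = $sampleInfo{$familyID}{$sampleID}{'pPicardToolsMarkduplicates'}{'fileEnding'};

# Assumed naming scheme: <dir>/<infile base><file ending>.<suffix>
my $bamFile  = "$inDir/$infile$infileEnding.bam";
my $jsonPath = "$outDir/$infile$infileEnding.coverage.json";

print "Input:  $bamFile\n";
print "Output: $jsonPath\n";
```

Asking the upstream program for its file ending, rather than hard-coding it, keeps the module working if the upstream ending changes.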
c) Build SBATCH body
This is where you fit relevant parameters into your command line tool interface. Print everything to the file handle you defined above.
print CHANJOBUI "
# ------------------------------------------------------------
# Create a temp JSON file with exon coverage annotations
# ------------------------------------------------------------\n";
# Trailing spaces keep the arguments apart; the last print ends the command line
print CHANJOBUI "chanjo annotate $storePath using $bamFile ";
print CHANJOBUI "--cutoff $cutoff ";
print CHANJOBUI "--sample $sampleID ";
print CHANJOBUI "--group $familyID ";
print CHANJOBUI "--json $jsonPath\n";
# I'm done printing; let's drop the file handle
close(CHANJOBUI);
Note
A wait command should be added after starting multiple background processes with & in the same SBATCH script. This ensures that the job does not exit before all processes have finished.
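As a sketch of the `wait` pattern, the snippet below writes one background command per BAM file followed by a single `wait`. It prints to an in-memory lexical file handle instead of the bareword handle used in MIP, and the file names and database path are hypothetical; the chanjo command line simply repeats the form used above.

```perl
use strict;
use warnings;

# Collect the script body in memory instead of writing a real SBATCH file
my $script = '';
open(my $fh, '>', \$script) or die "Cannot open in-memory file: $!";

# Hypothetical values for illustration
my ($storePath, $cutoff, $sampleID, $familyID) = ('coverage.sqlite', 10, '11-1-1A', '11');
my @bamFiles = ('lane1.bam', 'lane2.bam');

for my $bamFile (@bamFiles) {
    (my $jsonPath = $bamFile) =~ s/\.bam$/.json/;
    print $fh "chanjo annotate $storePath using $bamFile ";
    print $fh "--cutoff $cutoff --sample $sampleID --group $familyID ";
    print $fh "--json $jsonPath &\n\n";    # '&' starts the command in the background
}

print $fh "wait\n";    # Block until every background process has exited
close($fh);

print $script;
```

Without the final `wait`, the shell would reach the end of the script while the background commands are still running.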
d) Call FIDSubmitJob
This subroutine is responsible for actually submitting the SBATCH script and handling dependencies. You should only call this if the program is supposed to run for real (not dry run).
if ( ($runMode == 1) && ($dryRunAll == 0) ) {
# ChanjoBuild is a terminally branching job: linear dependencies/no follow up
FIDSubmitJob($sampleID, $familyID, 2, $parameter{'pChanjoBuild'}{'chain'}, $filename, 0);
}
Parameter | Example | Description |
---|---|---|
Sample ID | 11-1-1A | The sample ID/person IDN |
Family ID | 11 | The family ID |
Dependency type | 2 | Choose between type 0-4 (see below) |
Chain key | $parameter{'pChanjoBuild'}{'chain'} | The chain defined in DefineParameters |
SBATCH filename | $filename | Always use this variable. It automagically points to your SBATCH script file. |
Script tracker | 0 | Huh? Something about parallel processes... |
To figure out which option (integer) to supply as the third argument to FIDSubmitJob you can take a look at this illustration.
Note
$filename is a variable that is created in ProgramPreRequisites. It points to your freshly composed SBATCH script file and should be supplied to FIDSubmitJob by all custom subroutines.
Note
$parameter{'pChanjoBuild'}{'chain'} is just the chain that you set in DefineParameters. In this case we could’ve replaced it with “MAIN”.
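The effect of the run mode check can be seen with a stub in place of the real submission routine. In this sketch `FIDSubmitJobStub` and `MaybeSubmit` are hypothetical helpers that only record what would have been submitted, and the script filename is made up.

```perl
use strict;
use warnings;

my @submitted;    # Filenames that would have been handed to SLURM

# Hypothetical stub with the same argument order as FIDSubmitJob
sub FIDSubmitJobStub {
    my ($sampleID, $familyID, $dependencyType, $chain, $filename, $scriptTracker) = @_;
    push @submitted, $filename;
    return;
}

# Submit only when the program runs for real and no global dry run is set
sub MaybeSubmit {
    my ($runMode, $dryRunAll, $filename) = @_;
    if ( ($runMode == 1) && ($dryRunAll == 0) ) {
        FIDSubmitJobStub('11-1-1A', '11', 2, 'MAIN', $filename, 0);
        return 1;
    }
    return 0;
}

my $filename = 'chanjoBuild_11-1-1A.sh';    # Hypothetical script name
my $realRun   = MaybeSubmit(1, 0, $filename);    # Run mode 1: submitted
my $dryRun    = MaybeSubmit(2, 0, $filename);    # Run mode 2: script written, not submitted
my $dryRunAll = MaybeSubmit(1, 1, $filename);    # Global dry run: not submitted

print "Submitted: @submitted\n";
```

In dry mode the SBATCH script is still generated, which makes it possible to inspect it before anything reaches the queue.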
Further information
For your convenience a template program module can be found in the project folder hosted on GitHub. [ADD LINK TO TEMPLATE]