Adding a new program

You need to perform a series of tasks to properly add a program to MIP. An overview of the steps can be found here:

  1. Call DefineParameters
  2. Command line arguments in GetOptions
  3. if-block run checker in MAIN
     1. Print program name to MIPLOGG and STDOUT
     2. Call your custom subroutine (see below) with relevant parameters
  4. Custom subroutine
     1. Write SBATCH headers
     2. Figure out i/o files
     3. Build the body of the SBATCH script
     4. Call FIDSubmitJob

More details follow below. Chanjo, a program which is part of the coverage analysis, will be used as an example.

Call DefineParameters

This subroutine takes a number of input parameters. There are basically three parameter types: “program”, “file”, and “attribute”. Try to group your parameter definitions with related programs.

DefineParameters("pChanjoBuild", "program", 1, "MIP", 0, "nofileEnding", "CoverageReport");

DefineParameters("chanjoBuildDb", "path", "CCDS.current.txt", "pChanjoBuild", "file");

DefineParameters("pChanjoCalculate", "program", 0, "MIP", 0, "nofileEnding", "MAIN");

DefineParameters("chanjoCalculateCutoff", "program", 10, "pChanjoCalculate", 0);

DefineParameters - parameters

Parameter | Example | Description
--- | --- | ---
Name | pChanjoBuild | Program names start with ‘p’ by convention, otherwise it’s up to you.
Type | program | Can be either program or path.
Default | 1 | program: 1/0 as on/off, file: <path to file> or ‘nodefault’, attribute: e.g. 10 or ‘nodefault’.
Associated program | MIP | Typically the program that calls this program. program: usually MIP, file/attribute: <Name>.
Exists check | 0 | Perform a check that a file is in the reference directory. Either: 0, ‘file’, ‘directory’.
File ending | nofileEnding | File ending when the module is finished. MIP uses this to determine input files downstream in the chain. file/attribute: skip.
Chain | MAIN | The chain to which the program belongs. file/attribute: skip.
Check install | chanjo | The program handle used to check whether the program is in the $PATH. file/attribute: skip.
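
To make the argument order concrete, here is a minimal sketch of how such definitions could end up in a hash. This is not MIP's actual implementation: the subroutine name DefineParametersSketch and all hash keys except 'chain' are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for MIP's global parameter hash.
my %parameter;

# Sketch only: argument order follows the table above, the storage layout is assumed.
sub DefineParametersSketch {
    my ($name, $type, $default, $associatedProgram, $existsCheck,
        $fileEnding, $chain, $checkInstall) = @_;

    $parameter{$name} = {
        type              => $type,
        default           => $default,
        associatedProgram => $associatedProgram,
        existsCheck       => $existsCheck,
    };

    # Program-only fields; file/attribute definitions skip them.
    $parameter{$name}{fileEnding}   = $fileEnding   if defined $fileEnding;
    $parameter{$name}{chain}        = $chain        if defined $chain;
    $parameter{$name}{checkInstall} = $checkInstall if defined $checkInstall;
    return;
}

DefineParametersSketch("pChanjoBuild", "program", 1, "MIP", 0, "nofileEnding", "CoverageReport");
DefineParametersSketch("chanjoCalculateCutoff", "program", 10, "pChanjoCalculate", 0);

print "pChanjoBuild chain: $parameter{'pChanjoBuild'}{'chain'}\n";
```

Note how the attribute definition simply leaves out the trailing program-only arguments.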

Command line arguments in GetOptions

This is the method that parses the command line input and stores the options. To add your own defined parameters you need to add lines like this:

'<short_option>|<long_option>:<s(tring)/n(umber)>' => \$parameter{'<long_option>'}{'value'},

You should replace anything that looks like <placeholder>:

'pCh|pChanjoBuild:n' => \$parameter{'pChanjoBuild'}{'value'},  # ChanjoBuild coverage analysis
'chbdb|chanjoBuildDb:s' => \$parameter{'chanjoBuildDb'}{'value'},  # Central SQLite database path
'pCh_C|pChanjoCalculate:n' => \$parameter{'pChanjoCalculate'}{'value'}, # Chanjo coverage analysis
'chccut|chanjoCalculateCutoff:n' => \$parameter{'chanjoCalculateCutoff'}{'value'}, # Cutoff used for completeness

Again, program options begin with a leading “p” by convention. Make sure you don’t cause any naming conflicts.
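
If you want to try an option spec before wiring it into MIP, a small standalone script along these lines may help. It is only a sketch: the argument list and the file name coverage.sqlite are invented, and GetOptionsFromArray is used so the example doesn't depend on the real command line.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

my %parameter;

# Stand-in for @ARGV; the values are made up for illustration.
my @args = ('-pChanjoBuild', '2', '-chbdb', 'coverage.sqlite', '-chccut', '15');

GetOptionsFromArray(
    \@args,
    'pCh|pChanjoBuild:n'             => \$parameter{'pChanjoBuild'}{'value'},
    'chbdb|chanjoBuildDb:s'          => \$parameter{'chanjoBuildDb'}{'value'},
    'chccut|chanjoCalculateCutoff:n' => \$parameter{'chanjoCalculateCutoff'}{'value'},
) or die "Could not parse options\n";

# Both the short and the long option name write to the same hash entry.
for my $name (sort keys %parameter) {
    print "$name = $parameter{$name}{'value'}\n";
}
```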

Lists can also be specified with a special syntax. Basically you need to assign the option to an array instead of an entry in %parameter.

'ifd|inFilesDirs:s'  => \@inFilesDirs, #Comma separated list

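Later in your code, when you would like to access those values, you would join on "," and split the result again. That way both repeated options and comma separated lists end up as individual elements.

@inFilesDirs = split(/,/, join(',', @inFilesDirs));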
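
The usual Getopt::Long idiom for an option that may be repeated or given as a comma separated list is to join the collected values and then split them again. A minimal sketch, with invented directory names:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

my @inFilesDirs;

# Stand-in for @ARGV: one comma separated value plus one repeated option.
my @args = ('-ifd', 'dirA,dirB', '-ifd', 'dirC');

GetOptionsFromArray(\@args, 'ifd|inFilesDirs:s' => \@inFilesDirs)
    or die "Could not parse options\n";

# Right after parsing, the array holds ('dirA,dirB', 'dirC').
# Join everything into one string, then split on commas to get single entries.
@inFilesDirs = split(/,/, join(',', @inFilesDirs));

print join(' ', @inFilesDirs), "\n";
```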

Note

MIP doesn’t use True/False flags; all options take at least one argument. For program options it’s possible to turn programs on (1), off (0), or run them in dry mode (2). All program options should specify “n(umber)” as argument type.

if-block run checker in MAIN

The if-block checks whether the program is set to run but it also has a number of additional responsibilities.

Perhaps the most important is to define dependencies. This is done by placing your if-statement after the closest upstream process to yours. ChanjoBuild, for example, needs to wait until PicardToolsMarkDuplicates has finished processing the BAM-files before running.

# Closest upstream dependency for Chanjo
if ($scriptParameter{'pPicardToolsMarkduplicates'} > 0) {
  # Body...
}

# This is where Chanjo fits!
if ($scriptParameter{'pChanjoBuild'} > 0) {
  # Body...
}

Next (inside the if-block) it should print an announcement to two file handles:

# Pass the handles as globs so this also compiles under "use strict"
for my $fh (*STDOUT, *MIPLOGG) { print $fh "\nChanjoBuild\n"; }

Lastly, it should call a custom subroutine, e.g. once per sample or once per family. The subroutine writes one or more SBATCH scripts and submits them to SLURM, which executes the module.
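
Put together, the if-block could look roughly like the sketch below. Everything except the names taken from the text above is a stand-in: the sample IDs, the aligner name, the in-memory log and the body of ChanjoBuild are placeholders so the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-ins for MIP globals (values invented for illustration).
my %scriptParameter = ('pChanjoBuild' => 1);
my @sampleIDs = ('11-1-1A', '11-2-1U');
my $familyID  = '11';
my $aligner   = 'myAligner';

# Log to an in-memory buffer instead of a real MIPLOGG file.
my $logBuffer = '';
open(MIPLOGG, '>', \$logBuffer) or die "Cannot open in-memory log: $!";

my @calls;

# Placeholder for the custom subroutine; the real one writes and submits an SBATCH script.
sub ChanjoBuild {
    my ($sampleID, $familyID, $aligner) = @_;
    push @calls, "$familyID/$sampleID/$aligner";
    return;
}

if ($scriptParameter{'pChanjoBuild'} > 0) {
    # Announce the program on both handles
    for my $fh (*STDOUT, *MIPLOGG) { print $fh "\nChanjoBuild\n"; }

    # One call per sample
    for my $sampleID (@sampleIDs) {
        ChanjoBuild($sampleID, $familyID, $aligner);
    }
}
close(MIPLOGG);
```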

Note

$sampleInfo is a hash table storing sample information, for example filename endings from different stages of the pipeline. It’s used to determine input filenames for your program.

Custom subroutine

First up, let’s choose a relevant (and conflict free) name for our subroutine.

sub ChanjoBuild {
  # Body...
}

If we pass ALL necessary variables into the subroutine and assign them to scoped variables, it’s easy to get an overview of the variables used inside.

my $sampleID = $_[0];
my $familyID = $_[1];
my $aligner = $_[2];
# etc ...
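
The same unpacking can also be written as a single list assignment, which is a common Perl idiom. A small sketch (the subroutine name and the return value are only there to make it runnable):

```perl
use strict;
use warnings;

# Hypothetical subroutine; only the argument unpacking is the point here.
sub ChanjoBuildArgsSketch {
    # Unpack all arguments at once, in the order the caller passes them.
    my ($sampleID, $familyID, $aligner) = @_;
    return "$familyID/$sampleID/$aligner";
}

my $label = ChanjoBuildArgsSketch('11-1-1A', '11', 'myAligner');
print "$label\n";
```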

a) SBATCH headers

SBATCH headers are written by the ProgramPreRequisites subroutine. It takes a number of input arguments.

ProgramPreRequisites($sampleID, "ChanjoBuild", "$aligner/coverageReport", 0, *CHANJOBUI, 1, $runtimeEst);

ProgramPreRequisites - parameters

Parameter | Example | Description
--- | --- | ---
Directory | 11-1-1A | Either a sample ID (e.g. IDN) or family ID depending on where output is stored.
Program | chanjo | Used in SBATCH script filename.
Program directory | $aligner/coverageReport | Defines output directory under Directory. Path should include current aligner by convention.
Call type | 0 | Options: SNV, INDEL or BOTH. Can also be set to 0; the meaning of 0 is not documented here.
File handle | *CHANJO | The program specific file handle which will be written to when generating the SBATCH script. Always prepend: ‘*’.
Cores | 1 | The number of cores to allocate.
Process time | 1.5 | An estimate of the runtime for the particular sample in hours.

b) Figure out i/o files

It’s up to you to figure out where your program should store output files. Basically you need to ask yourself whether putting them in the family or the sample folder makes the most sense.

It’s a good idea to first specify both in- and output directories.

my $baseDir = "$outDataDir/$sampleID/$aligner";
my $inDir = $baseDir;
my $outDir = "$baseDir/coverageReport";

If you depend on earlier scripts to generate infile(s) for the new program it’s up to you to figure out the closest program upstream. After that you can ask for the file ending.

my $infileEnding = $sampleInfo{ $familyID }{ $sampleID }{'pPicardToolsMarkduplicates'}{'fileEnding'};

$sampleInfo is a hash table in global scope.
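
As a rough sketch of how these pieces fit together, the snippet below looks up a file ending in a mock $sampleInfo hash and builds an input path from it. The file ending, the infile basename, the output root and the assumption that the BAM file is named <infile><ending>.bam are all invented for illustration.

```perl
use strict;
use warnings;

# Mock of the global hash; '_sorted_pmd' is a made-up file ending.
my %sampleInfo = (
    '11' => {
        '11-1-1A' => {
            'pPicardToolsMarkduplicates' => { 'fileEnding' => '_sorted_pmd' },
        },
    },
);

my ($familyID, $sampleID, $aligner) = ('11', '11-1-1A', 'myAligner');
my $outDataDir = '/path/to/outData';  # placeholder output root
my $infile     = 'sample_lane1';      # placeholder filename base

# Directories, as in the text above
my $baseDir = "$outDataDir/$sampleID/$aligner";
my $inDir   = $baseDir;
my $outDir  = "$baseDir/coverageReport";

# Ask for the file ending of the closest upstream program
my $infileEnding = $sampleInfo{$familyID}{$sampleID}{'pPicardToolsMarkduplicates'}{'fileEnding'};

# Assumed naming scheme: filename base + file ending + '.bam'
my $bamFile = "$inDir/$infile$infileEnding.bam";
print "$bamFile\n";
```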

MIP supports multiple infiles and therefore needs to check whether the file(s) have been merged or not. This is done with the CheckIfMergedFiles subroutine, which returns the infile together with a merge switch that is either 1 (files were merged) or 0 (no merge of files).

my ($infile, $mergeSwitch) = CheckIfMergedFiles($sampleID);

Note

$infilesLaneNoEnding is a global hash table containing information about the filename-bases (compare filename-endings).

c) Build SBATCH body

This is where you fit relevant parameters into your command line tool interface. Print everything to the file handle you defined above.

print CHANJOBUI "
# ------------------------------------------------------------
#  Create a temp JSON file with exon coverage annotations
# ------------------------------------------------------------\n";
# Each fragment ends with a space so the options don't run together in the script
print CHANJOBUI "chanjo annotate $storePath using $bamFile ";
print CHANJOBUI "--cutoff $cutoff ";
print CHANJOBUI "--sample $sampleID ";
print CHANJOBUI "--group $familyID ";
print CHANJOBUI "--json $jsonPath\n";

# I'm done printing; let's drop the file handle
close(CHANJOBUI);

Note

A wait command should be added after starting multiple background processes with & in the same SBATCH script. This ensures that the script, and therefore the SLURM job, doesn’t exit before all processes have finished.
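
A minimal sketch of that pattern is shown below. It writes the script body to an in-memory buffer through a lexical file handle so it can be run anywhere; the BAM file names, the database path and the cutoff value are invented, and the chanjo command line is simply copied from the example above.

```perl
use strict;
use warnings;

# Placeholder values for illustration only.
my $storePath = 'coverage.sqlite';
my $cutoff    = 10;
my @bamFiles  = ('sampleA.bam', 'sampleB.bam');

# Write to a string instead of a real SBATCH file.
my $script = '';
open(my $SBATCH, '>', \$script) or die "Cannot open in-memory script: $!";

for my $bamFile (@bamFiles) {
    # The trailing '&' sends each command to the background.
    print $SBATCH "chanjo annotate $storePath using $bamFile --cutoff $cutoff &\n";
}

# Block until every background process has finished.
print $SBATCH "wait\n";
close($SBATCH);

print $script;
```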

d) Call FIDSubmitJob

This subroutine is responsible for actually submitting the SBATCH script and handling dependencies. You should only call this if the program is supposed to run for real (not dry run).

if ( ($runMode == 1) && ($dryRunAll == 0) ) {
  # ChanjoBuild is a terminally branching job: linear dependencies/no follow up
  FIDSubmitJob($sampleID, $familyID, 2, $parameter{'pChanjoBuild'}{'chain'}, $filename, 0);
}

FIDSubmitJob - parameters

Parameter | Example | Description
--- | --- | ---
Sample ID | 11-1-1A | The sample ID/person IDN.
Family ID | 11 | The family ID.
Dependency type | 2 | Choose between type 0-4 (see below).
Chain key | $parameter{'pChanjoBuild'}{'chain'} | The chain defined in DefineParameters.
SBATCH filename | $filename | Always use this variable. It automagically points to your SBATCH script file.
Script tracker | 0 | Not documented here; appears to relate to parallel processes.
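
The dry-run guard around FIDSubmitJob can be tried out in isolation with a stub, as in the sketch below. FIDSubmitJobStub and SubmitUnlessDryRun are hypothetical helpers written for this example (MIP's real FIDSubmitJob submits to SLURM), and the SBATCH filename is invented.

```perl
use strict;
use warnings;

my @submitted;

# Stand-in for FIDSubmitJob: records its arguments instead of submitting anything.
sub FIDSubmitJobStub {
    my @jobArgs = @_;
    push @submitted, \@jobArgs;
    return;
}

# Hypothetical wrapper: only submit when the program runs for real.
sub SubmitUnlessDryRun {
    my ($runMode, $dryRunAll, @jobArgs) = @_;
    return 0 unless ($runMode == 1) && ($dryRunAll == 0);
    FIDSubmitJobStub(@jobArgs);
    return 1;
}

# Sample ID, family ID, dependency type, chain, SBATCH filename, script tracker
my @jobArgs = ('11-1-1A', '11', 2, 'CoverageReport', 'chanjobuild.sh', 0);

my $ranForReal   = SubmitUnlessDryRun(1, 0, @jobArgs);  # run mode 1: submitted
my $ranInDryMode = SubmitUnlessDryRun(2, 0, @jobArgs);  # dry mode 2: skipped

print "submitted jobs: ", scalar(@submitted), "\n";
```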

To figure out which option (integer) to supply as the third argument to FIDSubmitJob you can take a look at this illustration.

[Figure: FIDSubmitJob dependency types 0-4 (_images/FIDsubmit.png)]

Note

$filename is a variable that is created in ProgramPreRequisites. It points to your freshly composed SBATCH script file and should be supplied to FIDSubmitJob by all custom subroutines.

Note

$parameter{'pChanjoBuild'}{'chain'} is just the chain that you set in DefineParameters. In this case we could’ve replaced it with “CoverageReport”, the chain given for pChanjoBuild in the DefineParameters example above.

Further information

For your convenience, a template program module can be found in the project folder hosted on GitHub. [ADD LINK TO TEMPLATE]