The workflow table splitting module
Introduction
This module is used when you want to split a file into multiple parts, normally so that analysis can proceed in parallel. The most common example is splitting the list of templates output by a template bank generation code so that a set of matched-filter jobs can analyse that bank in parallel. If you want to do something similar, this module is the place to do it.
The return of the table splitting module is a pycbc FileList of the split files generated by this module.
Usage
Using this module requires a number of things:
A configuration file (or files) containing the information needed to tell this module how to split the input files (described below).
An initialized instance of the pycbc Workflow class, containing the ConfigParser.
A FileList of the files that are to be split.
This module is then called according to:
- pycbc.workflow.setup_splittable_workflow(workflow, input_tables, out_dir=None, tags=None)
This function aims to be the gateway for code that is responsible for taking some input file containing some table, and splitting it into multiple files containing different parts of that table. For now the only supported operation is splitting a template bank xml file into multiple template bank xml files with a splitbank executable (pycbc_splitbank or lalapps_splitbank).
- Parameters
workflow (pycbc.workflow.core.Workflow) – The Workflow instance that the jobs will be added to.
input_tables (pycbc.workflow.core.FileList) – The input files to be split up.
out_dir (path) – The directory in which output will be written.
- Returns
split_table_outs – The list of split up files as output from this job.
- Return type
pycbc.workflow.core.FileList
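As a hedged illustration of the call above, the sketch below wraps it in a small helper. The helper name `split_template_banks` and the default directory name `splitbank` are hypothetical, and it assumes that `workflow` is an already initialized Workflow instance and `template_banks` is a FileList of the files to split. Only the function name and its arguments are taken from the signature documented here.

```python
def split_template_banks(workflow, template_banks, out_dir="splitbank", tags=None):
    """Hypothetical helper: split every file in ``template_banks``.

    ``workflow`` is assumed to be an initialized pycbc Workflow and
    ``template_banks`` a FileList of the files to be split.
    """
    # Imported lazily so that the helper can be defined even where
    # PyCBC is not importable; the call itself still needs PyCBC.
    from pycbc.workflow import setup_splittable_workflow

    # The return value is the list of split files described above.
    return setup_splittable_workflow(
        workflow, template_banks, out_dir=out_dir, tags=tags
    )
```

The returned list can then be passed to whichever later stage needs to run once per split file.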
Configuration file setup
Here we describe the configuration file options that this module needs.
[workflow-splittable] section
The configuration file must have a [workflow-splittable] section, which is used to tell the workflow how to construct the split output files. The first option to choose and provide is:
splittable-method = VALUE
The available choices are:
IN_WORKFLOW - The file splitting jobs will be added as jobs in the workflow, so the split files will be generated after the workflow is submitted.
NOOP - Do nothing and return the input file list. It is better not to call the module at all if you do not want to split files, but this can be useful if you want to use an existing script and do not need the splittable functionality.
When using IN_WORKFLOW the following additional option is needed:
splittable-num-banks = VALUE - Specifies how many parts to split each input file into.
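As a rough illustration of the two options above, the sketch below parses a [workflow-splittable] section with Python's standard `configparser` module and applies the rules stated in this section. The helper name `read_splittable_options` and the value 5 are made up for the example. The workflow reads these options through its own ConfigParser, so this is not the module's actual code.

```python
import configparser

# Example configuration text; the value 5 is arbitrary.
EXAMPLE_INI = """
[workflow-splittable]
; Run the splitting jobs as part of the workflow.
splittable-method = IN_WORKFLOW
; Split each input file into this many parts.
splittable-num-banks = 5
"""

def read_splittable_options(ini_text):
    """Return (method, num_banks) from a [workflow-splittable] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["workflow-splittable"]

    method = section["splittable-method"]
    if method not in ("IN_WORKFLOW", "NOOP"):
        raise ValueError("Unknown splittable-method: %s" % method)

    num_banks = None
    if method == "IN_WORKFLOW":
        # getint returns None when the option is absent.
        num_banks = section.getint("splittable-num-banks")
        if num_banks is None:
            raise ValueError("IN_WORKFLOW requires splittable-num-banks")
    return method, num_banks

method, num_banks = read_splittable_options(EXAMPLE_INI)
```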
[executables]
In this section, if not using NOOP, you need to supply the executable that will be used to split the input files. This is done in the [executables] section by adding something like:
splittable = /path/to/pycbc_splitbank
The option, in this case ‘splittable’, will be used to specify the constant command line options that are sent to all pycbc_splitbank jobs. These will need to be put in a section called [splittable] and the options themselves are discussed below. The tag ‘splittable’ cannot be changed currently.
FIXME: Tag support is not yet present in splittable; the following is currently untrue, but should be fixed. As with other modules, tagged sub-sections [splittable-TAG] and [workflow-splittable-TAG] are supported if this module needs to be run in different configurations.
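To make the relationship between [executables] and [splittable] more concrete, here is a minimal sketch that reads both sections with the standard `configparser` module and assembles the constant part of a command line. It assumes that each option in [splittable] becomes `--name value`, and that an option with an empty value becomes a bare flag. That mapping is an assumption of this example, not a statement about the workflow's internals. The helper name `constant_arguments` is hypothetical; `random-sort` is one of the flags listed in the pycbc_splitbank help message.

```python
import configparser

EXAMPLE_INI = """
[executables]
splittable = /path/to/pycbc_splitbank

[splittable]
; Assumed here to be passed to every job as a bare flag.
random-sort =
"""

def constant_arguments(ini_text):
    """Return the executable path followed by its constant options."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)

    # The option name 'splittable' in [executables] is also the name
    # of the section that holds the constant command line options.
    arguments = [parser.get("executables", "splittable")]
    for name, value in parser.items("splittable"):
        arguments.append("--" + name)
        if value:
            arguments.append(value)
    return arguments

arguments = constant_arguments(EXAMPLE_INI)
```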
Supported splittable executables and instructions for using them
The following splittable executables are currently supported:
pycbc_splitbank
lalapps_splitbank - NOTE: The output of this code can be unpredictable, or broken. We strongly recommend using pycbc_splitbank. For this reason we do not give any further details about running this code.
Adding a new executable is not too hard; please ask a developer for some pointers if you want to add a new code.
pycbc_splitbank
pycbc_splitbank is a pycbc python code that can be used for splitting any table in an input xml file. Normally this splits the sngl_inspiral table that holds the template bank. The help message for pycbc_splitbank is as follows:
    $ pycbc_splitbank --help
    usage: pycbc_splitbank [-h] [--version]
                           (--templates-per-bank SAMPLES | -n N | -O [OUTPUT_FILENAME [OUTPUT_FILENAME ...]])
                           [-o OUTPUT_PREFIX] [-V] -t INPUT_FILE
                           [--sort-frequency-cutoff SORT_FREQUENCY_CUTOFF]
                           [--sort-mchirp] [--random-sort]
                           [--random-seed RANDOM_SEED]

    Splits a table in an xml file into multiple pieces.

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --templates-per-bank SAMPLES
                            number of templates in the output banks
      -n N, --number-of-banks N
                            Split template bank into N files
      -O [OUTPUT_FILENAME [OUTPUT_FILENAME ...]], --output-filenames [OUTPUT_FILENAME [OUTPUT_FILENAME ...]]
                            Directly specify the names of the output files. The
                            number of files specified here will dictate how to
                            split the bank. It will be split equally between all
                            specified files.
      -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                            Prefix to add to the template bank name (name becomes
                            output#.xml[.gz])
      -V, --verbose         Print extra debugging information
      -t INPUT_FILE, --bank-file INPUT_FILE
                            Template bank to split
      --sort-frequency-cutoff SORT_FREQUENCY_CUTOFF
                            Frequency cutoff to use for sorting the sub banks
      --sort-mchirp         Sort templates by chirp mass before splitting
      --random-sort         Sort templates randomly before splitting
      --random-seed RANDOM_SEED
                            Random seed to use when sorting randomly
An example of a pycbc_splitbank call is given below:
    /home/spxiwh/lscsoft_git/executables_master/bin/pycbc_splitbank \
        --random-sort \
        --bank-file /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK-961585551-2048.xml.gz \
        --output-filenames \
            /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK0-961585551-2048.xml.gz \
            /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK1-961585551-2048.xml.gz \
            /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK2-961585551-2048.xml.gz \
            /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK3-961585551-2048.xml.gz \
            /home/spxiwh/lscsoft_git/src/pycbc/examples/ahope/weekly_ahope/961585543-961671944/datafind/H1-TMPLTBANK_SPLITTABLE_BANK4-961585551-2048.xml.gz
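The structure of the example call can be sketched in a few lines of Python. The output naming rule used here (appending `_SPLITTABLE_BANK<i>` to the second dash-separated field of the input file name) is inferred only from the file names in the example above, and is not taken from the workflow code. The function names `split_output_names` and `splitbank_command` are hypothetical.

```python
import os

def split_output_names(bank_path, num_banks):
    """Guess output names following the pattern seen in the example call."""
    directory, name = os.path.split(bank_path)
    # File names look like IFO-DESCRIPTION-START-DURATION.ext
    ifo, description, rest = name.split("-", 2)
    return [
        os.path.join(
            directory,
            "%s-%s_SPLITTABLE_BANK%d-%s" % (ifo, description, i, rest),
        )
        for i in range(num_banks)
    ]

def splitbank_command(executable, bank_path, num_banks, constant_args=()):
    """Assemble an argument list shaped like the example call."""
    command = [executable]
    command.extend(constant_args)
    command.extend(["--bank-file", bank_path])
    command.append("--output-filenames")
    command.extend(split_output_names(bank_path, num_banks))
    return command

command = splitbank_command(
    "pycbc_splitbank",
    "H1-TMPLTBANK-961585551-2048.xml.gz",
    5,
    constant_args=["--random-sort"],
)
```

Joining `command` with spaces gives a call with the same shape as the example, with five output files named BANK0 to BANK4.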
The following options are added by the workflow module and must not be provided in the configuration file:
--bank-file
--output-filenames
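A configuration file can be checked for these reserved options before a workflow is generated. The sketch below is one possible check using the standard `configparser` module. The function name `reserved_options_present` is hypothetical, and it assumes the constant options live in a [splittable] section, as described in the [executables] section.

```python
import configparser

# Options that the workflow module adds itself.
RESERVED_OPTIONS = ("bank-file", "output-filenames")

def reserved_options_present(parser, section="splittable"):
    """Return the reserved options that were set in ``section``."""
    if not parser.has_section(section):
        return []
    return [
        option for option in RESERVED_OPTIONS
        if parser.has_option(section, option)
    ]

example = configparser.ConfigParser()
example.read_string("[splittable]\nrandom-sort =\nbank-file = bank.xml\n")
problems = reserved_options_present(example)
if problems:
    print("Remove these options from [splittable]:", ", ".join(problems))
```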
pycbc.workflow.splittable Module
This is the complete documentation of this module’s code.
This module is responsible for setting up the splitting output files stage of workflows. For details about this module and its capabilities see here: https://ldas-jobs.ligo.caltech.edu/~cbc/docs/pycbc/NOTYETCREATED.html
- pycbc.workflow.splittable.select_splitfilejob_instance(curr_exe)
This function returns an instance of the class that is appropriate for splitting an output file up within a workflow (e.g. splitbank).
- Parameters
curr_exe (string) – The name of the Executable that is being used.
curr_section (string) – The name of the section storing options for this executable.
- Returns
exe class – The class that holds the utility functions appropriate for the given Executable. This class must contain exe_class.create_job(), and the job returned by this must contain job.create_node().
- Return type
sub-class of pycbc.workflow.core.Executable
- pycbc.workflow.splittable.setup_splittable_dax_generated(workflow, input_tables, out_dir, tags)
Function for setting up the splitting jobs as part of the workflow.
- Parameters
workflow (pycbc.workflow.core.Workflow) – The Workflow instance that the jobs will be added to.
input_tables (pycbc.workflow.core.FileList) – The input files to be split up.
out_dir (path) – The directory in which output will be written.
- Returns
split_table_outs – The list of split up files as output from this job.
- Return type
pycbc.workflow.core.FileList
- pycbc.workflow.splittable.setup_splittable_workflow(workflow, input_tables, out_dir=None, tags=None)
This function aims to be the gateway for code that is responsible for taking some input file containing some table, and splitting it into multiple files containing different parts of that table. For now the only supported operation is splitting a template bank xml file into multiple template bank xml files with a splitbank executable (pycbc_splitbank or lalapps_splitbank).
- Parameters
workflow (pycbc.workflow.core.Workflow) – The Workflow instance that the jobs will be added to.
input_tables (pycbc.workflow.core.FileList) – The input files to be split up.
out_dir (path) – The directory in which output will be written.
- Returns
split_table_outs – The list of split up files as output from this job.
- Return type
pycbc.workflow.core.FileList