Dmytro Karpenko

Atlas Grid Workload on NDGF resources: analysis and modeling


Workload generator

The workload generator is a Python script that generates jobs and files according to the developed model. It is run from the command line:

python generator.py -n <num> [-s|-e|-d] [-f <base_filename>] [-p]

The generator takes about 20 seconds and about 380 MB of memory to generate 50,000 jobs. It is distributed under the Apache 2.0 license (included in the archive).

Mandatory arguments:

-n<num>, -n <num>, --number_of_jobs <num>, --number_of_jobs=<num>

The number of jobs in the generated workload. The corresponding number of files is generated automatically for this number of jobs.

Optional arguments:

Optional arguments can be used to change the default parameters of the generator.

-s, --swf

Output the generated workload in the Standard Workload Format (SWF). Files are omitted because SWF does not support them; only jobs are generated. Workloads in SWF can be read directly by the GridSim simulator.

-e, --extendedswf

Output the generated workload in a custom extended SWF that also includes the files. Workloads in this extended SWF can be used with GridSim's DataGrid extension.

-d, --dax

Output the generated workload in DAX format. Job interarrival times are omitted because DAX does not support them. Workloads in DAX format can be read directly by the SimGrid simulator.

-f<base_filename>, -f <base_filename>, --filename <base_filename>, --filename=<base_filename>

The base filename to use for the output file(s). It replaces the default filenames listed in the "Output" section below. Suffixes are appended to the specified base name, so the switches -s -f model1 produce an output file named model1.swf.txt.

-p, --poisson

Generate interarrival times according to a Poisson process. The interarrival times then follow the standard exponential distribution instead of the model's CDF. See the section "Poisson process for interarrival times" below for details.
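The option spellings listed above can be illustrated with a small parser. This is only a sketch built with Python's argparse module; how generator.py actually parses its arguments is not documented here, and the function name build_parser and the mutual exclusion of -s, -e and -d (inferred from the usage line) are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented options of generator.py."""
    parser = argparse.ArgumentParser(prog="generator.py")
    parser.add_argument("-n", "--number_of_jobs", type=int, required=True,
                        help="number of jobs in the generated workload")
    # The usage line shows [-s|-e|-d], so the formats are treated as exclusive
    # here; this is an assumption about the real script.
    fmt = parser.add_mutually_exclusive_group()
    fmt.add_argument("-s", "--swf", action="store_true")
    fmt.add_argument("-e", "--extendedswf", action="store_true")
    fmt.add_argument("-d", "--dax", action="store_true")
    parser.add_argument("-f", "--filename", default=None)
    parser.add_argument("-p", "--poisson", action="store_true")
    return parser

# All four documented spellings of the mandatory argument give the same value.
for argv in (["-n50000"], ["-n", "50000"],
             ["--number_of_jobs", "50000"], ["--number_of_jobs=50000"]):
    args = build_parser().parse_args(argv)
    assert args.number_of_jobs == 50000
```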

Output:

In the absence of optional arguments (default output format)

Produces 3 files:
  • input_files.mdl -- IDs and sizes (in MB) of input files
  • output_files.mdl -- IDs and sizes (in MB) of output files
  • jobs.gwf -- job descriptions in GWA format
In jobs.gwf, two additional fields are appended to each line: the indices of the job's input and output files. The output files field can be absent for some jobs, since the model allows the absence of output data.

In the presence of the -s switch

Produces the ndgf_atlas_model.swf.txt file with jobs described in SWF.

In the presence of the -e switch

Produces the ndgf_atlas_modes.swf.txt file with jobs described in SWF, extended with two additional fields: the indices of each job's input and output files. In addition, the files input_files.swf.txt and output_files.swf.txt are generated, containing the IDs and sizes (in MB) of input and output files respectively.

In the presence of the -d switch

Produces the ndgf_atlas_dax.xml file with jobs described in DAX format.

Download the generator

Generation algorithm description


This section provides an overview of the algorithm of the synthetic workload generation, to give an insight into how the workload model is used and to help understand the code.

The custom distribution functions (CDFs) of the model can be divided into 3 different groups.
  • Common -- CDFs that are used in generation of all jobs and files
    • Number of input files per job (IFN-CDF). The number of input files also defines the job category.
    • Size of output files (OFS-CDF)
    • Job walltime (JW-CDF)
    • Job interarrival time (JI-CDF)
  • Per job category -- CDFs for these parameters are computed separately for each job category
    • Probability to request an input file from a particular file category (IFR-CDF)
    • Number of output files per job (OFN-CDF)
  • Per input file category -- CDFs for these parameters are computed separately for each input file category. An input file category is defined by the minimum and maximum popularity its files can have
    • File popularity (IFP-CDF)
    • File size (IFS-CDF)
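Drawing a value from any of the distributions listed above can be done by inverse transform sampling. The sketch below shows the general technique for a discrete distribution given as cumulative probabilities; it is not taken from the generator's code, and the function name make_sampler and the numbers in the example are made up for illustration.

```python
import bisect
import random

def make_sampler(values, cumulative_probs):
    """Return a function that draws from a discrete distribution.

    values[i] is an outcome and cumulative_probs[i] is P(X <= values[i]);
    cumulative_probs must be non-decreasing and end at 1.0.
    """
    def sample(rng=random):
        u = rng.random()  # uniform in [0, 1)
        # Index of the first cumulative probability strictly greater than u.
        i = bisect.bisect_right(cumulative_probs, u)
        # The min() guards against a final value slightly below 1.0.
        return values[min(i, len(values) - 1)]
    return sample

# Toy example: 1 input file with probability 0.5, 2 with 0.3, 3 with 0.2.
draw_input_file_count = make_sampler([1, 2, 3], [0.5, 0.8, 1.0])
print(draw_input_file_count())
```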
The generation process is a loop that iterates N times, where N is the required number of jobs. Each iteration generates one new job according to the following routine.
  • Determine the number of input files for a job, and, therefore, the job category according to IFN-CDF.
  • Determine the categories for the input files the job needs according to IFR-CDF.
  • Check whether input files generated at previous iterations can be reused by the job. If needed, generate the required number of new files in the corresponding categories. Each newly generated input file is assigned two properties: a size (according to IFS-CDF) and a request limit counter, which is its maximum popularity (according to IFP-CDF). Assign input files from the needed categories to the job on a FIFO basis: the first available file(s) in the category are selected. The request limit counter of each assigned file is decreased by one; when it reaches zero, the file is removed from the category and is not used in assignments at later iterations.
  • Determine the number of output files for the job according to OFN-CDF.
  • Determine the size of each output file according to OFS-CDF.
  • Determine the job walltime and interarrival time according to JW-CDF and JI-CDF.
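The routine above can be sketched in a few dozen lines. This is a simplified illustration, not the generator's actual code: the distributions are tiny made-up tables, the category names "rare" and "hot" are hypothetical, and a job that needs several files from one category is assumed to receive distinct files.

```python
import random
from collections import Counter, deque

rng = random.Random(42)

def draw(table):
    """Draw one value from a toy discrete distribution {value: probability}."""
    values, weights = zip(*table.items())
    return rng.choices(values, weights=weights)[0]

# Toy stand-ins for the model's distributions; all numbers are made up.
IFN = {1: 0.7, 2: 0.3}                                # input files per job = job category
IFR = {1: {"rare": 0.8, "hot": 0.2},                  # file category, per job category
       2: {"rare": 0.5, "hot": 0.5}}
OFN = {1: {0: 0.2, 1: 0.8}, 2: {1: 0.5, 2: 0.5}}      # output files, per job category
IFP = {"rare": {1: 0.6, 2: 0.4}, "hot": {5: 0.5, 10: 0.5}}   # request limit
IFS = {"rare": {100: 1.0}, "hot": {500: 0.5, 1000: 0.5}}     # input size, MB
OFS = {10: 0.5, 50: 0.5}                              # output size, MB
JW = {600: 0.5, 3600: 0.5}                            # walltime, seconds
JI = {0: 0.4, 2: 0.4, 30: 0.2}                        # interarrival time, seconds

def generate(n_jobs):
    pools = {cat: deque() for cat in IFP}   # reusable files, FIFO per category
    files, jobs = [], []
    for _ in range(n_jobs):
        category = draw(IFN)
        needed = Counter(draw(IFR[category]) for _ in range(category))
        inputs = []
        for fcat, count in needed.items():
            pool = pools[fcat]
            while len(pool) < count:        # too few reusable files: create new ones
                limit = draw(IFP[fcat])
                files.append({"id": len(files), "size": draw(IFS[fcat]),
                              "limit": limit, "left": limit})
                pool.append(files[-1])
            chosen = [pool.popleft() for _ in range(count)]   # first available files
            for f in chosen:
                f["left"] -= 1
                inputs.append(f["id"])
            # Files with requests left return to the front in their old order;
            # exhausted files are dropped from the category.
            pool.extendleft(f for f in reversed(chosen) if f["left"] > 0)
        outputs = [draw(OFS) for _ in range(draw(OFN[category]))]
        jobs.append({"inputs": inputs, "output_sizes": outputs,
                     "walltime": draw(JW), "interarrival": draw(JI)})
    return jobs, files

jobs, files = generate(1000)
```

Keeping one queue per file category makes the FIFO selection and the removal of exhausted files cheap, which matters when tens of thousands of jobs are generated.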

Poisson process for interarrival times


The jobs of the ATLAS-NDGF workload are submitted in a controlled fashion by a single automatic tool. This makes the distribution of the interarrival times very different from what is usually observed in large data processing systems, where many jobs are submitted at random times by many users. In the latter case the arrivals tend to follow a Poisson process and the interarrival times have the standard exponential distribution. The -p switch of the generator lets the researcher generate interarrival times according to a Poisson process, and thus obtain a less specific and more general synthetic workload.

The interarrival times are then drawn from the standard exponential distribution rather than from the model's CDF for the interarrival times. The Kolmogorov-Smirnov distance between the CDF and the standard exponential distribution, together with a comparison of various distribution parameters, is shown below.

Parameter                      Real sample       Standard exponential
Kolmogorov-Smirnov distance    0.59345148987472762 (between the two distributions)
Mean                           9.28621562086     0.582226598538
Median                         2.0               0.0
Standard deviation             79.4242493684     0.959262055369
Skewness                       74.9691827561     2.24708284251
Kurtosis                       14251.0163082     6.96798151636
Root Mean Square (Spread)      79.9652748907     1.12212811343

See also a comparison histogram for both distributions.
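For readers who want to experiment with this comparison, the sketch below draws exponential interarrival times and computes the Kolmogorov-Smirnov distance between a sample and the exponential CDF. It uses only the Python standard library; the function names are made up, and it does not attempt to reproduce the figures in the table, since the exact procedure used to obtain them is not described here.

```python
import math
import random

def poisson_interarrivals(n, rate=1.0, seed=None):
    """Draw n interarrival times of a Poisson process (exponential, given rate)."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]

def ks_distance_to_exponential(sample, rate=1.0):
    """Largest gap between the sample's empirical CDF and 1 - exp(-rate * x)."""
    xs = sorted(sample)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - math.exp(-rate * x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, abs((i + 1) / n - model), abs(model - i / n))
    return distance

times = poisson_interarrivals(5000, seed=1)
print(ks_distance_to_exponential(times))   # small for a truly exponential sample
```

A sample that really is exponential gives a distance close to zero, while a sample dominated by a scheduling tool's regular submission pattern gives a large one.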

GridSim integration


We adapted the generator for use with the GridSim grid simulator. Output in a format consumable by GridSim is enabled with the -s or -e switch of the generator.

-s switch

GridSim is compatible with the Standard Workload Format (SWF) and can generate workloads by reading job descriptions in SWF from files. The -s switch makes the generator output the generated jobs in SWF, so GridSim can pick up the generated workload and use it without much preparatory work. GridSim has two classes that can read SWF files and generate the corresponding workloads: the Workload class from GridSim's util package and WorkloadFileReader from GridSim's parallel/util package. Researchers can choose whichever suits their goals best and simply pass the name of the generated file to an instance of the chosen class. An example of using the Workload class is given in GridSim's distribution (see the examples/WorkloadTrace directory).

Note, however, that SWF is a purely computational format: it cannot describe input or output data. This switch is therefore suitable only if the researcher wants to simulate a realistic grid or parallel workload without data transfers in the infrastructure.
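To give an idea of what a job looks like in SWF, the sketch below formats jobs as standard 18-field SWF records, with -1 marking fields that carry no information. Which fields the generator actually fills in is not documented here, so the choice of fields, the single processor per job and the function name to_swf_lines are assumptions. Since SWF stores submit times rather than interarrival times, the latter are accumulated.

```python
from itertools import accumulate

SWF_FIELDS = 18  # a standard SWF record has 18 whitespace-separated fields

def to_swf_lines(walltimes, interarrivals, processors=1):
    """Format jobs as SWF records; fields that are not modelled are set to -1."""
    submit_times = accumulate(interarrivals)   # running sum of interarrival times
    lines = []
    for job_id, (submit, run) in enumerate(zip(submit_times, walltimes), start=1):
        record = [-1] * SWF_FIELDS
        record[0] = job_id        # field 1: job number
        record[1] = submit        # field 2: submit time, seconds
        record[3] = run           # field 4: run time, seconds
        record[4] = processors    # field 5: number of allocated processors
        record[7] = processors    # field 8: requested number of processors
        record[8] = run           # field 9: requested time, seconds
        record[10] = 1            # field 11: status, 1 = completed
        lines.append(" ".join(str(v) for v in record))
    return lines

for line in to_swf_lines(walltimes=[600, 3600], interarrivals=[0, 30]):
    print(line)
```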

-e switch

GridSim has an extension called datagrid that adds data grid simulation capabilities to the simulator. With the classes provided by this extension, it is possible to simulate jobs that not only use computing power but also perform data transfers. The -e switch makes the generator output the jobs in SWF, as with the -s switch, and also generate the input and output files, as when no optional switches are given. The IDs of the input and output files are added as two additional fields to the jobs' SWF file. Using all 3 generated files, it is possible to construct the files and the jobs that transfer them, and to combine them into a single workload.

Unfortunately, GridSim's datagrid does not yet have an equivalent of the Workload class that can read traces from a file and generate the corresponding workload, so it is up to the user to construct the workload from the files. We created an example of how this can be done. Based on datagrid example #3 and the Workload class from GridSim's distribution, we created the DataWorkload class, which reads the files produced by the generator and constructs a workload from them, and the NDGFDataExample class, which runs the simulation. Both classes can be downloaded here.

Note that GridSim's datagrid currently cannot simulate the upload of output files to remote resources; it can only simulate passing the output data back to the user. The output files are therefore not actually constructed; instead, the output size of each constructed job is set to the total size of all output files of that job. Transfers of input files, however, are fully and properly simulated.

SimGrid integration


We adapted the generator for use with the SimGrid grid simulator. Output in the DAX format, which SimGrid can consume, is enabled with the -d switch. The DAX format describes workflows with dependencies, in which jobs have parent-child relations, whereas our workload is a collection of separate independent jobs. The resulting output is therefore an extreme case of a workflow graph: all jobs are at the same level, have the same parent (the job source they all come from) and the same child (the endpoint where all output is sent), and have no dependencies on each other.

The DAX format is intended to describe complete workflows whose elements are all known before execution begins, so its schema has no place for job interarrival times; when the -d switch is used, the generator drops them. The generated file can be loaded by SimGrid directly with the SD_daxload function. An example of how this can be done is provided in the examples/simdag/dax directory of SimGrid's distribution.
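The flat workflow described above can be illustrated with a small DAX writer. This is a sketch, not the generator's actual output: the element and attribute names follow the Pegasus DAX schema as far as we recall it and should be checked against the version SimGrid's parser expects, and the job name, the file naming scheme and the use of bytes for sizes are assumptions.

```python
import xml.etree.ElementTree as ET

DAX_NS = "http://pegasus.isi.edu/schema/DAX"

def build_dax(jobs):
    """Build a DAX document of independent jobs (no <child>/<parent> elements).

    Each job is a dict with "walltime" (seconds) and "inputs" / "outputs",
    lists of (file_id, size_in_bytes) pairs.
    """
    adag = ET.Element("adag", {
        "xmlns": DAX_NS,
        "version": "2.1",
        "jobCount": str(len(jobs)),
        "childCount": "0",            # no job depends on another job
    })
    for index, job in enumerate(jobs):
        node = ET.SubElement(adag, "job", {
            "id": "ID%05d" % index,
            "name": "atlas_job",      # hypothetical job name
            "runtime": str(job["walltime"]),
        })
        for link, prefix in (("input", "in"), ("output", "out")):
            for file_id, size in job[link + "s"]:
                ET.SubElement(node, "uses", {
                    "file": "%s_%d" % (prefix, file_id),   # hypothetical file names
                    "link": link,
                    "size": str(size),
                })
    return ET.tostring(adag, encoding="unicode")

print(build_dax([
    {"walltime": 600, "inputs": [(0, 100000000)], "outputs": [(0, 10000000)]},
    {"walltime": 3600, "inputs": [(0, 100000000), (1, 500000000)], "outputs": []},
]))
```

Two jobs that read the same input file, as in the example, remain independent: a dependency would arise only if one job's output file were another job's input.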