About benchmarks and accounting publishing

After the changes to the accounting system in ARC 6.4.0 there were some issues related to missing benchmark values in the accounting records. Some of them are caused by a bug that unfortunately snuck into the codebase; however, sites can have issues with benchmarks for other reasons too.

This page aims to clarify how benchmarks are recorded and propagated, in what situations problems can occur, and how to fix them.

If you see the message “HEPSPEC 1.0 is being used” in jura.log, (some of) your job records are missing benchmark values.

Work through the questions below to identify your case.

Which version of ARC do you have?

Depending on the ARC version, several issues affect the processing of benchmark values:

  • ARC < 6.4.0:

    • a completely different accounting codebase is in use, so the information in this document does not apply. The general advice is to update to a recent version; bugs in the old codebase will not be fixed.
  • ARC < 6.5.0:

    • a bug in the handling of benchmark values in the publishing code
    • jobs on the HTCondor backend with a non-shared filesystem have missing benchmarks
    • APEL summaries query performance degrades as the amount of stored records grows
  • ARC < 6.8.0:

    • jobs on the HTCondor backend with a non-shared filesystem have missing benchmarks
    • APEL summaries query performance degrades as the amount of stored records grows
  • ARC >= 6.8.0:

    • all known benchmark-related issues are fixed
    • if you see “HEPSPEC 1.0 is being used”, there is a valid reason for it, which may be a misconfiguration
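
The version list above can be read as a small lookup. The following is only an illustrative sketch: the function names are made up for this example, and the issue texts are transcribed from the list, not taken from any ARC tool.

```python
# Known benchmark-related issues and the ARC version that fixes each of them,
# transcribed from the list above.
ISSUES_FIXED_IN = [
    ((6, 5, 0), "bug in handling benchmark values in the publishing code"),
    ((6, 8, 0), "HTCondor backend with non-shared filesystem has missing benchmarks"),
    ((6, 8, 0), "APEL summaries query slows down as stored records accumulate"),
]


def parse_version(text):
    """Turn '6.7.0' (or '6.7') into (6, 7, 0) so versions compare numerically."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '6.8'
    return tuple(parts[:3])


def known_issues(version_text):
    """Return the issues from the list above that apply to this ARC version."""
    version = parse_version(version_text)
    if version < (6, 4, 0):
        return ["old accounting codebase: this document does not apply, update ARC"]
    return [issue for fixed_in, issue in ISSUES_FIXED_IN if version < fixed_in]


for v in ("6.4.1", "6.7.0", "6.10.0"):
    print(v, known_issues(v))
```

Comparing tuples of integers rather than strings matters here: as plain text, "6.10.0" would sort before "6.8.0".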

Note

It is important to understand that benchmark values are part of the job accounting record. Benchmark data in the job accounting record is defined at job start time and stored when the job finishes.

If you have jobs that started before the ARC update or the configuration fix (depending on your case), you need to fix the already stored records manually. Changes are never applied retroactively to stored records.

What are the reasons for missing benchmark values in the job records?

There are several valid reasons why you may see the “HEPSPEC 1.0 is being used” message:

  1. The job was started when ARC was at version < 6.5
  2. The job was started when the [queue:name] block in arc.conf had no proper benchmark defined
  3. Permissions or other issues (including the HTCondor backend bug with non-shared filesystem in ARC < 6.8.0) prevent the writing of .diag files on the worker nodes
  4. The job failed in the LRMS before the initial part of the jobscript wrapper was even executed (node failure, etc.)

The last case happens only rarely and nothing can be done about it, but such jobs have zero CPU time, so the benchmark is irrelevant for them.

Nevertheless, to eliminate the log message that annoys admins and to avoid an additional type of summary record during publishing, ARC 6.8.0 introduced the benchmark option in the [lrms] block, which is used as a fallback if the benchmark metric is missing in the job data.

Warning

Again: the benchmark option in the [lrms] block has no influence on already stored records. It is applied at storing time, NOT at publishing time.

How to fix missing benchmark values manually?

Already stored accounting records that have no benchmark value can be fixed by issuing an SQLite query that adds the value. The following example assumes the controldir is the default /var/spool/arc/jobstatus and uses HEPSPEC:12.1 as the example value; substitute your own:

[root ~]# sqlite3 /var/spool/arc/jobstatus/accounting/accounting.db "insert into JobExtraInfo
  (RecordID, InfoKey, InfoValue) select distinct RecordID, 'benchmark', 'HEPSPEC:12.1'
  from JobExtraInfo where RecordID not in
  (select RecordID from JobExtraInfo where InfoKey='benchmark');"

If you discover that some records use the default benchmark of HEPSPEC 1.0 instead of the desired benchmark value from arc.conf (e.g. because you added benchmark values after the jobs started), you can update the values as well:

[root ~]# sqlite3 /var/spool/arc/jobstatus/accounting/accounting.db "update JobExtraInfo
  set InfoValue = 'HEPSPEC:12.1' where InfoKey = 'benchmark' and InfoValue = 'HEPSPEC:1.0';"
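
To see what the two statements above do before touching the production database, they can be rehearsed on a toy database. This is a minimal sketch using Python's standard sqlite3 module. The toy schema contains only the tables and columns that appear in the queries on this page, and the 'jobname' key and the job IDs are hypothetical sample data.

```python
import sqlite3

# Toy stand-in for accounting.db: only the tables and columns that the queries
# on this page mention. The real schema contains more of both.
db = sqlite3.connect(":memory:")
db.executescript("""
    create table AAR (RecordID integer primary key, JobID text);
    create table JobExtraInfo (RecordID integer, InfoKey text, InfoValue text);
    insert into AAR values (1, 'job-a'), (2, 'job-b'), (3, 'job-c');
    insert into JobExtraInfo values
        (1, 'jobname', 'a'), (1, 'benchmark', 'HEPSPEC:12.1'),
        (2, 'jobname', 'b'), (2, 'benchmark', 'HEPSPEC:1.0'),
        (3, 'jobname', 'c');
""")


def benchmarks(conn):
    """Map RecordID -> stored benchmark value for records that have one."""
    rows = conn.execute(
        "select RecordID, InfoValue from JobExtraInfo where InfoKey='benchmark'")
    return dict(rows)


before = benchmarks(db)

# The same two statements as in the sqlite3 commands above.
db.execute("""insert into JobExtraInfo (RecordID, InfoKey, InfoValue)
    select distinct RecordID, 'benchmark', 'HEPSPEC:12.1'
    from JobExtraInfo where RecordID not in
    (select RecordID from JobExtraInfo where InfoKey='benchmark')""")
db.execute("""update JobExtraInfo set InfoValue = 'HEPSPEC:12.1'
    where InfoKey = 'benchmark' and InfoValue = 'HEPSPEC:1.0'""")
db.commit()

after = benchmarks(db)
print("before:", before)
print("after: ", after)
```

In this toy run record 3 gains a benchmark row and record 2 is changed from HEPSPEC:1.0 to HEPSPEC:12.1. Note that the insert statement picks its candidates from JobExtraInfo itself, so a record with no JobExtraInfo rows at all would not be touched by it. Rehearsing on a copy of accounting.db is a cheap way to confirm the effect on real data.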

What should I know to avoid running into benchmark issues?

To understand why the “HEPSPEC 1.0 is being used” message appears in jura.log, keep three points in mind:

  1. JURA is only the publisher and it sends the data about the jobs stored in the local ARC accounting database. NO values from arc.conf (apart from where to publish records) are used during publishing.
  2. Info about benchmarks is part of the job accounting data stored in the ARC local accounting database when the job is in the finishing state. Moreover, the static data, including the benchmark defined in arc.conf, is fixed during jobscript generation (job start time). Any update to arc.conf AFTER the job start HAS NO EFFECT on already stored records.
  3. When publishing to APEL, the default method is APEL summaries. This means that jura sends (updates) the total counters for the last 2 months of data, aggregated per VO, DN, Endpoint (including queue) and Benchmark! CONSEQUENTLY, if any single job within the 2-month timeframe is missing the benchmark data, the warning about using HEPSPEC 1.0 will be there!
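
The third point can be illustrated with a toy aggregation. This is a sketch of the grouping idea only, not JURA's actual code: the record fields, the sample jobs and the summarize function are hypothetical, and the fallback value stands for the HEPSPEC 1.0 default mentioned above.

```python
from collections import defaultdict

FALLBACK = "HEPSPEC:1.0"  # used for records that carry no benchmark data

# Hypothetical job records that fall inside the summary time window.
jobs = [
    {"vo": "atlas", "dn": "/CN=alice", "endpoint": "ce.example.org:main",
     "benchmark": "HEPSPEC:12.1", "cputime": 3600},
    {"vo": "atlas", "dn": "/CN=alice", "endpoint": "ce.example.org:main",
     "benchmark": "HEPSPEC:12.1", "cputime": 1800},
    # A job that failed before any accounting data was written.
    {"vo": "atlas", "dn": "/CN=alice", "endpoint": "ce.example.org:main",
     "benchmark": None, "cputime": 0},
]


def summarize(records):
    """Aggregate counters per (VO, DN, endpoint, benchmark)."""
    totals = defaultdict(lambda: {"jobs": 0, "cputime": 0})
    for rec in records:
        benchmark = rec["benchmark"] or FALLBACK
        key = (rec["vo"], rec["dn"], rec["endpoint"], benchmark)
        totals[key]["jobs"] += 1
        totals[key]["cputime"] += rec["cputime"]
    return dict(totals)


summary = summarize(jobs)
for key, counters in summary.items():
    print(key, counters)
```

Because the benchmark is part of the grouping key, the single record without a benchmark ends up in a summary of its own, which is why one such job is enough to trigger the message for as long as it stays inside the window.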

Warning

For ARC < 6.8.0 the APEL summary query includes grouping by benchmark, which was out of scope of the initial ARC accounting database design. The extra table join is harmful to performance on heavily loaded sites! The recommended mitigation to save ARC CE CPU cycles is to go back to publishing individual usage records with the apel_messages = urs option.

In ARC 6.8.0 the APEL summary query was improved and the performance hit is no longer significant. You can use summaries on heavily loaded sites as well.

For troubleshooting, it is also important to understand the chain of benchmark propagation:

  1. The value of benchmark defined in the [queue:name] block in arc.conf is written as is to the .diag file on the frontend (controldir).
  2. The .diag file from the control directory is copied next to the job’s session directory and either shared with the worker node (shared sessiondir case) or moved by the LRMS. See more details about shared vs non-shared sessiondir in the Job scratch area document.
  3. During job execution the jobscript writes data to .diag on the worker node. This includes the benchmark, which can be redefined at runtime (e.g. by RunTime Environments in ARC6).
  4. After job completion the .diag from the worker node is moved to the frontend’s session directory if the sessiondir is not shared.
  5. On the frontend the .diag from the session directory is merged with the .diag in the control directory and more information from the LRMS accounting is added to it.
  6. A-REX parses the .diag in the control directory and stores the data in the database. Starting from ARC 6.8.0, at this stage the default fallback benchmark from arc.conf is added to the data if it is missing in the .diag.
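
The last step of the chain can be sketched as follows. This is not A-REX code: the key=value layout, the key name "benchmark", the other sample keys and both function names are assumptions made for the illustration. Only the rule itself (use the value from the .diag, otherwise the [lrms] fallback if one is configured) comes from the description above.

```python
def parse_diag(text):
    """Parse key=value lines (assumed layout) into a dict.

    If a key repeats, the last occurrence wins; this is an assumption of the
    sketch, not a documented property of .diag files.
    """
    data = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            data[key.strip()] = value.strip()
    return data


def benchmark_to_store(final_diag_text, lrms_fallback=None):
    """Pick the benchmark for the accounting record.

    lrms_fallback stands for the benchmark option of the [lrms] block
    (ARC >= 6.8.0); None models older versions or an unset option.
    """
    data = parse_diag(final_diag_text)
    return data.get("benchmark") or lrms_fallback


complete = "nodename=wn01\nbenchmark=HEPSPEC:12.1\n"
incomplete = "nodename=wn01\n"  # e.g. the worker node data never came back

print(benchmark_to_store(complete))                      # value from the .diag
print(benchmark_to_store(incomplete))                    # nothing to store
print(benchmark_to_store(incomplete, "HEPSPEC:12.1"))    # fallback is used
```

The middle case is the one that later shows up as “HEPSPEC 1.0 is being used” at publishing time, since nothing was stored with the record.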

So, should I do something if I see the “HEPSPEC 1.0 is being used” message?

If this is a rare single job that just failed in the LRMS before writing the accounting data, there is nothing to worry about.

But if it annoys you, you can fix even a single job's data manually as described above. Or, starting from ARC 6.8.0, you can define the fallback benchmark to avoid it completely.

To identify how many jobs are missing benchmark data in the database, run the following query:

[root ~]# sqlite3 /var/spool/arc/jobstatus/accounting/accounting.db "select JobID from AAR
  where RecordID not in ( select RecordID from JobExtraInfo where InfoKey='benchmark');"

This returns the list of job IDs with missing benchmark data. Then you can use:

[root ~]# arcctl accounting job info <JobID>

to find out what those jobs are.

If there are many, something is definitely wrong and you should:
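
To judge whether the affected jobs are "rare" or "many", it helps to relate the result of the query above to the total number of stored records. A minimal sketch with Python's standard sqlite3 module follows. The function name and the toy data are made up for the example, and the toy schema again contains only the tables and columns used by the queries on this page.

```python
import sqlite3

# The same query as in the sqlite3 command above.
MISSING_QUERY = """select JobID from AAR where RecordID not in
    (select RecordID from JobExtraInfo where InfoKey='benchmark')"""


def missing_benchmark_report(conn):
    """Return (job IDs without benchmark data, total number of records)."""
    missing = [job_id for (job_id,) in conn.execute(MISSING_QUERY)]
    (total,) = conn.execute("select count(*) from AAR").fetchone()
    return missing, total


# Toy database with four records, one of them without a benchmark.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    create table AAR (RecordID integer primary key, JobID text);
    create table JobExtraInfo (RecordID integer, InfoKey text, InfoValue text);
    insert into AAR values (1, 'job-a'), (2, 'job-b'), (3, 'job-c'), (4, 'job-d');
    insert into JobExtraInfo values
        (1, 'benchmark', 'HEPSPEC:12.1'),
        (2, 'benchmark', 'HEPSPEC:12.1'),
        (3, 'benchmark', 'HEPSPEC:12.1');
""")

missing, total = missing_benchmark_report(demo)
print(f"{len(missing)} of {total} records lack a benchmark: {missing}")
```

Against the real database the connection could be opened read-only, for example with sqlite3.connect("file:/var/spool/arc/jobstatus/accounting/accounting.db?mode=ro", uri=True), assuming the default controldir.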

  1. Check if you are facing the known issues if you are running ARC < 6.8.0. An ARC update plus a manual fix of the records will solve your problem in this case.
  2. Check the arc.conf syntax with respect to the benchmark. It should be defined in the [queue:name] block and use either HEPSPEC or si2k. A manual fix of the already stored records is needed anyway.
  3. Check that the .diag file contains information, use arcctl to check the stored data, and check the A-REX logs for any hints.
  4. Open a Bugzilla ticket if nothing helps.