Bug #2192

ctbin segfaults on some machines when running in parallel

Added by Cardenzana Josh over 6 years ago. Updated over 6 years ago.

Status: Closed
Start date: 09/11/2017
Priority: Normal
Due date:
Assigned To: Cardenzana Josh
% Done: 100%
Category: -
Target version: 1.5.0
Duration:

Description

Running ctbin in parallel causes a segfault on some machines. Setting the number of OpenMP threads to 1 (i.e. setting OMP_NUM_THREADS=1 before running the analysis) seems to resolve the issue. The crash usually happens while reading one of the data files. I’m not sure whether it fails while loading the data or when accessing the data after it’s been loaded.


Recurrence

No recurrence.

History

#1 Updated by Cardenzana Josh over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 50

Running the code through valgrind and looking at the stack trace showed that the segmentation fault was consistently occurring in the GFits classes (GFits, GFitsHeader, GFitsHeaderCard, etc.), and particularly in the cfitsio portions of the code. After double checking several portions of ctbin and the various GFits classes, I did a quick search for “cfitsio openmp”, which immediately brought up this helpful link: https://heasarc.gsfc.nasa.gov/fitsio/c/c_user/node15.html

The key point is the statement “When used in a multi-threaded environment, the CFITSIO library must be built using the -D_REENTRANT compiler directive”. This allows reading from and writing to multiple FITS files simultaneously. The solution is to configure cfitsio with the extra flag '--enable-reentrant'. Since doing that I haven’t had any issues running in parallel.

Going forward, there are two solutions that could be applied to fix this:
  1. Rewrite ctbin (and any other tools/classes that parallelize over multiple FITS files) so that they operate only on a single FITS file.
  2. Add a check in gammalib that makes sure cfitsio was compiled with the above flag whenever gammalib itself is compiled with OpenMP support.

I would opt for option 2 as that would be the easiest to implement.
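A minimal sketch of what the check in option 2 could look like. The helper names are hypothetical and not part of gammalib. The reentrancy flag is passed in as a plain parameter so that the sketch compiles without cfitsio installed. In real code that flag would presumably come from cfitsio itself, whose user guide describes a fits_is_reentrant() routine on the page linked above.

```cpp
#include <stdexcept>
#include <string>

// True when this translation unit was compiled with OpenMP support.
// _OPENMP is the macro an OpenMP-enabled compiler defines.
inline bool built_with_openmp()
{
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Hypothetical helper (not a gammalib function): refuse to continue when
// OpenMP is enabled but the FITS library is not thread safe.
// cfitsio_reentrant is assumed to be supplied by the FITS layer.
inline void require_thread_safe_fits(bool openmp_enabled, bool cfitsio_reentrant)
{
    if (openmp_enabled && !cfitsio_reentrant) {
        throw std::runtime_error(
            std::string("OpenMP support is enabled but cfitsio was not built ") +
            "with -D_REENTRANT; reconfigure cfitsio with --enable-reentrant "
            "or run with OMP_NUM_THREADS=1.");
    }
}
```

A caller would invoke require_thread_safe_fits(built_with_openmp(), reentrant_flag) once at start-up, before any parallel region opens a FITS file.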

#2 Updated by Cardenzana Josh over 6 years ago

  • Status changed from Resolved to In Progress

Sorry, this is not resolved yet and still needs an actual fix.

#3 Updated by Cardenzana Josh over 6 years ago

  • Status changed from In Progress to Pull request

The ultimate fix for this has been implemented. The root cause of the segfault appears to be a file check that sits just before the critical region in GCTAEventList::fetch(). This check is not actually necessary, since the same check should be done when loading the file (i.e. inside the critical section). Changing this resolves the segmentation fault without the need to compile cfitsio in a special way.
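The pattern can be illustrated with a small sketch. This is not the actual GCTAEventList code: the class, its members, and the name of the critical region are hypothetical, and a plain std::ifstream stands in for the FITS layer. The point it shows is that every file access, including the existence check, happens inside the critical region, and that any error is recorded there and only thrown after the region has been left.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an event list that loads its file on first use.
class LazyEventList {
public:
    explicit LazyEventList(const std::string& filename)
        : m_filename(filename), m_loaded(false) {}

    // Load the file on first use. No file access happens outside the
    // critical region, so only one thread at a time touches the file layer.
    void fetch()
    {
        std::string error;

        #pragma omp critical(lazy_event_list_fetch)
        {
            if (!m_loaded) {
                // The file check is part of the load step itself.
                std::ifstream file(m_filename);
                if (file.is_open()) {
                    m_loaded = true;
                } else {
                    error = "Unable to open file: " + m_filename;
                }
            }
        }

        // Exceptions must not escape an OpenMP structured block, so the
        // error is raised only after the critical region has ended.
        if (!error.empty()) {
            throw std::runtime_error(error);
        }
    }

    bool is_loaded() const { return m_loaded; }

private:
    std::string m_filename;
    bool        m_loaded;
};
```

Without OpenMP enabled the pragma is ignored and the code simply runs serially, so the sketch behaves the same way in both builds.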

Pull Branch:
Gammalib: Josh Cardenzana / gammalib: 2192-ctbin_segfault

#4 Updated by Knödlseder Jürgen over 6 years ago

  • Status changed from Pull request to Closed
  • Assigned To set to Cardenzana Josh
  • Target version set to 1.5.0
  • % Done changed from 50 to 100

Merged into devel
