GCN, nvptx: Errors during device probing are fatal

Currently, we silently disable libgomp GCN and nvptx plugins/devices in presence of certain error conditions during device probing, thus typically silently resorting to host-fallback execution. Make such errors fatal, similar as for any other device access later on, so that we early and reliably notice when things go wrong. (Keep just two cases non-fatal: (a) libgomp GCN or nvptx plugins are available but 'libhsa-runtime64.so.1' or 'libcuda.so.1' are not, and (b) those are available, but the corresponding devices are not.) This resolves the issue that we've got execution test cases unexpectedly PASSing, despite: libgomp: GCN fatal error: Run-time could not be initialized Runtime message: HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. ..., and therefore they were not offloaded to the GCN device, but ran in host-fallback execution mode. What happend in that scenario is that in 'init_hsa_context' during the initial 'GOMP_OFFLOAD_get_num_devices' we ran into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', but it wasn't fatal, but just silently disabled the libgomp plugin/device. Especially "entertaining" were cases where such unintended host-fallback execution happened during effective-target checks like 'offload_device_available' (host-fallback execution there meaning: no offload device available), but actual test cases then were running with an offload device available, and therefore mis-configured. include/ * cuda/cuda.h (CUresult): Add 'CUDA_ERROR_NO_DEVICE'. libgomp/ * plugin/plugin-gcn.c (init_hsa_context): Add and handle 'bool probe' parameter. Adjust all users; errors during device probing are fatal. * plugin/plugin-nvptx.c (nvptx_get_num_devices): Aside from 'CUDA_ERROR_NO_DEVICE', errors during device probing are fatal.
2024-03-07 14:42:07 +01:00 · 2024-03-07 14:42:07 +01:00 · a02d7f0edc
commit a02d7f0edc
parent 477c8a82f3
3 changed files with 12 additions and 7 deletions
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@ -57,6 +57,7 @@ typedef enum {
  CUDA_ERROR_OUT_OF_MEMORY = 2,
  CUDA_ERROR_NOT_INITIALIZED = 3,
  CUDA_ERROR_DEINITIALIZED = 4,
+  CUDA_ERROR_NO_DEVICE = 100,
  CUDA_ERROR_INVALID_CONTEXT = 201,
  CUDA_ERROR_INVALID_HANDLE = 400,
  CUDA_ERROR_NOT_FOUND = 500,
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@ -1513,10 +1513,12 @@ assign_agent_ids (hsa_agent_t agent, void *data)
 }

 /* Initialize hsa_context if it has not already been done.
-   Return TRUE on success.  */
+   If !PROBE: returns TRUE on success.
+   If PROBE: returns TRUE on success or if the plugin/device shall be silently
+   ignored, and otherwise emits an error and returns FALSE.  */

 static bool
-init_hsa_context (void)
+init_hsa_context (bool probe)
 {
  hsa_status_t status;
  int agent_index = 0;
@ -1531,7 +1533,7 @@ init_hsa_context (void)
 	GOMP_PLUGIN_fatal ("%s\n", msg);
      else
 	GCN_WARNING ("%s\n", msg);
-      return false;
+      return probe ? true : false;
    }
  status = hsa_fns.hsa_init_fn ();
  if (status != HSA_STATUS_SUCCESS)
@ -3337,8 +3339,8 @@ GOMP_OFFLOAD_version (void)
 int
 GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
-  if (!init_hsa_context ())
-    return 0;
+  if (!init_hsa_context (true))
+    exit (EXIT_FAILURE);
  /* Return -1 if no omp_requires_mask cannot be fulfilled but
     devices were present.  */
  if (hsa_context.agent_count > 0
@ -3355,7 +3357,7 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 bool
 GOMP_OFFLOAD_init_device (int n)
 {
-  if (!init_hsa_context ())
+  if (!init_hsa_context (false))
    return false;
  if (n >= hsa_context.agent_count)
    {
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@ -604,12 +604,14 @@ nvptx_get_num_devices (void)
      CUresult r = CUDA_CALL_NOCHECK (cuInit, 0);
      /* This is not an error: e.g. we may have CUDA libraries installed but
         no devices available.  */
-      if (r != CUDA_SUCCESS)
+      if (r == CUDA_ERROR_NO_DEVICE)
 	{
 	  GOMP_PLUGIN_debug (0, "Disabling nvptx offloading; cuInit: %s\n",
 			     cuda_error (r));
 	  return 0;
 	}
+      else if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuInit error: %s", cuda_error (r));
    }

  CUDA_CALL_ASSERT (cuDeviceGetCount, &n);