UnknownError: Failed to get convolution algorithm. with TensorFlow 2.4.0, CUDA 11.0, cuDNN 8.0.2 (Windows 10, Anaconda Python 3.8.5)

Python

cassea

Don’t worry, you can read this article by translating it with DeepL or something!!

The error

I got yelled at pretty hard. ↓

---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-11-bfc2e5485c93> in <module>
      7 generator.fit(x_train)
      8 
----> 9 history = model.fit_generator(generator.flow(x_train, t_train, batch_size=batch_size),
     10                               epochs=epochs,
     11                               validation_data=(x_test, t_test))

~\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   1845                   'will be removed in a future version. '
   1846                   'Please use `Model.fit`, which supports generators.')
-> 1847     return self.fit(
   1848         generator,
   1849         steps_per_epoch=steps_per_epoch,

~\anaconda3\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1098                 _r=1):
   1099               callbacks.on_train_batch_begin(step)
-> 1100               tmp_logs = self.train_function(iterator)
   1101               if data_handler.should_sync:
   1102                 context.async_wait()

~\anaconda3\lib\site-packages\tensorflow\python\eager\def_function.py in __call__(self, *args, **kwds)
    826     tracing_count = self.experimental_get_tracing_count()
    827     with trace.Trace(self._name) as tm:
--> 828       result = self._call(*args, **kwds)
    829       compiler = "xla" if self._experimental_compile else "nonXla"
    830       new_tracing_count = self.experimental_get_tracing_count()

~\anaconda3\lib\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
    886         # Lifting succeeded, so variables are initialized and we can run the
    887         # stateless function.
--> 888         return self._stateless_fn(*args, **kwds)
    889     else:
    890       _, _, _, filtered_flat_args = \

~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in __call__(self, *args, **kwargs)
   2940       (graph_function,
   2941        filtered_flat_args) = self._maybe_define_function(args, kwargs)
-> 2942     return graph_function._call_flat(
   2943         filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
   2944 

~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1916         and executing_eagerly):
   1917       # No tape is watching; skip to running the function.
-> 1918       return self._build_call_outputs(self._inference_function.call(
   1919           ctx, args, cancellation_manager=cancellation_manager))
   1920     forward_backward = self._select_forward_and_backward_functions(

~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    553       with _InterpolateFunctionError(self):
    554         if cancellation_manager is None:
--> 555           outputs = execute.execute(
    556               str(self.signature.name),
    557               num_outputs=self._num_outputs,

~\anaconda3\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     57   try:
     58     ctx.ensure_initialized()
---> 59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node sequential/activation/Relu (defined at <ipython-input-11-bfc2e5485c93>:9) ]] [Op:__inference_train_function_966]

Function call stack:
train_function

I wasted half a day dealing with this error. If I knew TensorFlow well I probably could have fixed it right away... Well, I'm a PyTorch person, so it can't be helped (that's my excuse).

My environment is as follows:

  • Windows 10
  • Anaconda (Python 3.8) -> it ended up being Python 3.8.5 after installation
  • jupyter-lab
  • TensorFlow 2.4.0
  • CUDA 11.0
  • cuDNN 8.0.2

How I solved it

The important part of the error above is this line:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

But it's not actually cuDNN's fault. TensorFlow was hogging all of the GPU memory, so the GPU was basically saying:


Hey, there's no free memory left at all... I don’t have available memory any more!!

That's what was going on.

I put the following code at the very top of the notebook. That fixed it!

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
# Cap this process at 50% of the GPU's memory
config.gpu_options.per_process_gpu_memory_fraction = 0.5
# Allocate GPU memory on demand instead of grabbing it all up front
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

(I referred to Tensorflow v2 Limit GPU Memory usage #25138.)
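As an aside, TF 2.x also has a native API for the same thing, without going through `compat.v1`. A minimal sketch (I haven't verified this on my exact setup, so treat it as an assumption):

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of all at once.
# This must run before any op touches the GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```

Like the `ConfigProto` fix, this has to run before the first GPU operation; once memory has been allocated, the setting can no longer be changed.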

Lessons learned

I should have noticed this if I had just run nvidia-smi. In short, it happened because another process was already occupying the GPU memory: I ran a GPU job in notebook1, and without shutting that notebook down, tried to run another GPU job in notebook2. I'm pretty sure PyTorch doesn't do this kind of unreasonable GPU hogging (grabbing far more memory than it needs). My understanding of jupyter-lab and jupyter notebook was also lacking, or rather, I had forgotten. Come to think of it, something like this happened at my previous job too...

(base) PS C:\Users\cassea> nvidia-smi
Thu Mar 25 18:42:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 461.72       Driver Version: 461.72       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3060   WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   61C    P2    71W / 180W |  12196MiB / 12288MiB |     37%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
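As a sanity check, the `per_process_gpu_memory_fraction = 0.5` in the fix above maps onto the 12288 MiB total that nvidia-smi reports for this card:

```python
# Rough per-process cap implied by the fix on this card
total_mib = 12288   # total VRAM reported by nvidia-smi above
fraction = 0.5      # per_process_gpu_memory_fraction from the fix
cap_mib = int(total_mib * fraction)
print(cap_mib)      # 6144 MiB per TensorFlow process
```

So two notebooks capped at 0.5 each can coexist, which is exactly the situation that blew up without the cap.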

Well, I'll chalk it up as a good learning experience. Leaving this here as a note to my future self.
