NVIDIA module taints Linux kernel

Problem

I installed CUDA 7.0 on Ubuntu running on an Aftershock notebook with an NVIDIA graphics card. As part of this, the NVIDIA graphics drivers were upgraded to version 346. To my pleasant surprise, the graphics card was now directly visible to the Linux kernel. There was no longer any need to use Bumblebee.

However, I started noticing that this Ubuntu installation would not always boot into Unity. On many cold starts, Ubuntu would show this error:

After displaying this it would get stuck at the Ubuntu bootup screen.

I also noticed that I could boot up if I first booted into another Ubuntu instance I had on this notebook and later restarted and booted into the current Ubuntu instance.

Solution

Update: I no longer have this problem after installing CUDA 7.5 and the NVIDIA 352 drivers that come along with it on a fresh Ubuntu 15.04 system. I still see the syslog errors, but they no longer stop Ubuntu from booting successfully and the GPU/CUDA can be used without problems. Yay! 😄

Old stuff:

To analyse this problem, I extracted the relevant portions of /var/log/syslog for two cases: a boot where Ubuntu came up correctly and a boot where it threw the above kernel panic error. These syslog entries can be seen here.

What I found was that there was some kind of a race condition at boot time. If the nvidia-drm module registered early enough with the kernel, then everything was fine. Otherwise, the kernel would complain that the NVIDIA module was tainting it and then it would throw up the above error.
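The taint itself can be confirmed on a running system. A minimal sketch, assuming a standard Linux /proc layout: the kernel exposes its taint flags as a bitmask in /proc/sys/kernel/tainted, and bit 0 (value 1) is set when a proprietary module such as the NVIDIA one has been loaded. The helper name has_proprietary_taint is made up for illustration.

```shell
# Hypothetical helper: succeeds if bit 0 (proprietary module) is set in taint mask $1.
has_proprietary_taint() {
    [ $(( $1 & 1 )) -ne 0 ]
}

# Read the current taint mask, if this system exposes it.
if [ -r /proc/sys/kernel/tainted ]; then
    mask=$(cat /proc/sys/kernel/tainted)
    if has_proprietary_taint "$mask"; then
        echo "kernel is tainted by a proprietary module (mask: $mask)"
    else
        echo "no proprietary-module taint (mask: $mask)"
    fi
fi
```

Note that this only shows that the module was loaded. It says nothing about whether the boot-time race was hit.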

The problem seems to lie in the Read-Copy-Update mechanism of the kernel. Here, some optimizations seem to have been added in recent versions to improve energy efficiency. RCU wakes up the CPUs only after a period of RCU_IDLE_GP_DELAY jiffies, as explained here. This is set to 4 by default, as seen here.
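To get a feel for how long that delay is, jiffies can be converted to milliseconds. This is a rough sketch, assuming the kernel tick rate is recorded as CONFIG_HZ in /boot/config-$(uname -r), which is where Ubuntu normally installs the kernel config. The helper jiffies_to_ms is hypothetical, and the 250 Hz fallback is only an assumption used when that file cannot be read.

```shell
# Hypothetical helper: convert jiffy count $1 to milliseconds for tick rate $2 (HZ).
jiffies_to_ms() {
    echo $(( $1 * 1000 / $2 ))
}

# Look up CONFIG_HZ for the running kernel; fall back to an assumed 250 if unavailable.
config="/boot/config-$(uname -r)"
hz=""
if [ -r "$config" ]; then
    hz=$(sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "$config" | head -n 1)
fi
hz=${hz:-250}

echo "default delay (4 jiffies): $(jiffies_to_ms 4 "$hz") ms"
echo "reduced delay (1 jiffy):   $(jiffies_to_ms 1 "$hz") ms"
```

With a 250 Hz kernel, this works out to 16 ms for the default of 4 jiffies and 4 ms for 1 jiffy.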

The solution going around the web for this problem is to decrease this delay to 1 jiffy, which makes the race condition less likely to bite. Thankfully, we do not need to edit the Linux kernel code and recompile to do this! A sysfs entry, rcu_idle_gp_delay, was added for runtime manipulation, as explained here. If we set this to 1, the chance of hitting this error drops considerably.

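To do this, set the following line in /etc/default/grub. If GRUB_CMDLINE_LINUX already has options in it, add this parameter to the existing ones instead of replacing them: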

GRUB_CMDLINE_LINUX="rcutree.rcu_idle_gp_delay=1"

Then run sudo update-grub and reboot, since the parameter only takes effect on the next boot. Hopefully, this fixes the race condition so that every boot is successful.
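After rebooting, it is worth checking that the setting actually took effect. This is a minimal sketch, assuming the parameter is passed on the kernel command line as above and that the running kernel exposes it under /sys/module/rcutree/parameters. Kernels that do not have this parameter will not have that file. The helper cmdline_param is hypothetical.

```shell
# Hypothetical helper: print the value of parameter $1 found in command-line string $2.
cmdline_param() {
    for tok in $2; do
        case "$tok" in
            "$1"=*) echo "${tok#*=}"; return 0 ;;
        esac
    done
    return 1
}

param="rcutree.rcu_idle_gp_delay"

# 1. Was the parameter passed to the kernel at boot?
if [ -r /proc/cmdline ]; then
    if value=$(cmdline_param "$param" "$(cat /proc/cmdline)"); then
        echo "boot command line sets $param=$value"
    else
        echo "$param is not on the boot command line"
    fi
fi

# 2. What value is the running kernel using?
sysfs="/sys/module/rcutree/parameters/rcu_idle_gp_delay"
if [ -r "$sysfs" ]; then
    echo "current value: $(cat "$sysfs")"
fi
```

If the first check reports the parameter as missing, the usual culprits are a typo in /etc/default/grub or a forgotten update-grub run.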

Tried with: NVIDIA GTX 765M, Linux 3.13.0-44-generic and Ubuntu 14.04