NVRM: Xid: 79, GPU has fallen off the bus

Writer Emily Wong

I am trying to do some deep learning on my GeForce GTX 980 Ti GPU. I have a 658W power supply, but when I start running TensorFlow, I get the following error in dmesg:

[ 158.598263] ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
[ 158.598268] ata2: irq_stat 0x00400040, connection status changed
[ 158.598271] ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
[ 158.598277] ata2: hard resetting link
[ 159.602605] NVRM: GPU at PCI:0000:01:00: GPU-e29ec6c5-5146-95c4-f09c-68b96546640b
[ 159.602609] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 159.602613] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 159.602623] NVRM: A GPU crash dump has been created. If possible, please run NVRM: nvidia-bug-report.sh as root to collect this data before NVRM: the NVIDIA kernel module is unloaded.
[ 164.230199] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 164.237244] ata2.00: configured for UDMA/133
[ 164.237248] ata2: EH complete

It looks like a small power surge that knocks out both my hard drive and my graphics card. So I wonder: could I ramp up the GPU gradually, so that its power draw increases slowly and does not create this surge?

I use Ubuntu 16.04.1 with the 4.8.0-34-generic kernel and NVIDIA driver version 375.26.

nvidia-smi
Tue Feb 7 15:02:47 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
|  0%   42C    P0    56W / 275W |      0MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I tried connecting the GPU to its own power supply (an older 750W unit that I cannot use directly with this motherboard), but a similar thing happens:

[ 81.865432] NVRM: GPU at PCI:0000:01:00: GPU-e29ec6c5-5146-95c4-f09c-68b96546640b
[ 81.865437] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 81.865474] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 81.865484] NVRM: A GPU crash dump has been created. If possible, please run NVRM: nvidia-bug-report.sh as root to collect this data before NVRM: the NVIDIA kernel module is unloaded.

And the extra power supply turns off. So it seems the power supplies really do not like it when the GPU becomes active.

3 Answers

According to the Xid errors list (check the PDF), error 79 (GPU has fallen off the bus) can be related to a variety of things, such as a driver or hardware issue, system memory corruption, a bus error, or a thermal issue (overheating).
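
To see which Xid codes the driver has reported and when, the NVRM lines can be pulled out of the kernel log. This is only a rough sketch: the regular expression is modelled on the log lines quoted in the question, and the names XID_RE, SAMPLE and find_xid_events are made up for illustration.

```python
import re

# Pattern modelled on the dmesg lines in the question, e.g.
# [ 159.602609] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
# Other driver versions may format the line differently (assumption).
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>[^)]+)\): "
    r"(?P<code>\d+),\s*(?P<msg>.*)"
)

# Three lines copied from the dmesg output in the question.
SAMPLE = """\
[ 158.598277] ata2: hard resetting link
[ 159.602605] NVRM: GPU at PCI:0000:01:00: GPU-e29ec6c5-5146-95c4-f09c-68b96546640b
[ 159.602609] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
"""


def find_xid_events(dmesg_text):
    """Return (timestamp, pci_address, xid_code, message) for each Xid line."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((
                float(match.group("ts")),
                match.group("pci"),
                int(match.group("code")),
                match.group("msg").strip(),
            ))
    return events


for event in find_xid_events(SAMPLE):
    print(event)
```

On a real machine the text would come from dmesg rather than from SAMPLE, for example by saving the output of dmesg to a file and reading it in.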

Run the NVIDIA X Server Settings app (which comes with the drivers) and check the temperature, graphics clock, performance levels and GPU utilization levels.
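
If you would rather log these values from a terminal while TensorFlow starts up, nvidia-smi can print them as CSV through its --query-gpu option. The sketch below is one possible way to read that output, not something taken from the thread; the field list and the function names are my own choice, and on GeForce cards some fields may come back as "[Not Supported]".

```python
import subprocess

# Fields requested from nvidia-smi --query-gpu. The selection is an
# assumption about what is useful here; see nvidia-smi --help-query-gpu.
FIELDS = ["temperature.gpu", "clocks.gr", "pstate",
          "utilization.gpu", "power.draw"]


def parse_gpu_csv(line):
    """Turn one CSV line from nvidia-smi into a {field: value} dict."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (values as strings)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.check_output(cmd, universal_newlines=True)
    return [parse_gpu_csv(line) for line in output.splitlines() if line.strip()]
```

Calling query_gpus() in a loop with a short sleep gives a simple log of temperature and power draw in the seconds before the GPU drops off the bus.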

The following post (based on this original thread) suggests disabling PCIe ASPM (Active State Power Management) by adding pcie_aspm=off to the boot parameters, which forcibly disables PCIe ASPM.
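
To confirm whether the parameter is actually in effect on the running kernel, you can inspect the kernel command line and the current ASPM policy. A minimal sketch follows, assuming the usual locations /proc/cmdline and /sys/module/pcie_aspm/parameters/policy (the latter may not exist on every kernel build); the helper names are hypothetical.

```python
def cmdline_disables_aspm(cmdline):
    """True if pcie_aspm=off appears as its own kernel parameter."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the policy shown in [brackets], or None if none is marked.

    The policy file lists all policies on one line and marks the active
    one with square brackets, e.g. "default performance [powersave]".
    """
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_text(path):
    """Read a small text file such as /proc/cmdline."""
    with open(path) as handle:
        return handle.read()
```

Typical use would be cmdline_disables_aspm(read_text("/proc/cmdline")) and active_aspm_policy(read_text("/sys/module/pcie_aspm/parameters/policy")).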

Related bug report: GPU has fallen off the bus.

How I resolved it is definitely not the best way, but it got the job done and I no longer see this issue. It can have multiple causes and nobody has a definitive fix; I tried lots of suggestions and nothing helped. Open the NVIDIA app on Ubuntu, disable NVIDIA support and switch to Intel graphics: it is more power-efficient and won't cause this issue. Most development work on Linux does not need much GPU, so when you do need it, plug the laptop into the charger, enable NVIDIA from the app, do your work, and switch back to Intel when the job is done. This is the workaround I'm following at the moment.

I had this problem on a Dell Precision M6700. The solution was simple, although it took me ages to find: taking the battery out (and no, I was not running on battery; the laptop was connected to the power supply at all times).

The idea came from reading that the error "Xid: 79, GPU has fallen off the bus" can also be related to power supply issues. Apparently the battery was not operating correctly, and that was probably the cause.

Later on I found out that adding pcie_aspm=off to GRUB does the trick as well (see ).
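
For anyone unsure what "adding it to GRUB" means in practice: on Ubuntu the parameter normally goes into the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub. Here is a small sketch of that edit as a pure text transformation; the function name is made up, and it only returns the new text rather than writing the file.

```python
import re

# Matches the GRUB_CMDLINE_LINUX_DEFAULT="..." line of /etc/default/grub.
CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
)


def add_kernel_param(grub_text, param="pcie_aspm=off"):
    """Return grub_text with param appended to the default kernel command line.

    The text is returned unchanged if the parameter is already present or
    if no GRUB_CMDLINE_LINUX_DEFAULT line is found.
    """
    def append_param(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return CMDLINE_RE.sub(append_param, grub_text, count=1)


example = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(example))
```

After changing the real file (by hand or otherwise), run sudo update-grub and reboot for the parameter to take effect.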
