If you have been wondering why blog posts have been scarce lately, it is partly because my computer blew up again. Yes, the new one that I bought in September. If you have been following along, you might remember that last year I blew a video card in my old rig. I managed to squeeze maybe six more months of use out of that old rig by putting in a new video card, until the motherboard died in August. In September I got a brand new machine, and it started having issues on December 5.
I figured I would post about the symptoms and my experience here in case anyone else decides to buy an Alienware Aurora-R4 with a dual NVIDIA GeForce GTX 780 setup, only to have it die a few months later.
The problem started while I was playing a game (FarCry 4, for reference) and it completely froze up. It was a hard lock-up with a non-responsive keyboard and speakers stuck repeating a single bleep over and over again. The video winked out a few seconds later and my monitor dutifully displayed a “NO DVI SIGNAL” message, but the speakers kept on going. I ended up having to power cycle the machine just to get rid of the noise.
This was kinda odd, since FarCry 4 has been remarkably polished and bug free (as it should be, since it is basically FarCry 3 with a palette swap), so such a hard crash was unexpected. But the machine rebooted just fine, and since it was already late I thought nothing of it, logged off and went to sleep, assuming this was the universe’s way of telling me to get off the computer.
The next day I was doing something in Photoshop, and the machine did it again: all of a sudden my screen went blank, and about 30 seconds later I saw the BIOS POST screen as the computer rebooted itself. Again, I was a bit concerned, but after it powered up it was fine, and I was unable to reproduce the crash by just toying around in Photoshop, so I wrote it off as a one-time glitch.
It wasn’t until I went back to FarCry 4 that I saw a persistent issue. Every time I started the game it would load up, show me the main menu, let me load a saved game, display a progress bar, and then as soon as the actual game started the screen would go blank. I would then get the “NO DVI SIGNAL” message from my monitor, followed by a reboot shortly after. This happened every single time.
As soon as I had a reproducible issue, I started digging. The first place I went was the Windows Event Viewer which, unsurprisingly, was full of critical Kernel-Power errors. I checked the timing, and each of them coincided with a hard crash and reboot. They all looked more or less like this:
Log Name: System
Source: Microsoft-Windows-Kernel-Power
Level: Critical
Description: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
BugcheckCode: 278
BugcheckParameter1: 0xfffffa80140da4e0
BugcheckParameter2: 0xfffff8800fc16828
BugcheckParameter3: 0xffffffffc000009a
BugcheckParameter4: 0x4
SleepInProgress: false
PowerButtonTimestamp: 0
This was not very helpful on its own, but after some research I found out that Bugcheck 278 is simply the decimal form of BSOD stop code 0x116, also known as VIDEO_TDR_ERROR (Microsoft's bug check reference lists it as VIDEO_TDR_FAILURE). The most approachable description of this issue I found was:
This indicates that an attempt to reset the display driver and recover from a timeout failed.
In other words, it was a video issue that would normally result in a blue screen of death, but since it took down the entire video processing stack, said BSOD could never actually be displayed. Possible causes of this error were as follows:
- Bad video driver acting up (not unusual from NVIDIA)
- Bad RAM chip causing discrepancies when syncing with VRAM
- Bad video card
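A quick aside on the numbers in that event log entry, since they confused me at first: Event Viewer reports BugcheckCode in decimal, while BSOD stop codes are normally quoted in hex. Below is a minimal Python sketch of that conversion. The helper names (parse_bugcheck_fields, stop_code, low32) are made up for illustration and the input is just the entry quoted above, so treat it as a rough sanity check and not a proper diagnostic tool. The note about parameter 3 is my own reading of Microsoft's bug check and NTSTATUS references, not something the log itself states.

```python
import re

# Bugcheck fields copied from the Event Viewer entry quoted above.
EVENT_TEXT = (
    "BugcheckCode: 278 "
    "BugcheckParameter1: 0xfffffa80140da4e0 "
    "BugcheckParameter2: 0xfffff8800fc16828 "
    "BugcheckParameter3: 0xffffffffc000009a "
    "BugcheckParameter4: 0x4"
)


def parse_bugcheck_fields(text):
    """Return {field name: integer value} for each 'BugcheckXxx: value' pair."""
    pairs = re.findall(r"(Bugcheck\w+):\s*(0x[0-9a-fA-F]+|\d+)", text)
    # int(value, 0) accepts both plain decimal ("278") and 0x-prefixed hex.
    return {name: int(value, 0) for name, value in pairs}


def stop_code(decimal_code):
    """Format a decimal BugcheckCode the way stop codes are usually written."""
    return f"{decimal_code:#x}"


def low32(value):
    """Keep the low 32 bits of a sign-extended 64-bit value."""
    return value & 0xFFFFFFFF


fields = parse_bugcheck_fields(EVENT_TEXT)
print(stop_code(fields["BugcheckCode"]))            # 0x116
print(hex(low32(fields["BugcheckParameter3"])))     # 0xc000009a
```

If I am reading the references right, for stop code 0x116 the third parameter holds the status code of the last failed operation, and 0xC000009A is STATUS_INSUFFICIENT_RESOURCES. That did not narrow down the cause for me, but it does at least confirm that the Kernel-Power entry and the 0x116 stop code are the same crash.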
So I went down this list, trying to nail down the exact issue. First, I upgraded to the latest NVIDIA driver. I actually don’t remember which version I had when I started the process, but I knew it was slightly behind. So I downloaded the latest and greatest and installed it. This did not solve the problem. I decided to go the other way and tried four previous versions of the driver, as well as two previous beta versions. None of them got rid of the crashes. It’s probably worth noting I was doing “clean” installs, meaning I would uninstall the current driver, reboot, and then install another one to avoid weird conflicts.
Next I tried running the Dell pre-boot diagnostics. This is built into all Dell machines and is usually available from the selective boot menu (accessed by mashing F12 during POST). It doesn’t really do anything useful on its own, but in case of detectable hardware failures it typically spits out an error code which can be given to Dell tech support, circumventing a lot of bullshit like checking if the computer is plugged in, wiggling the wires, etc. Not only that – the Dell warranty support drones usually like to tell you to run the hour-long extended test anyway and refuse to stick around on the phone while you do, necessitating a call-back.
Unfortunately, the pre-boot diagnostics module gave my computer a clean bill of health. Granted, it did not really have any extended tests it could run on the video cards – it would simply check if they were present and responding. It did, however, confirm that there were no issues with the memory. Just to double check that, I booted from a MemTest CD and ran it for about 12 hours (started in the evening, finished the next day when I came back from work) and it did not show any errors.
The Alienware machine also came with something called Alien Autopsy which is yet another diagnostic tool. This one is a bit friendlier, since it does not require you to reboot your machine, and it also has seemingly more thorough tests for the video cards. So I decided to run that as well.
The video testing involves a thorough VRAM test and a few video benchmarks during which it renders some spaceships on the screen, spins them around, and tests real-time shaders, transparency, the graphics pipeline, etc. As soon as I started running those, my machine started crashing and rebooting itself. The failure was reproducible and consistently happened about half-way through the benchmarks. I couldn’t pin down the crash to a single benchmark or test case, but I ran the suite about 20 times and never managed to get through all of the tests without the machine shutting down on me. At this point I was fairly confident it was an issue with one of the video cards.
Armed with that evidence, I phoned the Dell Alienware support line and gave them all of the details outlined above. The guy on the other end of the line listened to my spiel, looked through his notes and admitted I had covered pretty much all the bases. He made me check my BIOS version to see if it needed to be updated, but it turned out I had the latest and greatest one. So he agreed I needed the video cards replaced. I was expecting him to tell me to disable SLI and start pulling cards out to narrow down which one was faulty, but he just set up a dispatch to replace both of my cards.
Luckily I purchased the next business day on-site service warranty, so it only took them a week and a half to get it fixed:
— Luke Maciak (@LukeMaciak) December 9, 2014
I’m happy to report that replacing the cards completely fixed my issue. I was a little concerned this was going to turn out to be a motherboard problem – because knowing my luck it would. But I haven’t seen the dreaded Bugcheck 278 crash since the new cards were installed. I’m currently trying to finish FarCry 4 so that I can go through some of my Steam Holiday Sale backlog, and probably Dragon Age: Inquisition.
I also have a few book and comic reviews in the pipeline, and I’ve been toying around with the idea of doing a Ravenflight-style series but for an SF-themed setting. So I’m not dead; do not unsubscribe from the blog yet.