BSOD a day

Hi guys,

Hopefully someone here can help me sort this out.

I'm getting about one or two BSODs a day.

I've uploaded 2 minidumps in the hope that some kind soul can take a look at them and give me a clue. I tried WhoCrashed and tried to figure it out on my own, but I feel like a deer in the headlights right before the thump.....

BTW: I've already installed the latest drivers, run Windows Update, and done an sfc /scannow check. Everything seems to be in order until I get a BSOD :(

Any guidance or direction to keep on researching would be appreciated!!!

I've uploaded them to skydrive

http://sdrv.ms/14gKHs3

 

 

http://sdrv.ms/14gKJQu




Ok, so the processor is still in play.

Temps show within range, but it's tough to find the cause when diagnostic tools are iffy.

Is there any way to know definitively (without changing out the processor)?

As far as doing a refresh goes, it would take me hours to reinstall all the desktop apps. For that I might as well do a clean install and start from scratch. A weekend affair. What's worse is that a reinstall is no guarantee that W8 will be stable afterwards.

Sometimes it seems that Windows is a Divine mystery with no clear-cut answers. It's frustrating. Especially when you think everything is working fine and then BAM.

Reminds me of Al Pacino in GF3.

Here's hoping that the next freeze yields some light with the Kernel dmp and driver verifier.

Thanks guys,

Tanger


Hi,

Did you contact EVGA Support and Forums to see if there were known issues?

Did you do a Refresh or a Reset?

Did you try a fan even though the heat levels are reasonable? Is the processor fan working?

You can try removing the heat sink, replacing the thermal compound, and then reseating the heat sink.

Rob - SpiritX



 

Rob Brown - past Microsoft MVP - Windows Insider MVP 2016 - 2021
Microsoft MVP Windows and Devices for IT 2009 - 2020


Hi Rob,

Yes, I checked the EVGA forums on the MB and the cards. No major problems and I am using the latest BIOS/Chipset and graphic drivers.

I have not done a refresh or reset because that would take hours of reinstalling programs with no assurance that the problem would be solved (especially if it's a hardware issue).

Processor fan is working and temps are within range for both idle and load (actually running below max).

I have not tried removing the heat sink because I'm not 100% sure it is a hardware issue (and the temps being within range suggest that the heat sink is working fine).

I'm holding out hope against hope that driver verifier or the kernel dump will reveal the only mystery the Divine Intelligence itself can't seem to answer. How the world started, how it will end and why we are here...no problem...BSODs...sorry, no can do...:)

Just having a little fun to alleviate the frustration.

Thanks again guys,

Tanger


Add a fan anyway as a heat related error does not always mean excessive or unexpected levels.
Some component could be too sensitive to the normal levels and fail. Sometimes adding some
extra cooling discovers that.

Well the Big Bang was heat related. :)

Rob - SpiritX



Well the Big Bang was heat related. :)

Rob - SpiritX

And less mysterious than BSOD errors :)

I'll take off the cover and see how that affects temps. I don't have any more space to add fans. (it's a big case and I had already added a second optional fan on top....fans are maxed out ..and yes, all are working)

I could try fans that move MORE air, I suppose.

Tanger


Hi,

Use a BIG FAN blowing into the unit, as you want to change the temps dramatically, or as much as possible.

Rob - SpiritX



Ok here we go, a kernel dmp of a clock watchdog bsod.

http://www.mediafire.com/download/fff98t5bqmh5yly/091213-10998-01.dmp

I'm so hoping this reveals the problem. No fun working on a computer that you don't trust.

Thanks guys!

Tanger

P.S. I had installed CPUID HWMonitor and it showed all the temps as normal (and even a bit cool) for everything (MB, CPU, graphics cards) at the time of the BSOD.

P.P.S. I think, but I'm not sure, that the clock watchdog BSOD started showing up after I updated the Silicon Image drivers. (I was doing research and found someone who had the same issue, where a driver was causing the CPU to BSOD.) I thought back, and other than updating the graphics drivers, the only major change has been the Silicon Image drivers. No idea, just trying to offer as many clues and as much info as possible.


Hi,

The DMP file you attached is a Minidump - Mini Kernel Dump File: Only registers and stack trace are available

Did you make the change from Small memory dump to Kernel in the System settings?

In regards to the silicon drivers, you can do a rollback if you wish for troubleshooting purposes.

Regards,

Patrick
Debugger/Reverse Engineer.


Sorry Patrick, I uploaded the wrong file.

I did change to kernel dump but I uploaded the file from the minidmp folder instead of the correct file from the root.

Here it is (all 700 megs of it):

http://sdrv.ms/1eIFGup

If it's the silicon driver, I'm prepared to remove it and the drive it supports altogether for the sake of a stable system.

Thanks!

Tanger


Hi,

My apologies for the late reply, forums have been going through updates today and I had to actually update my Debugging tools because the kernel wasn't working.

Please note that this will be a fairly long post because *101 bugchecks are very complicated and I'd like to provide as much information as my knowledge permits.

Right, so as per usual, the attached DMP file is of the CLOCK_WATCHDOG_TIMEOUT (101) bugcheck.

BugCheck 101, {19, 0, fffff880017e5180, 6}

The first parameter, 19 (the debugger prints these values in hex), is the timeout interval in clock ticks.

fffff880017e5180 is the PRCB address of the hung processor, let's keep this address in mind.
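For anyone who wants to pull the parameters apart outside the debugger, here is a minimal sketch. It assumes the "BugCheck CODE, {p1, p2, p3, p4}" layout shown above and that every value is printed in hex; the function name parse_bugcheck is made up for this example and is not part of any debugging tool.

```python
import re

# Sample line copied from the debugger output in this thread.
LINE = "BugCheck 101, {19, 0, fffff880017e5180, 6}"


def parse_bugcheck(line):
    """Split a 'BugCheck CODE, {p1, p2, p3, p4}' line into integers.

    Assumption: the code and all parameters are hexadecimal text.
    """
    match = re.match(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if match is None:
        raise ValueError("not a BugCheck line: " + line)
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


code, params = parse_bugcheck(LINE)
print(hex(code))       # bugcheck code
print(params[0])       # timeout interval in clock ticks, as a decimal number
print(hex(params[2]))  # PRCB address of the unresponsive processor
print(params[3])       # fourth parameter
```

Read this way, the "19" in the first slot is hex 0x19, which is 25 in decimal.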

Running a !prcb on processor 0:

0: kd> !prcb 0
PRCB for Processor 0 at fffff8007377a180:
Current IRQL -- 13
Threads--  Current fffff800737d4880 Next 0000000000000000 Idle fffff800737d4880
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 05b0f044
Times -- Dpc    00000742 Interrupt 0000020e
         Kernel 000e923f User      0001143e

No match for address, let's try processor 1 this time:

0: kd> !prcb 1
PRCB for Processor 1 at fffff880009bf180:
Current IRQL -- 0
Threads--  Current fffffa8013675040 Next 0000000000000000 Idle fffff880009caf40
Processor Index 1 Number (0, 1) GroupSetMember 2
Interrupt Count -- 05a1e94a
Times -- Dpc    00000004 Interrupt 00000047
         Kernel 000f2ca2 User      000079cc

Nope, no match either. I'll spare you the space in the post and tell you that processor #6 is the one we're looking for :+)

0: kd> !prcb 6
PRCB for Processor 6 at fffff880017e5180:
Current IRQL -- 0
Threads--  Current fffff880017f0f40 Next fffffa8010eb1b00 Idle fffff880017f0f40
Processor Index 6 Number (0, 6) GroupSetMember 40
Interrupt Count -- 06019b1f
Times -- Dpc    000017ef Interrupt 000003d8
         Kernel 000ed494 User      0000cf3c

For reference, I did not do !prcb 0 through 6. That would have been very tedious. Instead, you can run the !running -it command. The "i" argument causes it to display idle procs too, and "t" displays the stack trace for the thread running on each proc.
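If you have saved the !prcb output to a text file, the address matching done by eye above can be sketched as a small script. This is only an illustration: the header format is taken from the output in this post, the helper name find_processor is hypothetical, and it assumes the processor number in the header is decimal.

```python
import re

# Header lines copied from the !prcb output above (three processors only).
PRCB_OUTPUT = """\
PRCB for Processor 0 at fffff8007377a180:
PRCB for Processor 1 at fffff880009bf180:
PRCB for Processor 6 at fffff880017e5180:
"""


def find_processor(prcb_text, target_address):
    """Return the processor whose PRCB address equals target_address, or None."""
    pattern = re.compile(r"PRCB for Processor (\d+) at ([0-9a-fA-F]+):")
    for number, address in pattern.findall(prcb_text):
        if int(address, 16) == target_address:
            return int(number)  # assumption: the number is printed in decimal
    return None


# Third bugcheck parameter from this dump.
hung = find_processor(PRCB_OUTPUT, 0xFFFFF880017E5180)
print(hung)
```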

Hint: At times, the 4th parameter of the bugcheck will show you the responsible processor. For example, in your *101 here, it was correct as the 4th parameter was 6.

As this matches the 3rd parameter of the bugcheck, processor #6 is the responsible processor. With the information we have so far, we know that processor #6 reached 19 clock ticks without responding, and therefore the system bugchecked.

Before we go further, what is a clock tick? A clock interrupt is a form of interrupt which involves counting the cycles of the processor core, which is running a clock on the processors to keep them all in sync. A clock interrupt is handed out to all processors and then they must report in, and when one doesn't report in, you then crash.

If we look specifically at processor #6, we can see it did...well... nothing:

  6    fffff880017e5180  fffff880017f0f40 ( 0) fffffa8010eb1b00 (15) fffff880017f0f40  ................

Child-SP          RetAddr           Call Site
00000000`00000000 00000000`00000000 0x0

Now how and why did this take place? First, let's check the IRQL of each one of the processors before the system crash:

0: kd> !irql 0
Debugger saved IRQL for processor 0x0 -- 13
0: kd> !irql 1
Debugger saved IRQL for processor 0x1 -- 0 (LOW_LEVEL)
0: kd> !irql 2
Debugger saved IRQL for processor 0x2 -- 0 (LOW_LEVEL)
0: kd> !irql 3
Debugger saved IRQL for processor 0x3 -- 0 (LOW_LEVEL)
0: kd> !irql 4
Debugger saved IRQL for processor 0x4 -- 0 (LOW_LEVEL)
0: kd> !irql 5
Debugger saved IRQL for processor 0x5 -- 0 (LOW_LEVEL)
0: kd> !irql 6
Debugger saved IRQL for processor 0x6 -- 0 (LOW_LEVEL)

As you can see, the IRQL of the first processor is 13 (which is CLOCK for x64 processors) and the rest are all 0. So we can see that only Processor 0 was at CLOCK level.
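The same check can be done mechanically on saved !irql output. A rough sketch, assuming the line format shown above (processor index in hex, IRQL in decimal); processors_at_level is a made-up helper name:

```python
import re

# Copied from the !irql output above.
IRQL_OUTPUT = """\
Debugger saved IRQL for processor 0x0 -- 13
Debugger saved IRQL for processor 0x1 -- 0 (LOW_LEVEL)
Debugger saved IRQL for processor 0x2 -- 0 (LOW_LEVEL)
Debugger saved IRQL for processor 0x3 -- 0 (LOW_LEVEL)
Debugger saved IRQL for processor 0x4 -- 0 (LOW_LEVEL)
Debugger saved IRQL for processor 0x5 -- 0 (LOW_LEVEL)
Debugger saved IRQL for processor 0x6 -- 0 (LOW_LEVEL)
"""

CLOCK_LEVEL = 13  # CLOCK on x64, as stated in this post


def processors_at_level(text, level):
    """List the processors whose saved IRQL equals the given level."""
    pattern = re.compile(r"processor 0x([0-9a-fA-F]+) -- (\d+)")
    return [int(cpu, 16) for cpu, irql in pattern.findall(text)
            if int(irql) == level]


print(processors_at_level(IRQL_OUTPUT, CLOCK_LEVEL))
```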

Now that we have the IRQL, let's look at the call stack of the different processors for more info. Let's start with Processor 0 (warning, it's large):

fffff800`72396878 fffff800`7365beee nt!KeBugCheckEx
fffff800`72396880 fffff800`73520774 nt! ?? ::FNODOBFM::`string'+0x14543
fffff800`72396900 fffff800`73438eca nt!KeUpdateTime+0x2ec
fffff800`72396ae0 fffff800`734d573a hal!HalpTimerClockInterrupt+0x86
fffff800`72396b10 fffff800`73507fe9 nt!KiInterruptDispatchNoLockNoEtw+0x1aa
fffff800`72396ca0 fffff800`7353708c nt!KeFlushMultipleRangeTb+0x290
fffff800`72396ea0 fffff800`7360ec08 nt!MiFlushPteList+0x2c
fffff800`72396ed0 fffff800`736f47e9 nt!MmFreeSpecialPool+0x2ec
fffff800`72397010 fffff880`01a73b7b nt!ExFreePool+0x6d8
fffff800`723970f0 fffff880`01b701eb ndis!NdisFreeCloneNetBufferList+0x6b
fffff800`72397140 fffff880`01ca1ff6 NETIO!NetioDereferenceNetBufferList+0xcb
fffff800`723971d0 fffff880`01caa115 tcpip!WfpProcessInTransportStackIndication+0xabb
fffff800`723977e0 fffff880`01ca1198 tcpip!InetInspectReceiveDatagram+0x255
fffff800`72397900 fffff880`01c9fd4b tcpip!UdpBeginMessageIndication+0x78
fffff800`72397a50 fffff880`01c9f67e tcpip!UdpDeliverDatagrams+0x18b
fffff800`72397be0 fffff880`01c9c082 tcpip!UdpReceiveDatagrams+0x1a4
fffff800`72397cf0 fffff880`01c9c338 tcpip!IppDeliverListToProtocol+0xf2
fffff800`72397da0 fffff880`01ca03bb tcpip!IppProcessDeliverList+0x68
fffff800`72397e50 fffff880`01c9de11 tcpip!IppReceiveHeaderBatch+0x21b
fffff800`72397f80 fffff880`01c9f253 tcpip!IpFlcReceivePackets+0x641
fffff800`723981b0 fffff880`01caa2d9 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x2ce
fffff800`72398280 fffff800`735319a6 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0x119
fffff800`72398380 fffff800`73534405 nt!KeExpandKernelStackAndCalloutInternal+0xe6
fffff800`72398480 fffff880`01caa3ce nt!KeExpandKernelStackAndCalloutEx+0x25
fffff800`723984c0 fffff880`01a72b06 tcpip!FlReceiveNetBufferListChain+0xae
fffff800`72398540 fffff880`01a72560 ndis!ndisMIndicateNetBufferListsToOpen+0x126
fffff800`723985f0 fffff880`01a72843 ndis!ndisInvokeNextReceiveHandler+0x650
fffff800`723986c0 fffff880`050c23c9 ndis!NdisMIndicateReceiveNetBufferLists+0xd3
fffff800`72398770 fffff880`050b1a48 Rt630x64!MpHandleRecvIntPriVLanJumbo+0xb0d
fffff800`72398960 fffff880`01a732ff Rt630x64!MPHandleMessageInterrupt+0x35c
fffff800`723989d0 fffff880`01a7341c ndis!ndisMiniportDpc+0xff
fffff800`72398a60 fffff800`73504ca1 ndis!ndisInterruptDpc+0x9c
fffff800`72398af0 fffff800`735048e0 nt!KiExecuteAllDpcs+0x191
fffff800`72398c30 fffff800`735059ba nt!KiRetireDpcList+0xd0
fffff800`72398da0 00000000`00000000 nt!KiIdleLoop+0x5a

^^ The important frames in this stack are the ones walked through below.

Processors 0, 2, 3, and 4 all started with the IdleLoop routine, which is basically the start of the System Idle Process you see in Task Manager. Essentially all of these processors were sitting & waiting to do something.

We can see that Processor 0 went from:

nt!KiIdleLoop+0x5a - Waiting to do something.

to

nt!KiRetireDpcList+0xd0 - Function that will sit in a loop dequeuing DPCs from the current processor’s DPC queue and calling the callbacks. I will explain DPCs below.

hal!HalpTimerClockInterrupt+0x86 - We then eventually see that Processor 0 received an interrupt. This interrupt happened to be a clock interrupt.

nt!KeUpdateTime+0x2ec - The clock interrupt then involved updating the system time. This is something that is replicated across all processors so that all the processors update their own timers and things are kept track of. Remember, everything needs to be in sync!

nt!KeBugCheckEx - We also then finally see that Processor 0 was the processor that performed the bugcheck.
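When a stack is this long, it can help to list which modules appear in it. Below is a small sketch that does that for a few frames copied from the Processor 0 stack. The set of "Windows" module names is my own assumption for this example, not an official list, and both function names are hypothetical.

```python
# A few frames copied from the Processor 0 stack above
# (columns: Child-SP, RetAddr, Call Site).
STACK = """\
fffff800`72396878 fffff800`7365beee nt!KeBugCheckEx
fffff800`72396900 fffff800`73438eca nt!KeUpdateTime+0x2ec
fffff800`72396ae0 fffff800`734d573a hal!HalpTimerClockInterrupt+0x86
fffff800`723970f0 fffff880`01b701eb ndis!NdisFreeCloneNetBufferList+0x6b
fffff800`72398770 fffff880`050b1a48 Rt630x64!MpHandleRecvIntPriVLanJumbo+0xb0d
fffff800`72398a60 fffff800`73504ca1 ndis!ndisInterruptDpc+0x9c
fffff800`72398da0 00000000`00000000 nt!KiIdleLoop+0x5a
"""

# Assumption: modules treated as part of Windows for this sketch.
WINDOWS_MODULES = {"nt", "hal", "ndis", "netio", "tcpip", "dxgkrnl"}


def modules_in_stack(stack_text):
    """Return module names in first-seen order, top of the stack first."""
    seen = []
    for line in stack_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or "!" not in parts[2]:
            continue  # not a 'module!function' frame
        module = parts[2].split("!", 1)[0].strip()
        if module not in seen:
            seen.append(module)
    return seen


def third_party(modules):
    """Keep only the modules that are not in the assumed Windows set."""
    return [m for m in modules if m.lower() not in WINDOWS_MODULES]


modules = modules_in_stack(STACK)
print(modules)
print(third_party(modules))
```

A driver showing up in the stack only means it was running on that processor at the time; it is a lead to follow, not proof of fault.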

----------------------------------------------------------------------------------------------------------

What is a DPC? That is a Deferred Procedure Call, which is a Microsoft Windows operating system mechanism which allows high-priority tasks (e.g. an interrupt handler) to defer required but lower-priority tasks for later execution. This permits device drivers and other low-level event consumers to perform the high-priority part of their processing quickly, and schedule non-critical additional processing for execution at a lower priority.

DPCs are implemented by DPC objects which are created and initialized by the kernel when a device driver or some other kernel mode program issues requests for DPC. The DPC request is then added to the end of a DPC queue. Each processor has a separate DPC queue. DPCs have three priority levels: low, medium and high. By default, all DPCs are set to medium priority. When Windows drops to an IRQL of Dispatch/DPC level, it checks the DPC queue for any pending DPCs and executes them until the queue is empty or some other interrupt with a higher IRQL occurs.

For example, when the clock interrupt is generated, the clock interrupt handler generally increments the counter of the current thread to calculate the total execution time of that thread, and decrements its quantum time remaining by 1. When the counter drops to zero, the thread scheduler has to be invoked to choose the next thread to be executed on that processor and dispatcher to perform a context switch. Since the clock interrupt occurs at a much higher IRQL, it will be desirable to perform this thread dispatching which is a less critical task at a later time when the processor's IRQL drops. So the clock interrupt handler requests a DPC object and adds it to the end of the DPC queue which will process the dispatching when the processor's IRQL drops to DPC/Dispatch level.
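To make the clock interrupt example above a bit more concrete, here is a toy model in Python. It is not kernel code and does not reflect how Windows is actually implemented: the class ToyProcessor and its methods are invented for illustration, and the only thing it shows is the idea of doing the urgent work at a high IRQL and draining queued DPCs once the IRQL falls.

```python
from collections import deque

DISPATCH_LEVEL = 2  # DPC/Dispatch IRQL
CLOCK_LEVEL = 13    # CLOCK IRQL on x64, as stated in this post


class ToyProcessor:
    """Toy model of one processor with its own DPC queue."""

    def __init__(self):
        self.irql = 0
        self.dpc_queue = deque()
        self.log = []

    def queue_dpc(self, routine):
        # A DPC request is added to the end of this processor's queue.
        self.dpc_queue.append(routine)

    def clock_interrupt(self):
        # The handler does the urgent part at CLOCK level and defers the rest.
        previous = self.irql
        self.irql = CLOCK_LEVEL
        self.log.append("clock tick handled at IRQL %d" % self.irql)
        self.queue_dpc(lambda: self.log.append(
            "thread dispatch ran at IRQL %d" % self.irql))
        self.lower_irql(previous)

    def lower_irql(self, new_irql):
        # Once the IRQL falls below DISPATCH_LEVEL, pending DPCs are run.
        self.irql = new_irql
        if new_irql < DISPATCH_LEVEL:
            self.drain_dpcs()

    def drain_dpcs(self):
        saved = self.irql
        self.irql = DISPATCH_LEVEL  # DPC routines run at Dispatch level
        while self.dpc_queue:
            routine = self.dpc_queue.popleft()
            routine()
        self.irql = saved


cpu = ToyProcessor()
cpu.clock_interrupt()
for entry in cpu.log:
    print(entry)
```

The deferred routine runs only after the clock handler has finished, which is the ordering the paragraph above describes.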

----------------------------------------------------------------------------------------------------------

Now, we can see the specific driver that requested the DPC is Rt630x64.sys, which is the Realtek PCI/PCIe Adapters driver.

So, that definitely starts us somewhere. Now, let's go further!

If we look at the call stack from Processor 5:


Child-SP          RetAddr           Call Site
fffff880`0e021360 fffff800`7353708c nt!KeFlushMultipleRangeTb+0x2a6
fffff880`0e021560 fffff800`7350aad0 nt!MiFlushPteList+0x2c
fffff880`0e021590 fffff800`735a3cdb nt!MiFreeWsleList+0x386
fffff880`0e0217b0 fffff800`735a3b83 nt!MiEmptyWorkingSetHelper+0xe7
fffff880`0e0217e0 fffff800`7360c0ce nt!MiEmptyWorkingSet+0xcb
fffff880`0e021890 fffff800`73ac313a nt!MiTrimAllSystemPagableMemory+0x266
fffff880`0e0218e0 fffff800`73ad61db nt!MmVerifierTrimMemory+0xca
fffff880`0e021910 fffff800`73ad583a nt!ViKeRaiseIrqlSanityChecks+0xdb
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for nvlddmkm.sys -
fffff880`0e021950 fffff880`045672b4 nt!VerifierKeAcquireInStackQueuedSpinLock+0xa6
fffff880`0e021990 fffff880`045bcd59 nvlddmkm+0x852b4
fffff880`0e0219e0 fffff880`04d432da nvlddmkm+0xdad59
fffff880`0e021f90 fffff880`03ce43fa nvlddmkm!nvDumpConfig+0x2396b2
fffff880`0e022040 fffff880`03ce30aa dxgkrnl!DXGCONTEXT::Render+0x41a
fffff880`0e022930 fffff800`734db453 dxgkrnl!DxgkRender+0x26a
fffff880`0e022c40 000007f9`4214118a nt!KiSystemServiceCopyEnd+0x13
000000a6`7a21df18 00000000`00000000 0x000007f9`4214118a

We can see two DirectX Kernel routine calls and then nvlddmkm.sys calls. nvlddmkm.sys is the nVidia video driver. So, let's put this all together now:

- Realtek PCI/PCIe Adapters driver in the stack

- DirectX Kernel in the stack

- nVidia video driver in the stack

From this, we can say:

1. Possible corrupt / buggy video card drivers:

Ensure you have the latest video card drivers. If you are already on the latest video card drivers, uninstall and install a version or a few versions behind the latest to ensure it's not a latest driver only issue. If you have already experimented with the latest video card driver and many previous versions, please give the beta driver for your card a try.

-- It's also possible that another device driver is corrupting the video card drivers, etc. As you mentioned this started happening right around the time the Silicon Image drivers were installed, it wouldn't hurt to uninstall that software temporarily.

2. Faulty video card or, if the video is integrated, a faulty motherboard.

3. Faulty RAM, often a culprit in regards to DirectX kernel and MMS crashes. Run a Memtest for NO LESS than ~8 passes (several hours):

Memtest86+:

Download Memtest86+ here:

http://www.memtest.org/

Which should I download?

You can either download the pre-compiled ISO that you would burn to a CD and then boot from the CD, or you can download the auto-installer for the USB key. What this will do is format your USB drive, make it a bootable device, and then install the necessary files. Both do the same job, it's just up to you which you choose, or which you have available (whether it's CD or USB).

How Memtest works:

Memtest86 writes a series of test patterns to most memory addresses, reads back the data written, and compares it for errors.
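The write / read back / compare idea can be sketched in a few lines of Python. This is only an illustration of the principle on a small buffer, not how Memtest86 itself is written; the pattern values and the StuckBitMemory class (a pretend faulty chip) are assumptions made up for the example.

```python
def pattern_test(memory, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to every cell, read it back, and collect mismatches."""
    errors = []
    for pattern in patterns:
        for address in range(len(memory)):
            memory[address] = pattern
        for address in range(len(memory)):
            value = memory[address]
            if value != pattern:
                errors.append((address, pattern, value))
    return errors


class StuckBitMemory:
    """Hypothetical faulty memory: bit 0 at one address always reads as 1."""

    def __init__(self, size, bad_address):
        self.cells = bytearray(size)
        self.bad_address = bad_address

    def __len__(self):
        return len(self.cells)

    def __setitem__(self, address, value):
        self.cells[address] = value

    def __getitem__(self, address):
        value = self.cells[address]
        return (value | 0x01) if address == self.bad_address else value


good = pattern_test(bytearray(16))        # healthy buffer
bad = pattern_test(StuckBitMemory(16, 5)) # one stuck bit at address 5
print(good)
print(bad)
```

Note that the stuck bit is only caught by the patterns that try to store a 0 in that position, which is one reason a real tester cycles through many different patterns.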

The default pass does 9 different tests, varying in access patterns and test data. A tenth test, bit fade, is selectable from the menu. It writes all memory with zeroes, then sleeps for 90 minutes before checking to see if bits have changed (perhaps because of refresh problems). This is repeated with all ones for a total time of 3 hours per pass.

Many chipsets can report RAM speeds and timings via SPD (Serial Presence Detect) or EPP (Enhanced Performance Profiles), and some even support changing the expected memory speed. If the expected memory speed is overclocked, Memtest86 can test that memory performance is error-free with these faster settings.

Some hardware is able to report the "PAT status" (PAT: enabled or PAT: disabled). This is a reference to Intel Performance acceleration technology; there may be BIOS settings which affect this aspect of memory timing.

This information, if available to the program, can be displayed via a menu option.

Any other questions, they can most likely be answered by reading this great guide here:

http://forum.canardpc.com/threads/28864-FAQ-please-read-before-posting

Regards,

Patrick
Debugger/Reverse Engineer.



 
 

Question Info


Last updated March 24, 2018 Views 507 Applies to: