Updating Windows 10: an Idea for Diagnostics

I’ve been fighting an issue with getting the latest update from the Windows 10 Insider program.

Not true. I’ve been having this issue for months, across several of the latest updates.

In fact, I went so far as to do a complete wipe and re-install of Windows 10 on my system just to try to address this, to no avail.

What happens is that the download succeeds, then Update tells me it wants to restart and update, which I let it do. It gets as far as the 32% complete point, and then the computer hangs, forcing me to do a cold reset (hold down the power button for 5 seconds to turn it off). When it boots, it “reinstalls” the previous version of Windows, and then tries it again the next day.

Currently, this is happening with build 15048, on the Fast ring of Windows Insider.

The hottest search result, Windows 10 Build 10130 Upgrade Stuck at 32% – Microsoft, suggests that I need to unplug all peripherals, run a few things to clear out the detritus of the Update system, and try again.

I have done those things, several times. I’ve run the update with every network connection disabled.

Nothin’.

I’ll be posting some info to the Microsoft support forums, but here’s what’s in my head:

Why are the logs for these things in so many places and so difficult to traverse to find the root cause of a problem?

They’re in .log files, plain text. Lots of different formats for these things.

Do you know what would be awesome? Something that I thought would, by the year 2017, already be in place?

A way to do a temporary or semi-permanent diagnostic record of a program execution that could be easily used to discover what the hell happened when a program tried to run.

Yes, I know all about ETW and the fun things tied to it, like Event Viewer. While ETW is fantastic and my first choice for all things diagnostic, it can also be a quagmire. That’s because, beyond a certain level of structure, Microsoft and ETW don’t really establish a taxonomy of logging events that can help simplify the process of untangling the series of events that have been recorded.

What is needed is something like this:

  1. A long-running process begins a log book in ETW, and, in the process, acquires a unique context identifier.
  2. As the process invokes sub-processes, a child context identifier is created based upon the parent’s context identifier.
  3. All entries in the log book contain the appropriate context identifier. Keep in mind, sometimes the context identifier will cross services and processes. For example, the setup program might require a background service to start and begin to process something. That background service might be processing other requests from other processes or services, so the activities that are specific to the setup program’s request should all contain the context identifier supplied by the setup program.
  4. Most of the log book entries will contain the standard severity level indicators, like “information”, “tracing”, “warning”, “error”, or “critical.” However, whenever a sub-process reaches a conclusion, it gets a special entry to indicate whether the execution:
    1. successful without any issues
    2. successful with issues (warnings, compensated conditions, etc.)
    3. failed due to detectable issue (security, corrupt data, non-availability of a resource, etc.)
    4. failed due to unknown issue (timeout, hardware failure, power outage, unexpected service/process termination, etc.)
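To make the idea a bit more concrete, here is a minimal sketch in Python of what such a log book might look like. It is an in-memory toy, not ETW: the names `LogBook`, `Context`, and `Outcome`, the dotted identifier scheme (`setup.1`, `setup.2`), and the sample messages are all made up for illustration.

```python
import enum
import itertools
from dataclasses import dataclass
from typing import List, Optional


class Outcome(enum.Enum):
    """The four conclusions from the list above (names are hypothetical)."""
    SUCCESS = "successful without any issues"
    SUCCESS_WITH_ISSUES = "successful with issues"
    FAILED_DETECTABLE = "failed due to detectable issue"
    FAILED_UNKNOWN = "failed due to unknown issue"


@dataclass
class Entry:
    context_id: str                     # every entry carries its context
    severity: str                       # "information", "warning", "error", ...
    message: str
    outcome: Optional[Outcome] = None   # set only on conclusion entries


class LogBook:
    """In-memory stand-in for the log book; a real one would write to ETW."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def begin(self, name: str) -> "Context":
        # Step 1: the long-running process acquires a root context identifier.
        return Context(self, name)


class Context:
    def __init__(self, book: LogBook, context_id: str) -> None:
        self.book = book
        self.context_id = context_id
        self._next_child = itertools.count(1)

    def child(self) -> "Context":
        # Step 2: a child identifier is derived from the parent's identifier.
        return Context(self.book, f"{self.context_id}.{next(self._next_child)}")

    def log(self, severity: str, message: str) -> None:
        # Step 3: ordinary entries are tagged with the context identifier.
        self.book.entries.append(Entry(self.context_id, severity, message))

    def conclude(self, outcome: Outcome, message: str = "") -> None:
        # Step 4: the special entry recording how this sub-process ended.
        self.book.entries.append(
            Entry(self.context_id, "conclusion", message, outcome)
        )


# Usage with made-up steps and messages.
book = LogBook()
setup = book.begin("setup")

download = setup.child()                 # context "setup.1"
download.log("information", "download started")
download.conclude(Outcome.SUCCESS)

install = setup.child()                  # context "setup.2"
install.log("warning", "step is taking longer than expected")
install.conclude(Outcome.FAILED_UNKNOWN, "no response from step")

for entry in book.entries:
    print(entry.context_id, entry.severity, entry.message, entry.outcome)
```

A background service handling the setup program’s request would simply be handed the `Context` (or just its identifier) and log under it, which is how the same identifier could cross process and service boundaries.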

At least, that’s a start. I’m sure it could be refined.

With metadata like this added into the ETW entries, it should be relatively simple to create a kind of tree that shows all the processes and subprocesses that led to a failed process (especially the installation of an OS update), and get a much better idea how to fix it.
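Here is a rough sketch of that tree-building step in Python. It assumes the conclusion entries have already been exported as `(context_id, outcome)` pairs, that identifiers are dotted so a child’s identifier extends its parent’s (for example `update.2.1` under `update.2`), and that failed outcomes start with the word `failed`. The records themselves are invented sample data, not real Windows Update output.

```python
from collections import defaultdict

# Hypothetical conclusion entries: (context identifier, outcome).
records = [
    ("update", "failed-unknown"),
    ("update.1", "success"),
    ("update.2", "failed-unknown"),
    ("update.2.1", "success-with-issues"),
    ("update.2.2", "failed-detectable"),
]


def build_tree(records):
    """Group context identifiers under their parents."""
    children = defaultdict(list)
    outcomes = {}
    for context_id, outcome in records:
        outcomes[context_id] = outcome
        # "update.2.1" -> parent "update.2"; a root has no dot, so no parent.
        parent, _, _ = context_id.rpartition(".")
        if parent:
            children[parent].append(context_id)
    return children, outcomes


def render(node, children, outcomes, depth=0):
    """Return the tree as indented lines, one per context."""
    outcome = outcomes.get(node, "no conclusion recorded")
    lines = [f"{'  ' * depth}{node}: {outcome}"]
    for child in children[node]:
        lines.extend(render(child, children, outcomes, depth + 1))
    return lines


def deepest_failures(node, children, outcomes):
    """Follow failed children downward to the contexts where failure began."""
    failed = [
        c for c in children[node]
        if outcomes.get(c, "").startswith("failed")
    ]
    if not failed:
        return [node]
    found = []
    for child in failed:
        found.extend(deepest_failures(child, children, outcomes))
    return found


children, outcomes = build_tree(records)
print("\n".join(render("update", children, outcomes)))
print("Start looking at:", deepest_failures("update", children, outcomes))
```

The point of `deepest_failures` is that it skips everything that succeeded and walks straight down to the innermost sub-process that failed, which is exactly the part I can’t easily find in a pile of .log files.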

Author

Alan McBee
