This site has been retired. For up to date information, see handbook.gnome.org or gitlab.gnome.org.



Problem Recovery and Reporting

An operating system shell should be able to inform the user about problematic or unrecoverable conditions in the system. This is different from error logging. Logging serves a system technician, and its goals and usage scenarios differ significantly: archival, remote operation, aggregation, trending, pattern matching, and report generation. Those are clearly important and useful, but the goals here are much simpler.

Primary Goals

Secondary Goals

Relevant Art

OS X

Fatal system error

http://upload.wikimedia.org/wikipedia/commons/8/8a/Panic10.6.png

Basic Mode

https://lh4.googleusercontent.com/-QgG4u6YbDvU/Tp841I70QEI/AAAAAAAAXcI/o4L5s54NrAM/s400/Screen%2520Shot%25202011-10-19%2520at%25204.29.22%2520PM.jpg https://lh3.googleusercontent.com/-VOj3HJ39Wrs/Tp841f2WdxI/AAAAAAAAXcM/UKvgqm1rfy0/s400/Screen%2520Shot%25202011-10-19%2520at%25204.34.15%2520PM.jpg https://lh4.googleusercontent.com/-NPxkuBsToUQ/Tp8418Q9rnI/AAAAAAAAXcU/A0-ANt12Jrk/s400/Screen%2520Shot%25202011-10-19%2520at%25204.37.37%2520PM.jpg https://lh3.googleusercontent.com/-Amx5_6j1q1I/Tp842EW7-CI/AAAAAAAAXc8/-pninWk7G9k/s640/Screen%2520Shot%25202011-10-19%2520at%25204.37.56%2520PM.jpg

Developer Mode

https://lh3.googleusercontent.com/-CnBQopMqIGI/Tp84y7IYGkI/AAAAAAAAXbk/8c_M2z_nyhc/s400/Screen%2520Shot%25202011-10-19%2520at%25204.40.38%2520PM.jpg https://lh6.googleusercontent.com/--XyDmTHsPJo/Tp84z36H53I/AAAAAAAAXb0/i-YQFhVcWTI/s400/Screen%2520Shot%25202011-10-19%2520at%25204.45.51%2520PM.jpg https://lh3.googleusercontent.com/-nqA3FerYjrg/Tp840BILLbI/AAAAAAAAXc0/21aIrK3pX-E/s640/Screen%2520Shot%25202011-10-19%2520at%25204.46.03%2520PM.jpg

Notes

Windows 7

http://upload.wikimedia.org/wikipedia/en/thumb/1/13/Windows_Error_Reporting_problem_details.png/640px-Windows_Error_Reporting_problem_details.png

Windows 8

Fatal system error

BSOD

Ubuntu

https://wiki.ubuntu.com/Apport?action=AttachFile&do=get&target=apport-gtk-desktopfile.png https://wiki.ubuntu.com/Apport?action=AttachFile&do=get&target=apport-gtk-report.png Details

Fedora

https://lh5.googleusercontent.com/-tYY_l8yORHc/Tpc3MC-9_jI/AAAAAAAAXGw/WJdKZZSWJVY/s640/Screenshot%2520at%25202011-10-13%252014%253A45%253A25.png https://lh5.googleusercontent.com/-QYt6xNWnDkU/Tpc3L8UqbSI/AAAAAAAAXGk/IBIyYnZ0HEg/s640/Screenshot%2520at%25202011-10-13%252014%253A45%253A44.png

Firefox

http://www.squarefree.com/blogimages/crashreportdialog.png

Chrome

Aw Snap

Twitter

http://upload.wikimedia.org/wikipedia/en/d/de/Failwhale.png

Discussion

Layers of the system

The user must not be exposed to any finer-grained detail about the composition of the system.

In general, each type of trouble should be handled by the layer above it. Application problems should be handled by the Shell, Shell problems by Core Services (probably GDM or plymouth), kernel failures by the boot system, and boot system failures by the firmware.
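The escalation rule above can be sketched as a simple lookup. This is only an illustration: the table contains exactly the four pairings named in the paragraph, and the function name `handler_for` is hypothetical.

```python
from typing import Optional

# Which layer handles trouble in which other layer, as listed in the text.
# Layers the text does not mention (e.g. who handles Core Services trouble)
# are deliberately left out rather than guessed.
HANDLER_FOR = {
    "application": "shell",
    "shell": "core services",
    "kernel": "boot system",
    "boot system": "firmware",
}


def handler_for(failed_layer: str) -> Optional[str]:
    """Return the layer responsible for handling trouble in failed_layer.

    Returns None when the text does not name a handler for that layer.
    """
    return HANDLER_FOR.get(failed_layer.strip().lower())


if __name__ == "__main__":
    for layer in ("Application", "Shell", "Kernel", "Boot system", "Firmware"):
        print(layer, "->", handler_for(layer))
```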

Forms of trouble

Crash

An unhandled exception where the process exits abruptly. May be able to generate useful stack traces if debugging symbols are available.

Note: A crash may range from very easy to reproduce and highly visible to very difficult to reproduce, e.g. a crash at startup versus one that happens once in a blue moon. If it is very easy to reproduce, the user will become agitated when the same dialog(s) pop up again and again.
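One possible way to avoid the repeated-dialog agitation described in the note is to suppress a dialog when the same crash was already shown recently. The page does not propose this mechanism; the class name, the notion of a "crash signature" string, and the one-hour quiet period below are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict


class CrashDialogThrottle:
    """Decide whether a crash dialog should be shown again (hypothetical helper)."""

    def __init__(self, quiet_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.quiet_seconds = quiet_seconds  # assumed quiet period
        self.clock = clock                  # injectable for testing
        self._last_shown: Dict[str, float] = {}

    def should_show(self, crash_signature: str) -> bool:
        """Return True if no dialog for this signature was shown recently."""
        now = self.clock()
        last = self._last_shown.get(crash_signature)
        if last is not None and now - last < self.quiet_seconds:
            return False  # same crash seen within the quiet period
        self._last_shown[crash_signature] = now
        return True
```

A crash at startup would then produce one dialog per quiet period instead of one per launch, while a rare crash would still always be shown.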

Misbehavior

The process has stopped responding or done something that it wasn't supposed to. The action may have been denied, or permitted with a warning. This may involve SELinux or similar. This may be a result of misconfiguration by either a technician or a vendor. The two cases may be distinguishable by whether default configuration values were used. Examples include:

Misconfiguration

The process cannot handle the configuration information that it has been provided. This may be in the form of files on disk (/etc) or in a database (dconf). This may be caused by the way the program saves settings, the user, a technician, or a vendor. When caused by non-default values, the remedy is often resetting to the default value. When caused by default values, this can be interpreted as a failure.
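The default versus non-default distinction can be sketched as follows. Settings are modeled as plain dictionaries purely for illustration; this does not use the dconf API, and the function name and return labels are hypothetical.

```python
from typing import Any, Dict, Tuple


def diagnose_configuration(current: Dict[str, Any],
                           defaults: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Suggest a remedy for a process that cannot handle its configuration.

    Returns ("reset-to-defaults", changed_keys) when any value differs from
    the defaults, and ("failure", {}) when the defaults themselves were in use.
    """
    changed = {
        key: value
        for key, value in current.items()
        if key not in defaults or defaults[key] != value
    }
    if changed:
        # Non-default values are present: resetting them is the likely remedy.
        return "reset-to-defaults", changed
    # Only default values were used, so the text treats this as a failure.
    return "failure", {}
```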

Failure

An error where the process has no choice but to bail out because it cannot continue. This differs from a Crash in that the program can often provide a specific reason for why it cannot continue. Stack traces may also be available if the program aborts with SIGABRT. For an application this may include OS version mismatches. For an OS Shell this may include hardware capability mismatches, etc.
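As a rough sketch, a supervising layer could guess the form of trouble from a child's exit status. It relies on the Python `subprocess` convention that a negative return code `-N` means the child was terminated by signal `N` on POSIX. The mapping itself is an assumption, not something the page specifies: SIGABRT and ordinary non-zero exits are treated as a Failure, and any other fatal signal as a Crash. Misbehavior and Misconfiguration cannot be detected this way.

```python
import signal


def classify_exit(returncode: int) -> str:
    """Guess the form of trouble from a subprocess-style return code."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # Negative value: the process was killed by signal number -returncode.
        if -returncode == signal.SIGABRT:
            return "failure"  # deliberate abort; assumed to be a Failure
        return "crash"        # e.g. SIGSEGV; assumed to be a Crash
    # Positive exit code: the program chose to exit with an error.
    return "failure"
```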

Tentative Design

Client Side

It seems that a solution may have a few reporting modes:

Normal Mode

The default mode where the user may be informed of trouble conditions in System or Application, prompted to reset default settings in System or Application, and asked to kindly submit trouble reports. Since sending trouble reports is a secondary goal and resetting the system to an operational state is a primary goal, the sending process must be highly efficient, simple, and clear. The user should not be required to gather details beyond those the system can gather automatically. The user may be permitted to supply additional details about what they were doing at the time, but this should not be required, since it conflicts with primary goals and is very doubtfully useful. The process must not wait for additional debugging information to download. That conflicts with primary goals and is frankly really irritating. The screens shown to the user must not contain technical details like stack traces.
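A minimal sketch of a report that follows these constraints: everything required is gathered automatically, and the user's comment is optional. The field names and the function `build_report` are hypothetical; only the standard `platform` module is used to collect details.

```python
import platform
from typing import Dict, Optional


def build_report(problem_summary: str,
                 user_comment: Optional[str] = None) -> Dict[str, str]:
    """Assemble a trouble report without requiring any input from the user."""
    report = {
        "summary": problem_summary,
        # Details the system can gather on its own.
        "os": platform.system(),
        "os_release": platform.release(),
    }
    comment = (user_comment or "").strip()
    if comment:
        # Optional: what the user was doing at the time.
        report["user_comment"] = comment
    return report
```

Nothing here blocks on downloading debugging information; any symbol resolution would have to happen elsewhere, for example on the server.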

Developer Mode

Same as Normal Mode with these exceptions: the prompts may contain stack trace information, and information about background processes and services may be shown.

Managed Mode

Possibly useful for managed clients or servers. Operation is similar to Normal Mode except that the user is not prompted to submit reports. The user should still be informed of trouble and may be prompted to reset defaults. Administrators may hook into the low-level system to extract details automatically (via push or pull).

We may also be able to use this mode if the user has elected to automatically provide feedback during the initial system setup.

Unattended Mode

Possibly useful for kiosks or similar. Failsafe fallbacks should operate without user intervention. No crash logging or reporting should occur. Failures should not expose system details to passers-by.

Fatal system errors should automatically trigger a restart (up to a certain number of retries).
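The bounded restart could be sketched like this. The limit of three restarts and the "halt" outcome once the limit is reached are assumptions; the page only says "up to a certain number of retries" and does not say what happens afterwards.

```python
def next_action_after_fatal_error(restarts_so_far: int, max_restarts: int = 3) -> str:
    """Decide what an unattended system does after a fatal system error.

    restarts_so_far counts automatic restarts already attempted for this
    failure; max_restarts is an assumed limit.
    """
    if restarts_so_far < max_restarts:
        return "restart"
    # Assumed fallback: stop retrying instead of looping forever.
    return "halt"
```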

Private Mode?

Perhaps there should also be a private mode for when the user is doing something that shouldn't be tracked, such as using a web browser in incognito/private mode. In this mode the system would notify the user of problems but make no attempt to report them.
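The reporting modes described above can be summarized as a small policy table. This is a sketch, not a proposed implementation: the type and field names are hypothetical, and values the text does not state outright are marked as assumptions in the comments.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEVELOPER = "developer"
    MANAGED = "managed"
    UNATTENDED = "unattended"
    PRIVATE = "private"


@dataclass(frozen=True)
class Policy:
    inform_user: bool        # tell the user that something went wrong
    offer_reset: bool        # prompt to reset settings to defaults
    prompt_to_submit: bool   # ask the user to send a trouble report
    show_stack_traces: bool  # include technical details in the prompts


POLICIES = {
    Mode.NORMAL: Policy(True, True, True, False),
    Mode.DEVELOPER: Policy(True, True, True, True),
    Mode.MANAGED: Policy(True, True, False, False),
    # Assumption: a kiosk shows no prompts at all, since fallbacks must work
    # without user intervention and no system details may be exposed.
    Mode.UNATTENDED: Policy(False, False, False, False),
    # The text only says: notify the user, never report. The reset prompt
    # and the absence of stack traces are assumptions.
    Mode.PRIVATE: Policy(True, True, False, False),
}


def policy_for(mode: Mode) -> Policy:
    """Look up the prompting policy for a reporting mode."""
    return POLICIES[mode]
```

Keeping the differences in one table makes it easy to see that Developer Mode differs from Normal Mode only in the technical detail shown, and that Managed and Private Mode differ from Normal Mode in not prompting for submission.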

Guidelines

attachment:ProblemReporting.pdf

Also see Design/Apps/Oops

Fatal System Error

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-normal.png

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-normal-notification.png

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-developer.png

https://github.com/gnome-design-team/gnome-mockups/raw/master/oops/fatal-system-error-unattended.png

Server Side

A crash reporting server should:

Implementation Details

Please see a proposed architecture.

Comments

OlavVitters:

JamesCape:

See Also


2024-10-23 11:03