WebReplay

This page describes the design of a project for deterministic web replay in Firefox that started in September 2015.

Patches are at https://bitbucket.org/bhackett1024/mozilla-web-replay/

Implementation work is tracked in bug 1207696.

The goal of this project is to record information about the execution of a content process, such that the same execution can be replayed later, preserving all observable behaviors, and be rewound and debugged using the browser's JS debugger (and potentially other devtools in the future). Observable behaviors are graphics updates, DOM tree structure, and the execution of and values used by all JS code. There are some restrictions on which executions can be recorded.

Architecture

There are several main components to the project:

The record/replay infrastructure records enough information during recording so that the replayed process can run and produce the same observable behaviors.
IPC integration allows a replaying process to communicate with the chrome process using IPDL and shared memory.
The rewind infrastructure allows a replaying process to restore a previous state, while still maintaining communication with the chrome process.
Debugger integration allows the JS debugger to read the information it needs from a replaying process and control the process's execution (resume/rewind). The debugger is not allowed to change the replaying process's observable state.

Record/replay infrastructure

Broadly, reliable record/replay is achieved by controlling for non-determinism in the browser. This non-determinism originates from two sources: intra-thread and inter-thread. Intra-thread non-deterministic behaviors are non-deterministic even in the absence of actions by other threads. Inter-thread non-deterministic behaviors are those affected by how execution interleaves with other threads, but which always behave the same given the same interleaving.

Intra-thread non-determinism is recorded and later replayed for each thread. These behaviors mainly originate from system calls (I/O and so forth).

Inter-thread non-determinism is handled by first assuming the program is data race free: shared memory accesses which would otherwise race are either protected by locks or use APIs that perform atomic operations. If we record and later replay the order in which threads acquire locks (and, by extension, release locks and use condvars) then accesses on lock-protected memory will occur in the same order. Atomic variables can be handled by treating reads and writes as if they were wrapped by a lock acquire/release during recording.
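The lock-ordering idea can be illustrated with a minimal Python sketch. This is not the Gecko implementation: OrderedLock, run, and the integer thread ids are hypothetical names invented for the example. While recording, the lock logs which thread acquired it; while replaying, each thread waits until it is next in the logged order before acquiring.

```python
import threading


class OrderedLock:
    """Records its acquisition order, or enforces a previously recorded one."""

    def __init__(self, replay_order=None):
        self._lock = threading.Lock()
        self._gate = threading.Condition()
        self._replay_order = replay_order  # None means "recording"
        self._position = 0                 # index of the next replayed acquire
        self.order = []                    # acquisition order seen in this run

    def acquire(self, thread_id):
        if self._replay_order is not None:
            with self._gate:
                # Wait until this thread is next in line in the recording.
                self._gate.wait_for(
                    lambda: self._position < len(self._replay_order)
                    and self._replay_order[self._position] == thread_id)
        self._lock.acquire()
        self.order.append(thread_id)

    def release(self):
        self._lock.release()
        if self._replay_order is not None:
            with self._gate:
                self._position += 1
                self._gate.notify_all()


def run(lock, thread_ids, rounds):
    """Each thread appends to a shared log while holding the lock."""
    log = []

    def worker(thread_id):
        for step in range(rounds):
            lock.acquire(thread_id)
            log.append((thread_id, step))
            lock.release()

    threads = [threading.Thread(target=worker, args=(t,)) for t in thread_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


# Record one execution, then replay it with the same acquisition order.
recording_lock = OrderedLock()
recorded_log = run(recording_lock, [0, 1, 2], 3)
replay_lock = OrderedLock(replay_order=recording_lock.order)
replayed_log = run(replay_lock, [0, 1, 2], 3)

# Any consistent order can be enforced, including one unlikely to occur naturally.
forced_order = [2, 1, 0, 2, 1, 0]
forced_lock = OrderedLock(replay_order=forced_order)
forced_log = run(forced_lock, [0, 1, 2], 2)
```

Because the shared log is only touched while the lock is held, enforcing the acquisition order is enough to make the log contents identical between the two runs. The sketch assumes each thread acquires the lock exactly as often as it did in the recording; a replay that diverges from this would block.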

It is not enough, however, to just record all these non-deterministic sources and reproduce those behaviors exactly in the replaying process. The IPC and debugger integration components are active while replaying, but not while recording. Both of these involve inter-thread communication and calls to non-deterministic APIs, and the resulting non-determinism must be allowed within the replaying process.

Allowed Non-determinism

There is some slop in this design: The replaying process must be non-deterministic enough that the IPC and debugger components work, but not so non-deterministic that observable behaviors are affected. We resolve this slop by minimizing the tolerated non-determinism: The replaying process must be just non-deterministic enough that the IPC and debugger components work. In practice, non-deterministic replay is allowed in the following areas:

Heap allocations can be performed by the IPC and debugger components and so can differ between recording and replay.
JS compilations and some other internal state are affected by the debugger's presence and which hooks/breakpoints are active, and so can differ between recording and replay.
The debugger can allocate GC things, and allocation of other GC things can differ in the debugger's presence. For example, script compilation involves GC thing allocation, and observing changes in an object will change its shape.
The IPC component is generally able to function independently from the rest of the replaying process, except for the subsystem which manages shared memory (possibly Mac only). The behavior of this subsystem differs between recording and replay, due to an extra process being communicated with.

Relaxing non-determinism here has a number of ripple effects in other areas. Mainly, pointer values may differ between recording and replay, and the points where GCs occur, and the set of objects collected, may differ. This non-determinism is prevented from spreading too far with the following techniques:

Different pointer values can affect the internal layout of hash tables. To prevent this from having an effect on iteration order (and execution behavior) in the table, the main table classes (for now PLHashTable and PLDHashTable) are instrumented so that they always iterate over elements in the order they were added when recording or replaying.
Differing GC behavior can cause object finalizers to run at different times. To prevent this from having an effect on execution behavior, GC finalizers which can affect the recording are instrumented so that the finalizer action is performed at the same time in the replay as it was during the recording. Even if the associated JS object has not been destroyed yet during the replay, it will never be used again because at this point in the recording it has been finalized.
Similarly, GC behavior can cause values read from weak pointers to differ between recording and replay. If this difference can affect the recording, the weak pointer must be instrumented so that during replay it holds onto its target for the same duration it was held while recording.
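The hash table technique can be shown with a small Python sketch. It is only an illustration and not the PLHashTable or PLDHashTable instrumentation: InsertionOrderedTable and the integer stand-ins for pointer values are assumptions made for the example, and removal is left out for brevity.

```python
class InsertionOrderedTable:
    """Toy hash table keyed by pointer-like integers.

    Bucket placement depends on the pointer value, so the internal layout
    differs when pointers differ. Iteration in insertion order hides this.
    """

    def __init__(self, bucket_count=8):
        self._buckets = [[] for _ in range(bucket_count)]
        self._insertion_order = []

    def add(self, pointer, value):
        bucket = self._buckets[pointer % len(self._buckets)]
        bucket.append((pointer, value))
        self._insertion_order.append(value)

    def iterate_by_layout(self):
        # What an uninstrumented table would do: walk the buckets.
        return [value for bucket in self._buckets for _, value in bucket]

    def iterate_in_insertion_order(self):
        # What the instrumented table does when recording or replaying.
        return list(self._insertion_order)


# The same three elements are added in the same order, but the allocator
# handed out different addresses in the two processes.
recording = InsertionOrderedTable()
for pointer, value in [(17, "a"), (10, "b"), (3, "c")]:
    recording.add(pointer, value)

replay = InsertionOrderedTable()
for pointer, value in [(23, "a"), (12, "b"), (9, "c")]:
    replay.add(pointer, value)
```

Walking the buckets yields a, b, c in the first table and c, b, a in the second, while insertion-order iteration yields a, b, c in both.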

Recording

The chrome process directly spawns a recording content process. A recording content process differs from a normal content process in the following ways:

Calls to certain functions are intercepted by hooking them (rewriting the machine code at their entry points to call a different function with the same signature), and certain mach messages are intercepted using manual instrumentation. When a call or message is intercepted, the wrapped call/message is first performed normally and then it and its outputs are recorded in a data stream for the thread performing the call. During execution of the wrapped call/message, recording of any inner calls is suppressed, so that they are performed normally without affecting the recording. The functions which are intercepted are generally at the system call and the system library level, i.e. any function which is not compiled as part of Gecko is fair game. There are exceptions for threading library calls, and for library calls that function deterministically without causing replay problems (thus far functions have been hooked on an as-needed basis, though that might not be the best long term strategy).
Acquisition order of locks is recorded in a data stream for each lock. Some locks which are associated with non-deterministic components are not recorded, such as the JS GC and helper threads locks.
Accesses on atomic variables/fields are recorded in a global data stream, as if they were all protected by a global lock. Some atomics are related to non-deterministic components and are excluded from this stream. This is specified by using mozilla::Atomic<AtomicRecording::PreserveNothing> or by calling PR_ATOMIC_INCREMENT_NO_RECORD and so forth.
Some behaviors are changed to simplify recording, in ways that should not affect observable behaviors. For example, incremental GCs (a non-deterministic component) work by posting events to the main thread (a deterministic component), so for now incremental GCs are disabled. Many of these behavioral changes should eventually be removable.
Graphics are rendered using a DrawTargetRecording, with drawing commands forwarded to a normal native draw target.
Some additional instrumentation is performed. This includes the techniques described above under 'Allowed Non-determinism', some changes to allow replay IPC to work more easily, and odds and ends like recording values read from shared memory.

To make it easier to ensure that the non-deterministic components do not have an effect on recorded behavior, certain code regions can be marked as inactive --- while executing them no thread, lock, or atomic events should be recorded. This is done in code associated with the non-deterministic components, such as during GC or Ion compilation, to help track down behaviors that should go unrecorded.
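The interception behavior during recording can be sketched as follows. This is a Python illustration under stated assumptions, not the machine-code hooking described above: Recorder, intercept, read_clock, and read_config are hypothetical names. The wrapped call is performed normally, its output is appended to the calling thread's stream, and recording of inner intercepted calls is suppressed while the outer call runs.

```python
import threading

_thread_state = threading.local()


class Recorder:
    def __init__(self):
        self.streams = {}  # thread name -> list of (function name, result)

    def intercept(self, name, func):
        def wrapper(*args):
            if getattr(_thread_state, "suppressed", False):
                # Inside another intercepted call: run normally, record nothing.
                return func(*args)
            _thread_state.suppressed = True
            try:
                result = func(*args)
            finally:
                _thread_state.suppressed = False
            stream = self.streams.setdefault(
                threading.current_thread().name, [])
            stream.append((name, result))
            return result
        return wrapper


recorder = Recorder()

# Hypothetical library functions; read_config calls read_clock internally.
read_clock = recorder.intercept("read_clock", lambda: 42)
read_config = recorder.intercept("read_config", lambda: read_clock() + 1)

read_config()  # records only the outer call
read_clock()   # records the call, since it is made directly

events = recorder.streams[threading.current_thread().name]
```

After these two calls the stream holds one entry for read_config and one for the direct read_clock call; the read_clock call made from inside read_config leaves no trace.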

Replaying

The chrome process spawns a middleman content process, which in turn spawns a replaying content process. For more on the middleman process, see IPC integration below. A replaying content process differs from a normal content process in the following ways:

The calls and mach messages which were intercepted during recording are also intercepted here. When a call/message is intercepted, the wrapped call/message is not performed, but rather the results of the call/message are read from the data stream and returned to the caller, emulating the behavior of the call/message. This requires the process to be deterministic enough that events play out in the same order on each thread between recording and replay. The data stream should have enough error checking in place that we can immediately detect if the replay has gone out of sync with the recording.
The recorded data streams which specify the acquisition order for each lock are read from and used so that locks are acquired in the same order. When a thread tries to acquire a lock, it first waits until it is next in line to do so.
Similarly, atomic accesses which were included in the recording will occur in the same order during replay, as if they were all protected by a single global lock. Note that while this could potentially be a big drag on performance during both replay and recording, many of the hottest atomics (refcounts, GC counters, and so forth) are associated with non-deterministic components and are not recorded.
The same changes to behavior in a recording process are also present in a replaying process.
As during recording, graphics are rendered using a DrawTargetRecording, except that drawing commands are forwarded to the middleman process using the IPC component.
The same additional instrumentation performed during recording is also performed while replaying.
Threads use file descriptors to wait on locks and notify each other, instead of using the native implementation for locking and condition variables. At least on Mac, pthreads locks/cvars seem to use a mix of process-local and kernel state, and sometimes don't work correctly after rewinding the process.
The IPC, rewind, and debugger components are all active while replaying. See the sections below for details on how these affect the process's behavior.
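The replay side of call interception, including the error checking mentioned above, might look like the following Python sketch. ReplayStream, ReplayOutOfSync, and the function names are assumptions for the example; the real data stream format is not described on this page.

```python
class ReplayOutOfSync(Exception):
    """Raised when the replay asks for an event the recording does not have next."""


class ReplayStream:
    def __init__(self, events):
        self._events = list(events)  # (function name, result) pairs, in order
        self._position = 0

    def intercept(self, name):
        def wrapper(*args):
            # The wrapped function is never called; arguments are ignored here.
            if self._position >= len(self._events):
                raise ReplayOutOfSync(f"{name}: recording is exhausted")
            recorded_name, result = self._events[self._position]
            if recorded_name != name:
                raise ReplayOutOfSync(
                    f"expected {recorded_name}, replay called {name}")
            self._position += 1
            return result
        return wrapper


# A thread replays its events in the recorded order and gets recorded results.
stream = ReplayStream([("get_time", 1442000000), ("read_file", "hello")])
get_time = stream.intercept("get_time")
read_file = stream.intercept("read_file")
first = get_time()
second = read_file("ignored.txt")

# A replay that calls something else first is detected immediately.
diverged = ReplayStream([("get_time", 1)])
try:
    diverged.intercept("read_file")("config.txt")
    detected = False
except ReplayOutOfSync:
    detected = True
```

Checking the function name on every event is one simple form of error checking; recording and comparing a digest of the arguments would catch more divergences at the cost of a larger recording.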

IPC integration

Communication between the chrome process and the replaying process is managed via a separate middleman content process. The replaying process is extended so that it can communicate with the middleman, and does so using a specialized PReplay protocol. Note that while the middleman is for now a separate content process, it could in principle be incorporated into the chrome process (using inter-thread IPDL message transport), which might be worth doing.

Middleman process

The middleman is the same as a normal content process, except that it spawns and communicates with the replaying process, and avoids creating any documents itself. Using the middleman provides the following advantages:

IPDL communication is greatly simplified. The chrome and middleman processes communicate using the standard browser protocols (PContent, PBrowser, etc.) and implementations for their actors, while the middleman and replaying processes communicate with the PReplay protocol, which is tuned to the demands of the replaying process.
The middleman can perform actions that would be extremely difficult to manage in the replaying process without going out of sync with the recording. This currently includes running the devtools content-side scripts, and rendering the drawing commands sent by the replaying process.

Replaying process extensions

During initialization the replaying process spawns the following new threads, which did not exist while recording:

The replay I/O thread communicates with the middleman process.
The replay message loop thread sends messages to and receives messages from the middleman via the replay I/O thread.
The shared memory mechanism spawns a thread to manage memory shared with the middleman (possibly Mac only).

Each of these threads has an analogue in the original recording: the original main thread handled the message loop for the original I/O thread, and an original shared memory thread managed communication with the original chrome process. During the replay each of these original threads still exists, but they do not perform any actual IPC; they simply perform emulated calls which read data from the recording. The code running the original and the replay-specific threads is the same. To allow one thread to perform actual IPC while the other reads from the recording, a new mechanism is used: threads can suppress thread events in specific regions during replay, so that calls to intercepted functions are passed through directly to the wrapped function.

PReplay protocol

All communication with the replaying process is managed via the PReplay protocol. Only a single actor and a single block of shared memory are used, which are created during initialization; this restriction is very helpful for managing state during rewinding (see section below). The protocol has specialized messages for the things which the replaying process is able to do independently from the recording; currently this includes sending graphics updates, taking and restoring process snapshots, and responding to debugger requests.

Rewinding infrastructure

The replaying process can be rewound to an earlier state in response to a message from the middleman. This is done by periodically taking memory snapshots during execution, and then later restoring them. Care must be taken for efficiency when taking/restoring snapshots and for managing system resources when restoring.

Snapshots

Late in process initialization the first snapshot is taken, which is simply a copy of the stacks/registers for each thread. Each subsequent snapshot includes copies of thread stacks/registers as well as a diff of heap and static memory vs. the previous snapshot. All snapshot data is stored on disk. Diffs are computed by first setting up an exception handler thread (mac only) very similar to the one used by asm.js. When taking the first snapshot all addressable memory in the process is enumerated and write-protected, and as faults occur the exception handler thread unprotects the memory, copies its contents and marks it as dirty. When the next snapshot is taken, only the dirty memory is examined for any changes vs. the copy made.

This mechanism requires intercepting mmap (or similar low level allocation functions) so that any newly addressable memory is known --- anonymous mappings would not otherwise be intercepted or included in the recording, as heap allocation is non-deterministic while replaying. mprotect is intercepted and nop'ed to avoid interference with the dirty memory mechanism, and munmap is intercepted with no actual unmapping performed, so that memory does not need to be remapped when restoring a snapshot.

All snapshots are taken from the main thread. Before taking the snapshot, all other threads must enter an idle state, waiting on a thread-specific condition variable. Threads which were part of the recording enter this state whenever they are waiting to acquire a lock or perform an atomic action, while each replay-specific thread (other than the dirty memory exception handler thread, which is not affected by taking or restoring snapshots) has a mechanism that the main thread can call to force it into the needed idle state. The main thread then computes the memory diff and reads each thread's stack and register state (which each thread recorded by calling setjmp before idling).

Restoring snapshots is also done from the main thread. As when taking a snapshot, all other threads enter an idle state. The dirty memory information computed since the last snapshot was taken is used to restore the heap to the state at that last snapshot, and then the memory diffs for each snapshot can be used to restore an earlier (or later) snapshot. Threads are individually responsible for restoring their stacks; when they wake up from the idle state they see the main thread has prepared a new stack to restore to, so they longjmp to the new register state and copy in the new stack's contents.
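The dirty-memory bookkeeping behind snapshots can be modeled in a few lines of Python. This is a simplified simulation, not the real mechanism: SnapshotMemory is a hypothetical class, the write method stands in for the write fault taken on a protected page, diffs are kept in memory rather than on disk, thread stacks and registers are ignored, and restoring only moves backward (later snapshots are dropped).

```python
class SnapshotMemory:
    PAGE_SIZE = 4

    def __init__(self, data):
        size = self.PAGE_SIZE
        self.pages = [bytearray(data[i:i + size])
                      for i in range(0, len(data), size)]
        self._clean_copies = {}  # dirty page index -> contents at last snapshot
        self._diffs = []         # per snapshot: {page index: old contents}

    def write(self, page, offset, value):
        # First write to a page since the last snapshot: copy it, mark dirty.
        if page not in self._clean_copies:
            self._clean_copies[page] = bytes(self.pages[page])
        self.pages[page][offset] = value

    def take_snapshot(self):
        # Only dirty pages are compared against their saved copies.
        diff = {page: old for page, old in self._clean_copies.items()
                if bytes(self.pages[page]) != old}
        self._diffs.append(diff)
        self._clean_copies = {}
        return len(self._diffs)  # snapshot id; 0 is the initial state

    def restore(self, snapshot_id):
        # Undo writes made since the last snapshot, then walk diffs backward.
        for page, old in self._clean_copies.items():
            self.pages[page][:] = old
        self._clean_copies = {}
        while len(self._diffs) > snapshot_id:
            for page, old in self._diffs.pop().items():
                self.pages[page][:] = old

    def contents(self):
        return b"".join(bytes(page) for page in self.pages)


memory = SnapshotMemory(bytes(8))  # two zeroed pages
memory.write(0, 0, 1)
first = memory.take_snapshot()
memory.write(1, 2, 7)
second = memory.take_snapshot()
memory.write(0, 1, 9)              # not yet covered by any snapshot

memory.restore(first)
state_at_first = memory.contents()
memory.restore(0)
state_at_start = memory.contents()
```

Restoring to the first snapshot discards both the unsnapshotted write and the change captured by the second snapshot, leaving only the byte written before the first snapshot was taken.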

Managing system resources

When the replaying process restores a snapshot, the state of any system resources it has open is unchanged. Care must be taken to make sure these resources are coherent to the process after the restore completes. This is done in the following ways:

Instead of creating or destroying threads on demand, while replaying all threads which will be needed are created during process initialization (we know how many will be created using the recording). These threads idle until the replay tries to 'create' them, then they run their main function, and after completing it they idle indefinitely. This ensures that no matter when we create or restore a snapshot, the same set of threads will exist and will have consistent stacks.
Locks and condition variables are to some extent system resources, and to avoid problems we make sure each thread is waiting on a consistent variable when saving or restoring a snapshot (see section above).
The record/replay mechanism has open file descriptors for the recorded data streams it is reading from; each of these needs to seek to the correct point after restoring.
IPC integration requires various system resources, mainly open file descriptors and shared memory. These are left alone when restoring a snapshot, so whenever saving or restoring a snapshot they should be in a consistent state for the replaying process. We accomplish this by setting up all needed resources early on, before taking the first snapshot, and only taking or restoring a snapshot when the IPC threads are in a specific idle state. This is the reason why the PReplay protocol uses a single actor and a single block of shared memory.

Debugger integration

When debugging a normal content process, the devtools JS debugger runs quite a bit of JS code in the content process, communicating with the chrome process primarily through streams of JSON data. When debugging a replaying content process, this JS code runs in the middleman process. When the code creates a Debugger object, that Debugger provides information about the replaying process rather than the current (middleman) process. While the Debugger can indicate it is for a replaying process, the interface should be as transparent as possible to the devtools JS code; the Debugger can still create script/object/etc. debug objects, which refer to specific things in the replaying process.

As with the devtools JS code, this Debugger lives in the middleman process, and instead of wrapping things from another compartment the debug objects hold heap structures with information about some thing in the replaying process. The Debugger can explore the heap by issuing IPDL queries to the replaying process to fill in the contents of the debug objects it creates. Whenever the Debugger is interacting with the replaying process the replaying process is paused at some point in execution, and the contents of the debug objects are only valid until the middleman notifies the replaying process that it can resume forward execution or must rewind to an earlier snapshot. When the replaying process pauses again (at a breakpoint, say) the debug objects must be reconstructed.

There is an exception to this, for scripts and script source objects; debug objects for these will continue to hold the same referent after resuming or rewinding the replaying process. This is necessary for script breakpoints to work, and is implemented by ensuring that the ordering of creation of scripts and script sources is deterministic (mainly by disabling off thread parsing, which is one of the behavior changes during recording/replay).
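The lifetime rules for debug objects can be sketched in Python as follows. All names here (ReplayingProcessStub, DebugObject, survives_resume, StaleDebugObjectError) are hypothetical, and the stub's query method merely stands in for an IPDL query to the paused replaying process.

```python
class StaleDebugObjectError(Exception):
    pass


class ReplayingProcessStub:
    """Stands in for the replaying process; answers queries while paused."""

    def __init__(self, heap):
        self._heap = heap
        self.pause_count = 0

    def resume_or_rewind(self):
        # After this, the process pauses at a new point in its execution.
        self.pause_count += 1

    def query(self, thing_id):
        return dict(self._heap[thing_id])


class DebugObject:
    """Middleman-side structure describing one thing in the replaying process."""

    def __init__(self, process, thing_id, survives_resume=False):
        self._process = process
        self._thing_id = thing_id
        self._survives_resume = survives_resume  # scripts and script sources
        self._created_at = process.pause_count
        self._contents = None

    def contents(self):
        stale = self._created_at != self._process.pause_count
        if stale and not self._survives_resume:
            raise StaleDebugObjectError(self._thing_id)
        if self._contents is None:
            # Filled in lazily, on first use.
            self._contents = self._process.query(self._thing_id)
        return self._contents


process = ReplayingProcessStub({"object1": {"x": 1},
                                "script1": {"url": "a.js"}})
plain_object = DebugObject(process, "object1")
script = DebugObject(process, "script1", survives_resume=True)

before_resume = plain_object.contents()
process.resume_or_rewind()

try:
    plain_object.contents()
    object_went_stale = False
except StaleDebugObjectError:
    object_went_stale = True

script_after_resume = script.contents()
```

The script debug object stays usable across the resume because its referent is identified deterministically, while the ordinary object must be reconstructed at the next pause.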

The user's interface to the devtools for a replaying process is the same as for a normal content process, except that new UI buttons are added for rewinding (find the last time a breakpoint was hit), and for reverse step/step-in/step-out. For now only JS state can be inspected by the debugger, though extending this to cover DOM inspection and other devtools features should not be too hard.

Unrecordable executions

There are restrictions on the executions that can be recorded. These should all be detectable during recording, so that we don't attempt to replay an execution we know will not match up with the recording. The following executions run into fundamental limits of the approach and cannot be replayed:

Executions which throw overrecursion JS exceptions can't be reliably replayed; overrecursion happens at different times depending on how scripts are compiled, which can vary between recording and replaying.
Similarly, executions which run out of memory at some point can't be reliably replayed.
Executions which are stopped at some point by the slow script dialog can't be reliably replayed. Keeping track of the exact point where an interrupt occurred would require quite a bit of recording overhead, and it doesn't seem worth it to try to do this.

The following executions are unlikely to be supported by the initial release, but it should be possible to handle them at some point in the future:

On x64, asm.js code relies on mprotect to handle out of bounds heap accesses; mprotect works differently while replaying, so some cooperation will be needed between the asm.js exception handler and the dirty memory exception handler.
Shared array buffers can be used by web content to introduce data races to the browser on the contents of those buffers, going against a fundamental assumption of the record/replay infrastructure. Recording and replaying executions using these buffers will require new techniques like treating all accesses on the buffers as atomic (probably unacceptable overhead) or performing all accesses on the buffer on a single core and keeping track of context switches.
DOM workers are not supported yet. For simplicity, debugger integration is currently only able to handle JS code that runs on the main thread.
WebGL is not supported yet, as it uses a pretty different rendering path from normal web content.
Media elements are not supported yet, as many of these run third party multithreaded code which hasn't been tested with the infrastructure.

TODOs

The current implementation is a prototype and needs more work before becoming useful and reliable. This section has various things that either should be done or would be nice to do, in no particular order.

The implementation works on some substantial web pages, like GMail, but will need more work to make sure it robustly functions on a wide range of web pages.
It would be nice if certain events could be posted to the main thread which are out of band with the recording, and can occur at different times during recording and replay. This would allow incremental GC and CC to occur normally in the recording and replaying processes.
Spawning all threads up front while replaying could be a problem if the process spawns many short lived threads. This hasn't been a problem so far, but if it becomes one it could be handled better by using a thread pool during both recording and replay whose size is the maximum number of live threads.
Currently the slow script dialog is just disabled during recording. This should change, with the recording invalidated if the user uses the dialog to stop a script on the page.
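For the thread pool item above, the pool size could be derived from the recording's thread lifetime events. The following Python sketch assumes a hypothetical event list of ("create", id) and ("exit", id) pairs in recorded order; the recording format itself is not specified on this page.

```python
def max_live_threads(events):
    """Return the peak number of simultaneously live threads."""
    live = 0
    peak = 0
    for kind, _thread_id in events:
        if kind == "create":
            live += 1
        elif kind == "exit":
            live -= 1
        peak = max(peak, live)
    return peak


thread_events = [
    ("create", "a"), ("create", "b"), ("exit", "a"),
    ("create", "c"), ("create", "d"), ("exit", "b"),
]
pool_size = max_live_threads(thread_events)
```

Four threads are created in this example, but at most three are alive at once, so a pool of three would suffice.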

Porting

Almost all implementation work so far has been on OS X 10.9. Windows port work is underway, but is not yet working. The difficulties are in figuring out the set of system library APIs to intercept, in getting the memory management and dirty memory parts of the rewind infrastructure to work, and in handling the different graphics and IPC pathways on different platforms.

Comparison with other projects

There is a lot of existing work in this area. The closest projects are rr, WebKit's replay project, and Microsoft's Time-Travel Debugger. Compared to rr:

This should work on all platforms and architectures supported by Gecko, though with substantial port work required.
This will be part of Gecko itself, rather than a separate tool, which means both that developers won't need additional software to use it and that this can't be used to debug other software.
This can use multiple cores during recording and replay.
This does not preserve exact behavior. Context switches can occur at different times and data races can lead to different behavior between recording and replay. Data races are bugs in and of themselves, however, so this sort of non-determinism should be fixed regardless.
This design allows the replaying process to behave differently from the recording process, which allows for a fairly straightforward implementation of IPC and Debugger integration.

Microsoft's and WebKit's replay projects operate at a higher level than rr. Inputs to the browser and non-deterministic behaviors are recorded so that they can be replayed later. In Microsoft's project the abstraction layer appears to be the boundary between the JS engine and the rest of the browser, while in WebKit's project the layer appears to be at internal WebKit APIs that can cause JS to run or the behavior of JS code to vary.

Broadly, all of these projects sit on a spectrum: at what level is the boundary between components whose behavior is recorded and the rest of the system? rr records all behavior in the user space of a process; the boundary is the system calls which are made into the kernel. This project records all behavior outside of system library calls which the process makes, with exceptions carved out for the allowed non-determinism and for draw targets. Microsoft's and WebKit's projects record a smaller subset of the browser's behavior.

This project is at a good point on this spectrum. Compared to a higher level project, this is able to operate on stable, thoroughly documented library APIs. By focusing on intercepting these APIs, browser instrumentation, recording overhead, and the maintenance burden going forward are all minimized. Compared to a lower level project, this is able to tolerate more non-determinism. All code whose behavior is recorded is compiled into Gecko (rather than being part of immutable, usually closed-source libraries) and can be lightly modified to deal with behaviors that function intercepting cannot handle, such as varying hash table layouts, the ordering of atomic accesses, and reads from shared memory.

Appendix: Debugger Details

Here is some more detailed information about how a recording/replaying process affects the debugger, and options for future improvements.

Starting record/replay

Recorded/replayed executions are different from live executions. Whether a process is recording or replaying (or neither, i.e. live) is specified when the process is created, and does not change later. This is (currently) done by adding a 'recordExecution' or 'replayExecution' property to the options object passed to tabbrowser.addTab(). These properties specify a directory where the recording is stored. When recording finishes, it should be possible to copy the directory to another computer with the same operating system and replay it there (this functionality is thus far untested).

While the current design is nice in some respects (e.g. users can record a bug, then a developer can replay it on their computer later), it's pretty awkward for what I think will be a common use case: the developer finds a bug, then wants to rewind immediately to investigate what caused it. Currently, the workflow for this would require the developer to start a special recording tab, close the recording tab after the bug is encountered, reload the recording as a replaying tab, fast forward to the point where the bug occurs, and then finally rewind from there for investigation.

If we could transition a tab from recording to replaying then this workflow would be a lot nicer. Then a developer could start a recording tab, encounter the bug, hit the rewind button (cause the tab to switch to a replaying tab) and investigate what caused the bug.

Doing this transition is not currently possible, but it is functionality I would like to add, and it should be implementable without a huge amount of work. If it were implemented, there would be a new consideration for the UI: when a user starts a recording tab, should they always be able to rewind later, or should they be able to specify that they are only recording and do not need to rewind? The former is a simpler interface, but being able to rewind adds overhead to compute memory snapshots. This overhead isn't bad on a powerful computer, but will still be far more than the overhead of pure recording (i.e. gigabytes of memory snapshots on disk vs. a few megabytes of recording data).

Other transitions are possible, like replaying -> recording (i.e. you record, rewind some, then fast forward to the spot where you started rewinding and resume recording) or live -> recording (i.e. you start recording on a tab that is already open). These are more speculative than recording -> replaying, though, and would be less useful for users I think.

Debugger changes

From the perspective of a devtools server, debugging a replaying process is very similar to debugging a live process. When execution is paused, the Debugger JS object and its various child objects can be used to inspect the execution state in the same way for either kind of process. Here are the main differences:

Explicit commands must be sent to the debugger to control execution. The replayResumeBackward() and replayResumeForward() members may be called to resume execution, and the replayPause() member may be called to pause execution at the next opportunity. A replaying process can only pause at breakpoints and at snapshot points (currently these only happen when graphics updates are performed, at which point there is no JS on the stack).
There is a new onPopFrame handler, which is needed when doing reverse-step-in operations.
Operations on the debuggee which require interaction with the system will fail. These operations may be property accesses, evals, or object calls, and an example is accessing the "font" property of a CanvasRenderingContext2D. Failed operations currently just produce a placeholder "INCOMPLETE" result.
Operations on the debuggee which have side effects --- eval("x.f = 3") --- should be avoided. When the process resumes forward or backward these side effects will be lost (the process reverts to its original state) and while paused at a breakpoint these effects can cause some strange behavior --- after the above eval, getting the x.f property directly could produce a different value from eval("x.f"). While the strange behavior could be fixed (it's due to caching) it would be good to prevent or at least discourage users from performing such effectful operations.
The underlying object of a Debugger.Object is inaccessible; object.unsafeDereference is null. This can be fixed; see below.
As described above under "Debugger integration", child objects besides scripts and script sources become invalid when the debugger resumes execution, and must be reconstructed each time the replaying process pauses.

Inspecting a replaying process

Access to JS objects in the replaying process is currently only done through the JS Debugger interface --- Debugger.Object, Debugger.Frame.eval, and so forth. This seems to be fine for the debugger devtools panel (except the unsafeDereference thing above) but I don't know how useful this will be for other devtools panels. It would be possible to access objects in the replaying process more directly, by giving the debugger server proxies whose accesses are forwarded to their wrapped object in the replaying process. Some of the same caveats apply here as for the Debugger: accesses on the proxies which require interacting with the system will fail, and the proxies will become invalid when the replaying process resumes execution.