<< Prev | - Up - | Next >> |
Distributed systems have the partial failure property, that is, part of the system can fail while the rest continues to work. Partial failures are not at all rare. Properly-designed applications must take them into account. This is both good and bad for application design. The bad part is that it makes applications more complex. The good part is that applications can take advantage of the redundancy offered by distributed systems to become more robust.
The Mozart failure model defines what failures are recognized by the system and how they are reflected in the language. The system recognizes permanent site failures that are instantaneous and both temporary and permanent communication failures. The permanent site failure mode is more generally known as fail-silent with failure detection, that is, a site stops working instantaneously, does not communicate with other sites from that point onwards, and the stop can be detected from the outside. The system provides mechanisms to program with language entities that are subject to failures.
The Mozart failure model is accessed through the module Fault
. This chapter explains and justifies this functionality, and gives examples showing how to use it. We present the failure model in two steps: the basic model and the advanced model. To start writing fault-tolerant applications it is enough to understand the basic model. To build fault-tolerant abstractions it is often necessary to use the advanced model.
In its current state, the Mozart system provides only the primitive operations needed to detect failure and reflect it in the language. The design and implementation of fault-tolerant abstractions within the language by using these primitives is the subject of ongoing research. This chapter and the next one give the first results of this research. All comments and suggestions for improvements are welcome.
All failure modes are defined with respect to both a language entity and a particular site. For example, one would like to send a message to a port from a given site. The site may or may not be able to send the message. A language entity can be in three fault states on a given site:
The entity works normally (local fault state ok
).
The entity is temporarily not working (local fault state tempFail
). This is because a remote site crucial to the entity is currently unreachable due to a network problem. This fault state can go away. A limitation of the current release is that temporary problems are indicated only after a long delay time.
The entity is permanently not working (local fault state permFail
). This is because a site crucial to the entity has crashed. This fault state is permanent.
If the entity is currently not working, then it is guaranteed that the fault state will be either tempFail
or permFail
. The system cannot always determine whether a fault is temporary or permanent. In particular, a tempFail
may hide a site crash. However, network failures can always be considered temporary since the system actively tries to reestablish another connection.
The fault state tempFail
exists to allow the application to react quickly to temporary network problems. It is raised by the system as soon as a network problem is recognized. It is therefore fundamentally different from a time-out. For example, TCP gives a time-out after some minutes. This duration has been chosen to be very long, approximating infinity from the viewpoint of the network connection. After the time-out, one can be sure that the connection is no longer working.
The purpose of tempFail
is quite different from a time-out. It is to inform the application of network problems, not to mark the end of a connection. For example, an application might be connected to a given server. If there are problems with this server, the application would like to be informed quickly so that it can try connecting to another server. A tempFail
fault state will therefore be relatively frequent, much more frequent than a time-out. In most cases, a tempFail
fault state will eventually go away.
It is possible for a tempFail
state to last forever. For example, if a user disconnects the network connection of a laptop machine, then only he or she knows whether the problem is permanent. The application cannot in general know this. The decision whether to continue waiting or to stop the wait can cut through all levels of abstraction to appear at the top level (i.e., the user). The application might then pop up a window to ask the user whether to continue waiting or not. The important thing is that the network layer does not make this decision; the application is completely free to decide or to let the user decide.
The local fault states ok
, tempFail
, and permFail
say whether an entity operation can be performed locally. An entity can also contain information about the fault states on other sites. For example, say the current site is waiting for a variable binding, but the remote site that will do the binding has crashed. The current site can find this out. The following remote problems are identified:
At least one of the other sites referencing the entity can no longer perform operations on the entity (fault state remoteProblem(permSome)
). The sites may or may not have crashed.
All of the other sites referencing the entity can no longer perform operations on the entity (fault state remoteProblem(permAll)
). The sites may or may not have crashed.
At least one of the other sites referencing the entity is currently unreachable (fault state remoteProblem(tempSome)
).
All of the other sites referencing the entity are currently unreachable (fault state remoteProblem(tempAll)
).
All of these cases are identified by the fault state remoteProblem(I)
, where the argument I
identifies the problem. A permanent remote problem never goes away. A temporary remote problem can go away, just like a tempFail
.
Even if there exists a remote problem, it is not always possible to return a remoteProblem
fault state. This happens if there are problems with a proxy that the owner site does not know about. This also happens if the owner site is inaccessible. In that case it might not be possible to learn anything about the remote sites.
The complete fault state of an entity consists of at most one element from the set {tempFail
, permFail
} together with at most two elements from the set {remoteProblem(permSome)
, remoteProblem(permAll)
, remoteProblem(tempSome)
, remoteProblem(tempAll)
}. Permanent remote problems mask temporary ones, i.e., if remoteProblem(permSome)
is detected then remoteProblem(tempSome)
cannot be detected. If a (temporary or permanent) problem exists on all remote sites (e.g., remoteProblem(permAll)
) then the problem also exists on some sites (e.g., remoteProblem(permSome)
).
We present the failure model in two steps: the basic model and the advanced model. The simplest way to start writing fault-tolerant applications is to use the basic model. The basic model allows to enable or disable synchronous exceptions on language entities. That is, attempting to perform operations on entities with faults will either block or raise an exception without doing the operation. The fault detection can be enabled separately on each of two levels: a per-site level or a per-thread level (see Section 4.2.4).
Exceptions can be enabled on logic variables, ports, objects, cells, and locks. All other entities, e.g., records, procedures, classes, and functors, will never raise an exception since they have no fault states (see Section 4.4.1). Attempting to enable an exception on such an entity is allowed but has no observable effect.
The advanced model allows to install or deinstall handlers and watchers on entities. These are procedures that are invoked when there is a failure. Handlers are invoked synchronously (when attempting to perform an operation) and watchers are invoked asynchronously (in their own thread as soon as the fault state is known). The advanced model is explained in Section 4.3.
By default, new entities are set up so that an exception will be raised on fault states tempFail
or permFail
. The following operations are provided to do other kinds of fault detection:
fun {Fault.defaultEnable FStates}
sets the site's default for detected fault states to FStates
. Each site has a default that is set independently of that of other sites. Enabling site or thread-level detection for an entity overrides this default. Attempting to perform an operation on an entity with a fault state in the default FStates
raises an exception. The FStates
can be changed as often as desired. When the system starts up, the defaults are set up as if the call {Fault.defaultEnable [tempFail permFail]}
had been done.
fun {Fault.defaultDisable}
disables the default fault detection. This function is included for symmetry. It is exactly the same as doing {Fault.defaultEnable nil}
.
fun {Fault.enable Entity Level FStates}
is a more targeted way to do fault detection. It enables fault detection on a given entity at a given level. If a fault in FStates
occurs while attempting an operation at the given level, then an exception is raised instead of doing the operation. The Entity
is a reference to any language entity. Exceptions are enabled only if the entity is an object, cell, port, lock, or logic variable. The Level
is site
, 'thread'(this)
(for the current thread), or 'thread'(T)
(for an arbitrary thread identifier T
).1 More information on levels is given in Section 4.2.4.
fun {Fault.disable Entity Level}
disables fault detection on the given entity at the given level. If a fault occurs, then the system does nothing at the given level, but checks whether any exceptions are enabled at the next higher level. This is not the same as {Fault.enable Entity Level nil}
, which always causes the entity to block at the given level.
The function Fault.enable
returns true
if and only if the enable was successful, i.e., the entity was not already enabled for that level. The function Fault.disable
returns true
if and only if the disable was successful, i.e., the entity was already enabled for that level. The functions Fault.defaultEnable
and Fault.defaultDisable
always return true
. At its creation, an entity is not enabled at any level. All four functions raise a type error exception if their arguments do not have the correct type.
A logic variable can be declared before it is bound. What happens to its enabled exceptions when it is bound? For example, let's say variable V
is enabled with FS_v
and port P
is enabled with FS_p
. What happens after the binding V=P
? In this case, the binding gives P
, which keeps the enabled exceptions FS_p
. The enabled exceptions FS_v
are discarded.
The following cases are possible. We assume that variable V
is enabled with fault detection on fault states FS_v
.
V
is bound to a nonvariable entity X
that has no enabled exceptions. In this case, the enabled exceptions FS_v
are transferred to X
.
V
is bound to a nonvariable entity X
that already has enabled exceptions FS_x
. In this case, X
keeps its enabled exceptions and FS_v
is discarded.
V
is bound to another logic variable W
that might have enabled exceptions. In this case, the resulting variable keeps one set of enabled exceptions, either FS_v
or FS_w
(if the latter exists). Which one is not specified.
These cases follow from three basic principles:
A logic variable that is "observed", e.g., it has fault detection with enabled exceptions, will be "observed" at all instants of time. That is, it will keep some kind of fault detection even after it is bound.
A nonvariable entity is never bothered by being bound to a variable. That is, the nonvariable's fault detection (if there is any) can only be modified by explicit commands from Fault
, never from being bound to a variable.
Any language entity that is set up with a set of enabled exceptions will have exactly one set of enabled exceptions, even if it is bound. There is no attempt to "combine" the two sets.
The exceptions raised have the format
system(dp(entity:E conditions:FS op:OP) ...)
where the four arguments are defined as follows:
E
is the entity on which the operation was attempted. A temporary limitation of the current release is that if the entity is an object, then E
is undefined.
FS
is the list of actual fault states occurring at the site on which the operation was attempted. This list is a subset of the list for which fault detection was enabled. Each fault state in FS
may have an extra field info
that gives additional information about the fault. The possible elements of FS
are currently the following:
tempFail(info:I)
and permFail(info:I)
, where I
is in {state
, owner
}. The info
field only exists for objects, cells, and locks.
remoteProblem(tempSome)
, remoteProblem(permSome)
, remoteProblem(tempAll)
, and remoteProblem(permAll)
.
OP
indicates which attempted operation caused the exception. The possible values of OP
are currently:
For logic variables: bind(T)
, wait
, and isDet
, where T
is what the variable was attempted to be bound with.
For cells: cellExchange(Old New)
, cellAssign(New)
, and cellAccess(Old)
, where Old
is the cell content before the attempted operation and New
is the cell content after the attempted operation.
For locks: 'lock'
.2
For ports: send(Msg)
, where Msg
is the message attempted to be sent to the port.
For objects: objectExchange(Attr Old New)
, objectAssign(Attr New)
, objectAccess(Attr Old)
, and objectFetch
, where Attr
is the name of the object attribute, Old
is the attribute value before the attempted operation, and New
is the attribute value after the attempted operation. A limitation of the current release is that the attempted operation cannot be retried. The objectFetch
operation exists because object-records are copied lazily: the first time the object is used, the object-record is fetched over the network, which might fail.
There are three levels of fault detection, namely default site-based, site-based, and thread-based. A more specific level, if it exists, overrides a more general level. The most general is default site-based, which determines what exceptions are raised if the entity is not enabled at the site or thread level. Next is site-based, which detects a fault for a specific entity when an operation is tried on one particular site. Finally, the most fine-grained is thread-based, which detects a fault for a specific entity when an operation is tried in a particular thread.
The site-based and thread-based levels have to be enabled specifically for a given entity. The function {Fault.enable Entity Level FStates}
is used, where Level
is either site
or 'thread'(T)
. The thread T
is either the atom this
(which means the current thread), or a thread identifier. Any thread's identifier can be obtained by executing {Thread.this T}
in the thread.
The thread-based level is the most specific; if it is enabled it overrides the two others in its thread. The site-based level, if it is enabled, overrides the default. If neither a thread-based nor a site-based level are enabled, then the default is used. Even if the actual fault state does not give an exception, the mere fact that a level is enabled always overrides the next higher level.
For example, assume that the cell C
is on a site with default detection for [tempFail permFail]
and thread-based detection for [permFail]
in thread T1
. What happens if many threads try to do an exchange if C's actual fault state is tempFail
? Then thread T1
will block, since it is set up to detect only permFail
. All other threads will raise the exception tempFail
, since the default covers it and there is no enable at the site or thread levels. Thread T1
will continue the operation when and if the tempFail
state goes away.
The Fault
module has both sited and unsited operations. Both setting the default and enabling at the site level are sited. This protects the site from remote attempts to change its settings. Enabling at the thread level is unsited. This allows fault-tolerant abstractions to be network-transparent, i.e., when passed to another site they continue to work.
To be precise, the calls {Fault.enable E site ...}
and {Fault.install E site ...}
, will only work on the home site of the Fault
module. A procedure containing these calls may be passed around the network at will, and executed anywhere. However, any attempt to execute either call on a site different from the Fault
module's home site will raise an exception.3 The calls {Fault.enable E 'thread'(T) ...}
and {Fault.install E 'thread'(T) ...}
will work anywhere. A procedure containing these calls may be passed around the network at will, and will work correctly anywhere. Of course, since threads are sited, T
has to identify a thread on the site where the procedure is executed.
The basic model lets you set up the system to raise an exception when an operation is attempted on a faulty entity. The advanced model extends this to call a user-defined procedure. Furthermore, the advanced model can call the procedure synchronously, i.e., when an operation is attempted, or asynchronously, i.e., as soon as the fault is known, even if no operation is attempted. In the synchronous case, the procedure is called a fault handler, or just handler. In the asynchronous case, the procedure is called watcher.
When an operation is attempted on an entity with a problem, then a handler call replaces the operation. This call is done in the context of the thread that attempted the operation. If the entity works again later (which is possible with tempFail
and remoteProblem
) then the handler can just try the operation again.
In an exact analogy to the basic model, a fault handler can be installed on a given entity at a given level for a given set of fault states. The possible entities, levels, and fault states are exactly the same. What happens to handlers on logic variables when the variables are bound is exactly the same as what happens to enabled exceptions in Section 4.2.2. For example, when a variable with handler H_v1
is bound to another variable with handler H_v2
, then the result has exactly one handler, say H_v2
. The other handler H_v1
is discarded. When a variable with handler is bound to a port with handler, then the port's handler survives and the variable's handler is discarded.
Handlers are installed and deinstalled with the following two built-in operations:
fun {Fault.install Entity Level FStates HandlerProc}
installs handler HandlerProc
on Entity
at Level
for fault states FStates
. If an operation is attempted and there is a fault in FStates
, then the operation is replaced by a call to HandlerProc
. At most one handler can be installed on a given entity at a given level.
fun {Fault.deInstall Entity Level}
deinstalls a previously installed handler from Entity
at Level
.
The function Fault.install
returns true
if and only if the installation was successful, i.e., the entity did not already have an installation or an enable for that level. The function Fault.deInstall
returns true
if and only if the deinstall was successful, i.e., the entity had a handler installed for that level. Both functions raise a type error exception if their arguments do not have the correct type.
A handler HandlerProc
is a three-argument procedure that is called as {HandlerProc E FS OP}
. The arguments E
, FS
, and OP
, are exactly the same as in a distribution exception.
A modification of the current release with respect to handler semantics is that handlers installed on variables always retry the operation after they return.
Fault handlers detect failure synchronously, i.e., when an operation is attempted. One often wants to be informed earlier. The advanced model allows the application to be informed asynchronously and eagerly, that is, as soon as the site finds out about the failure. Two operations are provided:
fun {Fault.installWatcher Entity FStates WatcherProc}
installs watcher WatcherProc
on Entity
for fault states FStates
. If a fault in FStates
is detected on the current site, then WatcherProc
is invoked in its own new thread. A watcher is automatically deinstalled when it is invoked. Any number of watchers can be installed on an entity. The function always returns true
, since it is always possible to install a watcher.
fun {Fault.deInstallWatcher Entity WatcherProc}
deinstalls (i.e., removes) one instance of the given watcher from the entity on the current site. If no instance of WatcherProc
is there to deinstall, then the function returns false
. Otherwise, it returns true
.
A watcher WatcherProc
is a two-argument procedure that is called as {WatcherProc E FS}
. The arguments E
and FS
are exactly the same as in a distribution exception or in a handler call.
This section explains the possible fault states of each language entity in terms of its distributed semantics. The fault state is a consequence of two things: the entity's distributed implementation and the system's failure mode. For example, let's consider a variable. There is one owner site and a set of proxy sites. If a variable proxy is on a crashed site and the owner site is still working, then to another variable proxy this will be a remoteProblem
. If the owner site crashes, then all proxies will see a permFail
.
Eager stateless data, namely records, procedures, functions, classes, and functors, are copied immediately in messages. There are no remote references to eager stateless data, which are always local to a site. So their only possible fault state is ok
.
In future releases, procedures, functions, and functors will not send their code immediately in the message, but will send only their global name. Upon arrival, if the code is not present, then it will be immediately requested. This will guarantee that code is sent at most once to a given site. This will introduce fault states tempFail
and permFail
if the site containing the code becomes unreachable or crashes.
Sited entities can be referenced remotely but can only be used on their home site. Attempting to use one outside of its home site immediately raises an exception. Detecting this does not need any network operations. So their only possible fault state is ok
.
A port has one owner site and a set of proxy sites. The following failure modes are possible:
Normal operation (ok
).
Owner site down (permFail
and remoteProblem(I)
where I
is both permSome
and permAll
).
Owner site unreachable (tempFail
).
A port has a single operation, Send
, which can complete if the fault state is ok
. The Send
operation is asynchronous, that is, it completes immediately on the sender site and at some later point in time it will complete on the port's owner site. The fact that it completes on the sender site does not imply that it will complete on the owner site. This is because the owner site may fail.
A logic variable has one owner site and a set of proxy sites. The following failure modes are possible:
Normal operation (ok
).
Owner site down (permFail
and remoteProblem(I)
where I
is both permSome
and permAll
).
Owner site unreachable (tempFail
).
Some or all proxy sites down (remoteProblem(I)
where I
is both permSome
and permAll
).
Some or all proxy sites unreachable (remoteProblem(tempSome))
). It is impossible to have remoteProblem(tempAll)
in the current implementation.
A logic variable has two operations, binding and waiting until bound. Bind operations are explicit in the program text. Most wait operations are implicit since threads block until their data is available. The bind operation will only complete if the fault state is ok
or remoteProblem
.
If the application binds a variable, then its wait operation is only guaranteed to complete if the fault state is ok
. When it completes, this means that another proxy has bound the variable. If the fault state is remoteProblem
, then the wait operation may not be able to complete if the problem exists at the proxy that was supposed to bind the variable. This is not a tempFail
or permFail
, since a third proxy can successfully bind the variable. But from the application's viewpoint, it may still be important to know about this problem. Therefore, the fault state remoteProblem
is important for variables.
A common case for variables is the client-server. The client sends a request containing a variable to the server. The server binds the variable to the answer. The variable exists only on the client and server sites. In this case, if the client detects a remoteProblem
then it knows that the variable binding will be delayed or never done.
Cells and locks have almost the same failure behavior. A cell or lock has one owner site and a set of proxy sites. At any given time instant, the cell's state pointer or the lock's token is at one proxy or in the network. The following failure modes are possible:
Normal operation (ok
).
State pointer not present and owner site down (permFail(info:owner)
and remoteProblem(permSome)
).
State pointer not present and owner site unreachable (tempFail(info:owner)
).
State pointer lost and owner site up (permFail(info:state)
, remoteProblem(permAll)
, and remoteProblem(permSome)
). This failure mode is only possible for cells. If a lock token is lost then the owner recreates it.
State pointer unreachable and owner site up (tempFail(info:state)
).
State pointer present and owner site down (remoteProblem(permAll)
and remoteProblem(permSome)
).
State pointer present and owner site unreachable (remoteProblem(tempAll)
and remoteProblem(tempSome)
).
A cell has one primitive operation, a state update, which is called Exchange
. A lock has two implicit operations, acquiring the lock token and releasing it. Both are implemented by the same distributed protocol.
An object consists of an object-record that is a lazy chunk and that references the object's features, a cell, and a class. The object-record is lazy: it is copied to the site when the object is used for the first time. This means that the following failure modes are possible:
Normal operation (ok
).
Object-record or state pointer not present and owner site down (permFail(info:owner)
and remoteProblem(permSome)
).
Object-record or state pointer not present and owner site unreachable (tempFail(info:owner)
).
State pointer lost and owner site up (permFail(info:state)
, remoteProblem(permAll)
, and remoteProblem(permSome)
).
State pointer unreachable and owner site up (tempFail(info:state)
).
Object-record and state pointer present and owner site down (remoteProblem(permAll)
and remoteProblem(permSome)
).
Object-record and state pointer present and owner site unreachable (remoteProblem(tempAll)
and remoteProblem(tempSome)
).
Compared to cells, objects have two new failure modes: the object-record can be temporarily or permanently absent. In both cases the object cannot be used, so we simply consider the new failure modes to be instances of tempFail
and permFail
.
<< Prev | - Up - | Next >> |
thread
is already used as a keyword in the language, it has to be quoted to make it an atom.lock
is already used as a keyword in the language, it has to be quoted to make it an atom.Fault
module.