The IP Over NBMA (ION) Archive[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index][Thread Index][Author Index][Subject Index] multipoint synchro protocol (long)
Hello,
We are working on cache synchronization for LIS, LAG, clusters, ... and
we are defining a multicast synchronization protocol (MSP). A
justification of our work and a brief description of the proposed protocol
follows. All comments and suggestions are welcome. What do you think
about the following protocol and ideas ?
Marc De Preter & Eric Mannie
Brussels University - STC
mannie@helios.iihe.rtt.be
Protocol justification :
--------------------------
Multicast Synchronization Protocol (MSP) - NBMA
This protocol is an attempt to define a multicast synchronization protocol
in order to avoid problems related to the use of traditional unicast (pt-to-
pt) synchronization protocols (such as encountered with OSPF, P-NNI,
SCSP, epidemic protocols,...). Such protocols imply the establishment and
maintenance of a topology of servers (tree, star, line, mesh,...). It is not
obvious to find neither the best topology for a given synchronization
protocol nor the best algorithm which allows to create this topology. An
attempt to study the influence of the spatial distribution (topology) has
been made, for instance, in Epidemic algorithms by Xerox Parc, which
showed interesting results. Moreover, traditional synchronization
protocols notably imply a convergence time and traffic which is
proportional to the size of the topology. We believe that reducing the
topology to a set of members of a single multicast group could reduce
both convergence time and traffic. Note that in that case, no configuration
algorithm is required.
Brief description of the proposed protocol :
--------------------------------------------------
MSP allows to synchronize a replicated database between various servers
which are all members of the same server group (SG). It takes advantage
of multicast capabilities provided by a growing number of underlying
layers. It is suitable in environments supporting multicast (group)
addresses (such as ethernet and SMDS) or point-to-multipoint connections
(such as ATM). In this context all servers can directly communicate with
all servers in the same group, using the multicast capability. No particular
topology (except the underlying multicast topology itself) nor
configuration algorithm (such as the Spanning Tree,...) are required. No
problem due to critical links and topology partitions occurs. This protocol
is very robust as an update generated by a server is directly received by
all other servers in the same group. Finally, MSP is a generic protocol
and is defined independently of the particular database or cache to
synchronize.
Each server is the owner of a part of the database (e.g. bindings by local
clients) and has a unique server id. It maintains a copy of the complete
database (replicated database), each entry is either locally owned or
learned from another server. An entry which belongs to a particular
server is tagged with a sequence number (event time) and the
server id of its owner. The sequence number is unique in the
context of the owner.
All servers are directly connected by a multicast group or a mesh of
point-to-multipoint connections. In a near future, when multipoint-
to-multipoint ATM connections will be available, the need to
support n point-to-multipoint ATM connections will be suppressed
when n servers have to be synchronized. It should be noted that
these connections are supported by servers only (never by clients)
and that these servers could be implemented directly over ATM
switches.
Each update generated by a server is sent with the server's id and
the next value of the sequence number. This update is directly
received by all other servers and each local database is updated
accordingly. Updates are packed in pdus which are logically
grouped into transactions. The beginning and the end of a
transaction are marked by specific flags in the first and last pdu of
that transaction.
Pdus are not positively acknowledged (ACK) by receiving servers. If
a server detects a gap in the sequence numbers from another
server it sends a NACK in the multicast group for the given set of
sequence numbers. Transactions allow to speed up the detection of
a gap when the last pdu is lost. In order to avoid an implosion of
concurrent NACKs and reduce the total number of transmitted
NACKs, a technique similar to IGMP is used. A server waits for a
delay randomly choosen between zero and D seconds before
sending a NACK. If, during this duration, another NACK is seen for
the same set (or subset or superset) of sequence numbers and for
the same server id, the NACK generation is cancelled and the server
waits for the retransmitted updates. This scheme is based on the
fact that if a pdu is lost in the context of a multicast group, there
are probably more than one server which have not received the
pdu. In addition, it should be noted that such pdus are essentially
small pdus (dynamic of updates, scalling issue: number of clients
per server) and that the expected error rate of the underlying layer
could be very low (e.g. ATM).
Small hello pdus are generated periodically at each hello interval
and include the server id and its current sequence number. Hello
pdus are needed to detect the lost of a set of complete transactions.
A sequence number does not identify an information (binding)
itself but an instance of this information. A binding for the same
key (e.g. IP address) but with two different values (e.g. ATM
addresses) has two different sequence numbers. A sequence
number is a kind of timestamp and could be defined as the
concatenation of a standard time (e.g. GMT) and a real sequence
number.
When a server or a connection (re)starts, either all updates have
been lost, or, only a part of the updates have not been received. The
corresponding server sends a SYNC pdu to the multicast group,
either indicating that all the database has to be retransmitted, or,
listing each server id and its last received sequence number. SYNC
are not sent directly; the same scheme as the one used for NACKs is
used again; it allows a server to take advantage of the
resynchronization requested by another server. Receiving servers
send updates (transactions) for the missing information, and could
use the same scheme as the one used with NACKs and SYNCs. Each
server only sends updates for information it owns and only if their
sequence number is greater than the requested one. If some
sequence numbers are not available any longer, it means that the
corresponding information has been updated by a more recent
update and that only this last update has to be retransmitted by the
owner. Only more recent updates are retransmitted. Selective SYNCs
are resent if a part of the requested updates has not been received,
obsolete sequence numbers are advertised.
If all members of a multicast group or a large part of these
members have lost their connectivity at the same time, the
previous scheme is very efficient (open issue: is it less frequent
than with a point-to-point topology of servers). A complete
(re)transmission of the database could possibly occur in a separate
multicast group or even on a on-demand point-to-multipoint or
point-to-point connection. If it occurs on the normal point-to-
multipoint connection or multicast group, we could argue that it
adds robustness to the protocol.
For the configuration point of view, only a single multicast address
or a list of server addresses has to be configured (e.g. obtained
through a "LEC" like configuration server). No particular topology
has to be build, no configuration algorithm is needed, no problem
occurs if a server fails to rebuild the topology !
Finally, if we only consider a topology of servers connected by
point-to-point connections, MSP acts more or less like a traditional
synchronization protocol; except mainly that each update pdu
doesn't need to be acked one by one, and that after a failure the
complete database summary doesn't necessarely have to be
exchanged in two ways each time. In this last case, only two small
SYNC pdus are exchanged and each server acts as a proxy for the
information owned by other servers. A summary of each database
entry is not exchanged but only a short list of server ids and
sequence numbers (events) is used.
S1 S2 S3 S1 S2
| | | \ /\
| | | \/ \
+-----------------+ S3 S4
| | /
| Multicast Group |---S4 /
| | S5-----S6
+-----------------+ /
| | | /
| | | S7
S5 S6 S7
Multicast topology vs. traditional topology
|
|