The IP Over NBMA (ION) Archive

Cell Relay Retreat>ION Archive>month:1996-Jul> msg00082



[Date Prev][Date Next][Thread Prev][Thread Next]  
  [Date Index][Thread Index][Author Index][Subject Index]

multipoint synchro protocol (long)

  • From: Eric Mannie <mannie@helios.iihe.ac.be>
  • Date: Thu, 18 Jul 1996 17:42:28 +0200
  • X400-Content-Type: P2-1984 (2)
  • X400-MTS-Identifier: [/PRMD=iihe/ADMD=rtt/C=be/;960718154228]
  • X400-Originator: mannie@helios.iihe.ac.be
  • X400-Received: by mta sun1.iihe.ac.be in /PRMD=IIHE/ADMD=RTT/C=BE/; Relayed; Thu, 18 Jul 1996 15:43:55 +0200
  • X400-Received: by mta elem5 in /PRMD=IIHE/ADMD=RTT/C=BE/; Relayed; Thu, 18 Jul 1996 17:42:28 +0200
  • X400-Received: by /PRMD=iihe/ADMD=rtt/C=be/; Relayed; Thu, 18 Jul 1996 17:42:28 +0200
  • X400-Recipients: non-disclosure:;

Hello,

We are working on cache synchronization for LIS, LAG, clusters, ... and 
we are defining a multicast synchronization protocol (MSP). A 
justification of our work and a brief description of the proposed protocol 
follows. All comments and suggestions are welcome. What do you think 
about the following protocol and ideas ?

Marc De Preter & Eric Mannie
Brussels University - STC
mannie@helios.iihe.rtt.be


Protocol justification :
--------------------------

Multicast Synchronization Protocol (MSP) - NBMA

This protocol is an attempt to define a multicast synchronization protocol 
in order to avoid problems related to the use of traditional unicast (pt-to-
pt) synchronization protocols (such as encountered with OSPF, P-NNI, 
SCSP, epidemic protocols,...). Such protocols imply the establishment and 
maintenance of a topology of servers (tree, star, line, mesh,...). It is not 
obvious to find neither the best topology for a given synchronization 
protocol nor the best algorithm which allows to create this topology. An 
attempt to study the influence of the spatial distribution (topology) has 
been made, for instance, in Epidemic algorithms by Xerox Parc, which 
showed interesting results. Moreover, traditional synchronization 
protocols notably imply a convergence time and traffic which is 
proportional to the size of the topology. We believe that reducing the 
topology to a set of members of a single multicast group could reduce 
both convergence time and traffic. Note that in that case, no configuration 
algorithm is required.

Brief description of the proposed protocol :
--------------------------------------------------

MSP allows to synchronize a replicated database between various servers 
which are all members of the same server group (SG). It takes advantage 
of multicast capabilities provided by a growing number of underlying 
layers. It is suitable in environments supporting multicast (group) 
addresses (such as ethernet and SMDS) or point-to-multipoint connections 
(such as ATM). In this context all servers can directly communicate with 
all servers in the same group, using the multicast capability. No particular 
topology (except the underlying multicast topology itself) nor 
configuration algorithm (such as the Spanning Tree,...) are required. No 
problem due to critical links and topology partitions occurs. This protocol 
is very robust as an update generated by a server is directly received by 
all other servers in the same group. Finally, MSP is a generic protocol 
and is defined independently of the particular database or cache to 
synchronize.

Each server is the owner of a part of the database (e.g. bindings by local 
clients) and has a unique server id. It maintains a copy of the complete 
database (replicated database), each entry is either locally owned or 
learned from another server. An entry which belongs to a particular 
server is tagged with a sequence number (event time) and the 
server id of its owner. The sequence number is unique in the 
context of the owner.

All servers are directly connected by a multicast group or a mesh of 
point-to-multipoint connections. In a near future, when multipoint-
to-multipoint ATM connections will be available, the need to 
support n point-to-multipoint ATM connections will be suppressed 
when n servers have to be synchronized. It should be noted that 
these connections are supported by servers only (never by clients) 
and that these servers could be implemented directly over ATM 
switches.

Each update generated by a server is sent with the server's id and 
the next value of the sequence number. This update is directly 
received by all other servers and each local database is updated 
accordingly. Updates are packed in pdus which are logically 
grouped into transactions. The beginning and the end of a 
transaction are marked by specific flags in the first and last pdu of 
that transaction.

Pdus are not positively acknowledged (ACK) by receiving servers. If 
a server detects a gap in the sequence numbers from another 
server it sends a NACK in the multicast group for the given set of 
sequence numbers. Transactions allow to speed up the detection of 
a gap when the last pdu is lost. In order to avoid an implosion of 
concurrent NACKs and reduce the total number of transmitted 
NACKs, a technique similar to IGMP is used. A server waits for a 
delay randomly choosen between zero and D seconds before 
sending a NACK. If, during this duration, another NACK is seen for 
the same set (or subset or superset) of sequence numbers and for 
the same server id, the NACK generation is cancelled and the server 
waits for the retransmitted updates. This scheme is based on the 
fact that if a pdu is lost in the context of a multicast group, there 
are probably more than one server which have not received the 
pdu. In addition, it should be noted that such pdus are essentially 
small pdus (dynamic of updates, scalling issue: number of clients 
per server) and that the expected error rate of the underlying layer 
could be very low (e.g. ATM).

Small hello pdus are generated periodically at each hello interval 
and include the server id and its current sequence number. Hello 
pdus are needed to detect the lost of a set of complete transactions.

A sequence number does not identify an information (binding) 
itself but an instance of this information. A binding for the same 
key (e.g. IP address) but with two different values (e.g. ATM 
addresses) has two different sequence numbers. A sequence 
number is a kind of timestamp and could be defined as the 
concatenation of a standard time (e.g. GMT) and a real sequence 
number.

When a server or a connection (re)starts, either all updates have 
been lost, or, only a part of the updates have not been received. The 
corresponding server sends a SYNC pdu to the multicast group, 
either indicating that all the database has to be retransmitted, or, 
listing each server id and its last received sequence number. SYNC 
are not sent directly; the same scheme as the one used for NACKs is 
used again; it allows a server to take advantage of the 
resynchronization requested by another server. Receiving servers 
send updates (transactions) for the missing information, and could 
use the same scheme as the one used with NACKs and SYNCs. Each 
server only sends updates for information it owns and only if their 
sequence number is greater than the requested one. If some 
sequence numbers are not available any longer, it means that the 
corresponding information has been updated by a more recent 
update and that only this last update has to be retransmitted by the 
owner. Only more recent updates are retransmitted. Selective SYNCs 
are resent if a part of the requested updates has not been received, 
obsolete sequence numbers are advertised.

If all members of a multicast group or a large part of these 
members have lost their connectivity at the same time, the 
previous scheme is very efficient (open issue: is it less frequent 
than with a point-to-point topology of servers). A complete 
(re)transmission of the database could possibly occur in a separate 
multicast group or even on a on-demand point-to-multipoint or 
point-to-point connection. If it occurs on the normal point-to-
multipoint connection or multicast group, we could argue that it 
adds robustness to the protocol.

For the configuration point of view, only a single multicast address 
or a list of server addresses has to be configured (e.g. obtained 
through a "LEC" like configuration server). No particular topology 
has to be build, no configuration algorithm is needed, no problem 
occurs if a server fails to rebuild the topology !

Finally, if we only consider a topology of servers connected by 
point-to-point connections, MSP acts more or less like a traditional 
synchronization protocol; except mainly that each update pdu 
doesn't need to be acked one by one, and that after a failure the  
complete database summary doesn't necessarely have to be 
exchanged in two ways each time. In this last case, only two small 
SYNC pdus are exchanged and each server acts as a proxy for the 
information owned by other servers. A summary of each database 
entry is not exchanged but only a short list of server ids and 
sequence numbers (events) is used.
 

  S1     S2     S3                    S1  S2
  |      |      |                      \  /\
  |      |      |                       \/  \
 +-----------------+                    S3  S4
 |                 |                    /
 | Multicast Group |---S4              /
 |                 |                  S5-----S6
 +-----------------+                  /
  |      |      |                    /
  |      |      |                   S7
  S5     S6     S7

       Multicast topology   vs. traditional topology