The MPLS WG Archive


[Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]

  • From: neil.2.harrison@bt.com
  • Date: Sat, 4 Aug 2001 14:27:44 +0100
  • Cc: mpls@UU.NET

Ping wrote:
> Sent: 26 July 2001 01:22
> To: Brijesh Kumar
> Cc: mpls-list
> Subject: Re: [Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
> 
> 
> > Brijesh Kumar wrote:
> > 
> > From this draft:
> > 
> >    When the ingress LSR suspects that the LSP may have failed and the
> >    RSVP control plane shows the LSP as operational, the ingress LSR MUST
> >    send LSP-ping messages to the egress over the LSP, periodically. The
> >    value of the time interval should be configurable.
> > 
> > Could you please tell me what you mean by "LSR suspects"? In other
> > words, what is the triggering input for the LSR to start "suspecting"
> > that the LSP may have failed?
> > 
> > It would be better if you replaced "suspect" with "detect" in your draft,
> > as suspect is an imprecise human emotion and no operator would like
> > routers that have negative emotions ;-).
> > 
> 
> It's network operators who suspect that something may have gone wrong
> with some LSPs....
NH=> How do they 'suspect' this?....is some crystal ball going to be
supplied with the implementation?  It is not acceptable to have to rely on
customers to complain first.  So what is the detection mechanism for the
defects?  You need to explain this more precisely...see also later comments.
> 
> > In addition, I do have issues with introducing a mechanism dependent
> > on a particular signalling protocol for a generic problem - testing
> > liveness of the label switched data path. 
> 
> OK. Do you have problem with the mechanism itself?
NH=> I have some problems with it.
Coming back from 2 weeks' leave and trying to wade through the large volume
of discussion this draft has generated tells me there is a fundamental lack
of clarity about the problem being solved here.  We really should get the
problem statement agreed and also agree a set of OAM arch principles to
prevent bad practice.  In summary, here is what I propose...and I would
appreciate some positive feedback on each of these statements if there is
disagreement:

1	Problem statement:

Operators need to determine the operational status of LSPs.  They need
simple management tools to do this and operational costs must be driven down
(ie we can't afford an army of CCIEs to run networks).  As a minimum, this
requires all defects to be identified and specified in terms of entry/exit
criteria and the appropriate consequent actions to be taken in order to (i)
protect customer traffic and (ii) allow simple diagnostics for operational
staff.  This is also required so that the operators/customers can define
consistent and measurable availability SLAs and, as a dependent corollary,
QoS SLAs (since QoS only has any meaning when the LSP is in the available
state).  Customer/operator SLAs are increasing in importance and this needs
to be recognised.

2	Arch principles:
Note - although stated against the MPLS case, these can (and should) be
generalised to apply to any network...including the GMPLS cases (where
perhaps they may be more obvious to some).

2.1	the data-plane and control-plane must have independent OAM.  Note
that the control-plane must use some data-plane (not necessarily the same as
that used for customer traffic) and hence needs both its own data-plane OAM
and OAM (aka error handling) which is specific to each of the
control-plane protocols in use....so that would be specific OAM for RSVP-TE,
CR-LDP, LDP, OSPF, etc.

2.2	cross-layer network data-plane OAM dependencies must be avoided.  In
other words, one should not rely on OAM in technology X to detect/diagnose
defects in technology Y (whether X is a client or server layer wrt Y).  This
requirement is also necessary for evolution and backwards compatibility, ie
to allow layer networks/protocols to be added/removed/modified independently
and without affecting other layer networks/protocols.

2.3	cross-plane OAM dependencies must be avoided.  For example, the
data-plane OAM should function independently of the signalling/routing
protocols being used (including the case of no control-plane).  This has
similar evolution and backwards compatibility issues as in 2.2.

2.4	customers must not be expected to act as the default defect
detection devices.  Nor must customer traffic activity be used as part of
the defect detection function (since users can be quiescent).  Note that
defect detection and defect diagnosis are 2 distinct functions.  This also
implies that defect detection needs to be a continuous function (at least on
important LSPs).
BTW - IMO QoS measurements are of 2nd-order importance compared to defect
handling.  QoS should be tackled as a network design and TE issue, with
suitable verification via population or ad hoc sampling as needed, and not
via overkill measurements which can be both costly and swamp operations
people.

2.5	OAM mechanisms should be designed to function correctly under
failure conditions, ie OAM mechanisms must not rely on 'well-behaved'
network behaviour after a defect has occurred since, by definition, the fact
that the network has failed implies its behaviour cannot be expected to be
predictable.

2.6	A failure in a server layer network should not create alarm storms
in client layer networks, noting that client layer networks may be owned by
different organisations and have a large geographical dispersion.

2.7	data-plane LSP OAM should not require a return LSP to
function....and even with bi-directional LSPs each direction should be
monitored independently.

2.8	failures of the control-plane (either the data-plane aspect or any
one of the control-plane protocols that run on this) should not result in
failures of the customer traffic data-plane where such trails are permanent
or semi-permanent (this is more obvious in the GMPLS case).

The Ping draft either does not address or fails to meet many of these
requirements.  More specifically:
-	it does not identify/specify any defects;
-	it cannot detect any defects unless it is running continuously;
-	it could not detect or diagnose all potential defects even if
running continuously, eg mismerging defects;
-	it relies on both cross-layer (ie ICMP is an IP layer OAM mechanism)
and cross-plane (ie it is only defined for the RSVP-TE case) functional
dependencies.  Although this expediency may have some near-term benefit(?)
it is not good arch practice.
-	since defects are not detected/defined it is unclear (i) what
consequent actions are to be taken to protect customer traffic (none it
seems), (ii) how this will help operators target diagnostic activities or
(iii) how it will prevent client layer alarm storms.
-	it's all very well having variable behaviour (wrt timers) but this
will create complexities elsewhere, ie inconsistent entry/exit criteria for
defects and availability states, and no clear/consistent datums for when QoS
metrics collection (if being measured) is valid, ie when to start/stop QoS
metric aggregation.  So this means complexities in SLA derivations.
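
To make the point about entry/exit criteria concrete, here is a minimal
sketch of fixed criteria driving an availability state that in turn gates QoS
metric aggregation.  Everything in it is an assumption for illustration only:
the class and field names are hypothetical, and the thresholds of three
consecutive intervals are placeholders, not values taken from any draft.

```python
from dataclasses import dataclass, field
from typing import List

ENTRY_BAD_INTERVALS = 3   # assumed entry criterion (placeholder value)
EXIT_GOOD_INTERVALS = 3   # assumed exit criterion (placeholder value)


@dataclass
class AvailabilityTracker:
    """Hypothetical tracker: fixed entry/exit criteria, QoS gated on state."""
    available: bool = True
    bad_run: int = 0
    good_run: int = 0
    qos_samples: List[float] = field(default_factory=list)

    def interval(self, ok: bool, delay_ms: float) -> None:
        """Record the outcome of one fixed-length monitoring interval."""
        if ok:
            self.good_run += 1
            self.bad_run = 0
        else:
            self.bad_run += 1
            self.good_run = 0
        # Entry and exit use the same fixed rules on every LSP, so the
        # availability state is derived consistently everywhere.
        if self.available and self.bad_run >= ENTRY_BAD_INTERVALS:
            self.available = False
        elif not self.available and self.good_run >= EXIT_GOOD_INTERVALS:
            self.available = True
        # QoS only has meaning in the available state: aggregate only then.
        if self.available and ok:
            self.qos_samples.append(delay_ms)
```

With fixed criteria, the start and stop points for QoS aggregation follow
directly from the state transitions.  If the thresholds varied per LSP or per
vendor, two parties measuring the same LSP could disagree on both its
availability and its QoS figures.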

So, whilst I would certainly not want to stop any vendor implementing this
draft, as an operator with responsibilities for the integrity of my
customers' traffic and a need to define SLAs for such, I find it
professionally very difficult to condone it becoming a 'standard'.

> 
> > Secondly, can we have some
> > input/justification with regards to the assumption that there is a
> > need to detect such a failure since only cause for such a failure
> > given in the draft is "memory Corruption". What causes so called
> > memory corruption? Are we trying to repair badly written code here?
> > 
> 
> The problem did happen in the real network. I'm not going to point
> fingers at the cause of the problem. But we (vendors and providers) did
> realize from that incident that we need to have a simple mechanism to
> check an LSP's data plane in today's operational networks.
NH=> Some time ago I was hearing comments such as 'these defects won't
happen' and 'don't fix software bugs with additional software' (an erroneous
argument anyway).  So having now established that defects do exist, I fully
agree we need some way of detecting/diagnosing such defects.  But this does
not seem to be what is being advocated here, since the mechanism does not
help operators to detect the defects in the first place.  To address the
detection aspect some mechanism has to run continuously, and a technique
that is unidirectional would seem to be simpler/better than one that
requires a return path.  At some point one needs to raise an alarm and take
other consequent actions (eg maybe even squelching customer traffic if
integrity/security could be compromised, and stopping any QoS metric
aggregation (if being used)) and raise a vertical indication to the NMS...so
what better place to do this than at the point where the defect is being
initially detected, ie the LSP sink?

Given all this, could those advocating the Ping draft please explain what is
actually wrong with the more comprehensive solution we originally proposed
(in draft-harrison-mpls-oam-00.txt).....and in particular, the use of the CV
flow.....which is really nothing more than a keepalive running from source to
sink with a unique source identifier and deterministic behaviour....surely
it can't be much simpler than this, can it?
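
As a rough illustration of how little the sink needs for such a keepalive,
here is a minimal sketch of a sink-side check.  It is not taken from
draft-harrison-mpls-oam-00.txt: the class name, the defect labels, the
one-second send interval and the three-interval loss threshold are all
assumptions made purely for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CV_INTERVAL_S = 1.0   # assumed send interval (placeholder, not from the draft)
LOSS_THRESHOLD = 3    # assumed number of missed intervals before declaring loss


@dataclass
class CvSink:
    """Hypothetical sink-side monitor for a one-way keepalive flow."""
    expected_source: str
    last_good_rx: float = 0.0   # arrival time of the last expected packet
    defects: List[str] = field(default_factory=list)

    def on_cv(self, source_id: str, now: float) -> None:
        """Handle one received keepalive packet."""
        if source_id == self.expected_source:
            self.last_good_rx = now
        else:
            # A keepalive from another source has leaked into this LSP.
            self._raise(f"mismerge:{source_id}")

    def poll(self, now: float) -> Optional[str]:
        """Run periodically at the sink; needs no return path to the source."""
        if now - self.last_good_rx > LOSS_THRESHOLD * CV_INTERVAL_S:
            self._raise("loss-of-cv")
        return self.defects[-1] if self.defects else None

    def _raise(self, defect: str) -> None:
        # Consequent actions (alarm, squelch, stop QoS aggregation) would
        # hang off this point; here the defect is only recorded once.
        if defect not in self.defects:
            self.defects.append(defect)
```

The unique source identifier is what lets the sink distinguish a mismerge
from a healthy flow, and the timeout covers simple loss of connectivity.
Both checks run at the sink alone, whatever signalling protocol (if any) set
up the LSP.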
> 
> The whole purpose of the various proposals (fast reroute, graceful restart
> and LSP-ping) is to make the whole network work nicely, even when we
> have routers from other vendors that have badly written code.
> 
> - Ping
>