The MPLS WG Archive


[Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]

  • From: Robert Raszuk <raszuk@cisco.com>
  • Date: Sat, 04 Aug 2001 14:25:12 -0700
  • CC: pingpan@juniper.net, Brijesh@coronanetworks.com, mpls@UU.NET
  • Organization: Signature: http://www.employees.org/~raszuk/sig/

Hi Neil,

> Given all this, could those advocating the Ping draft please explain what is
> actually wrong with the more comprehensive solution we originally proposed
> (in draft-harrison-mpls-oam-00.txt).....

Let me try ...

First let me say that your draft seems to be architecturally perfect.
Unfortunately, in practice it may meet the following obstacles:

* It is very CPU intensive. For each LSP it requires continuous
processing of incoming CV packets in the transit nodes of that LSP.
Imagine that you have 15000 LSPs in a transit node and each one
periodically sends you a CV packet. (Ping's draft does not require any
CPU work per transit LSP.)

* When a failure is detected, I don't see how it guarantees that the
notification propagates successfully to the head end. I can easily think
of cases where your returning FDI or BDI signals get lost due to purely
forwarding bugs. (Ping's draft uses a control channel for the return,
which is working correctly.)

* It requires a software upgrade on 100% of the routers in a given
network (I am sure you realize this is quite a challenge in large networks).

* It requires simultaneous support from all vendors on a given network.
(Is this realistic? :)

* With all of its complications it does not seem to guarantee the
detection of LDP LSP failures, only TE LSP failures. (Sure, Ping's draft
does not address this yet either, but it is much simpler.)

* I don't see in Ping's draft any requirement for end-user (client)
action to detect an LSP failure. Since you have raised this argument a
lot, could you elaborate a bit on where in his draft you see such a need?
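
To put a rough number on the first point above: the aggregate CV load on a transit node is just the LSP count times the per-LSP CV rate. The sketch below uses the 15000 LSPs mentioned above; the one-packet-per-second rate and the 20 microsecond per-packet cost are illustrative assumptions only, not values taken from either draft, and `cv_load` is a hypothetical helper name.

```python
def cv_load(num_lsps: int, cv_rate_hz: float, cost_per_packet_s: float):
    """Return (CV packets per second, fraction of one CPU) for a transit node."""
    pps = num_lsps * cv_rate_hz
    return pps, pps * cost_per_packet_s


# 15000 transit LSPs comes from the text above.
# The CV rate and the per-packet cost are assumed, hypothetical values.
pps, cpu_fraction = cv_load(15000, 1.0, 20e-6)
print(f"{pps:.0f} CV packets/s, {cpu_fraction:.0%} of one CPU")
```

The result scales linearly in both the LSP count and the CV rate, so the concern grows directly with the number of LSPs crossing the node.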

Rgs,
Robert


> neil.2.harrison@bt.com wrote:
> 
> Ping wrote:
> > Sent: 26 July 2001 01:22
> > To: Brijesh Kumar
> > Cc: mpls-list
> > Subject: Re: [Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
> >
> >
> > > Brijesh Kumar wrote:
> > >
> > > From this draft:
> > >
> > >    When the ingress LSR suspects that the LSP may have failed and the
> > >    RSVP control plane shows the LSP as operational, the ingress LSR
> > >    MUST send LSP-ping messages to the egress over the LSP,
> > >    periodically. The value of the time interval should be configurable.
> > >
> > > Could you please tell me what do you mean by "LSR suspects"? In other
> > > words, what is the triggering input for LSR to start "suspecting"
> > > the LSP may have failed?
> > >
> > > It will be better if you replace "suspect" with "detect" in your draft
> > > as suspect is an imprecise human emotion and no operator would like
> > > routers that have negative emotions ;-).
> > >
> >
> > It's the network operators who suspect something may have gone wrong with
> > some LSPs....
> NH=> How do they 'suspect' this?....is some crystal-ball going to be
> supplied with the implementation?  It is not acceptable to have to rely on
> customers to 1st complain.  So what is the detection mechanism for the
> defects, you need to explain this more precisely...see also later comments.
> >
> > > In addition, I do have issues with introducing a mechanism dependent
> > > on a particular signalling protocol for a generic problem - testing
> > > liveness of the label switched data path.
> >
> > OK. Do you have problem with the mechanism itself?
> NH=> I have some problems with it.
> Coming back from 2 weeks leave and trying to wade through the large volume
> of discussion this draft has generated tells me there is a fundamental lack
> of clarity of the problem being solved here.  We really should get the
> problem statement agreed and also agree a set of OAM arch principles to
> prevent bad practice. In summary here is what I propose...and I would
> appreciate some positive feedback on each of these statements if there is
> disagreement:
> 
> 1       Problem statement:
> 
> Operators need to determine the operational status of LSPs.  They need
> simple management tools to do this and operational costs must be driven down
> (ie we can't afford an army of CCIEs to run networks).  As a minimum, this
> requires all defects to be identified and specified in terms of entry/exit
> criteria and the appropriate consequent actions to be taken in order to (i)
> protect customer traffic and (ii) allow simple diagnostics for operational
> staff.  This is also required so that the operators/customers can define
> consistent and measurable availability SLAs and, as a dependent corollary,
> QoS SLAs (since QoS only has any meaning when the LSP is in the available
> state).  Customer/operator SLAs are increasing in importance and this needs
> to be recognised.
> 
> 2       Arch principles:
> Note - although stated against the MPLS case, these can (and should) be
> generalised to apply to any network...including the GMPLS cases (where
> perhaps they may be more obvious to some).
> 
> 2.1     the data-plane and control-plane must have independent OAM.  Note
> that the control-plane must use some data-plane (not necessarily the same as
> that used for customer traffic) and hence needs both its own data-plane OAM
> and OAM (aka error handling) which is specific to each of the
> control-plane protocols in use....so that would be specific OAM for RSVP-TE,
> CR-LDP, LDP, OSPF, etc.
> 
> 2.2     cross-layer network data-plane OAM dependencies must be avoided.  In
> other words, one should not rely on OAM in technology X to detect/diagnose
> defects in technology Y (whether X is a client or server layer wrt Y).  This
> requirement is also necessary for evolution and backwards compatibility, ie
> to allow layer networks/protocols to be added/removed/modified independently
> and without affecting other layer networks/protocols.
> 
> 2.3     cross-plane OAM dependencies must be avoided.  For example, the
> data-plane OAM should function independent of the signalling/routing
> protocols being used (including the case of no control-plane).  This has
> similar evolution and backwards compatibility issues as in 2.2.
> 
> 2.4     customers must not be expected to act as the default defect
> detection devices.  Nor must customer traffic activity be used as part of
> the defect detection function (since users can be quiescent).  Note that
> defect detection and defect diagnosis are 2 distinct functions.  This also
> implies that defect detection needs to be a continuous function (at least on
> important LSPs).
> BTW - IMO QoS measurements are 2nd order importance compared to defect
> handling.  QoS should be tackled as a network design and TE issue, with
> suitable verification via population or ad hoc sampling as needed, and not
> overkill measurements which can be both costly and swamp operations people.
> 
> 2.5     OAM mechanisms should be designed to correctly function under
> failure conditions, ie OAM mechanisms must not rely on 'well-behaved'
> network behaviour after a defect has occurred since, by definition, the fact
> that the network has failed implies its behaviour cannot be expected to be
> predictable.
> 
> 2.6     A failure in a server layer network should not create alarm storms
> in client layer networks.  Noting that client layer networks may be owned by
> different organisations and have a large geographical dispersion.
> 
> 2.7     data-plane LSP OAM should not require a return LSP to
> function....and even with bi-directional LSPs each direction should be
> monitored independently.
> 
> 2.8     failures of the control-plane (either the data-plane aspect or any
> one of the control-plane protocols that run on this) should not result in
> failures of the customer traffic data-plane where such trails are permanent
> or semi-permanent (this is more obvious in the GMPLS case).
> 
> The Ping draft either does not address or fails to meet many of these
> requirements.  More specifically:
> -       it does not identify/specify any defects;
> -       it cannot detect any defects unless it is running continuously;
> -       it could not detect or diagnose all potential defects even if
> running continuously, eg mismerging defects;
> -       it relies on both cross-layer (ie ICMP is a IP layer OAM mechanism)
> and cross-plane (ie it is only defined for the RSVP-TE case) functional
> dependencies.  Although this expediency may have some near-term benefit(?)
> it is not good arch practice.
> -       since defects are not detected/defined it is unclear (i) what
> consequent actions are to be taken to protect customer traffic (none it
> seems), (ii) how this will help operators target diagnostic activities or
> (iii) how it will prevent client layer alarm storms.
> -       it's all very well having variable behaviour (wrt timers) but this
> will create complexities elsewhere, ie inconsistent entry/exit criteria for
> defects and  availability states, no clear/consistent datums for when QoS
> metrics collection (if being measured) are valid, ie when to start/stop QoS
> metric aggregation.  So this means complexities in SLA derivations.
> 
> So, whilst I would certainly not want to stop any vendor implementing this
> draft, as an operator with responsibilities for the integrity of my
> customers' traffic and a need to define SLAs for such, I find it
> professionally very difficult to condone it becoming a 'standard'.
> 
> >
> > > Secondly, can we have some
> > > input/justification with regards to the assumption that there is a
> > > need to detect such a failure since only cause for such a failure
> > > given in the draft is "memory Corruption". What causes so called
> > > memory corruption? Are we trying to repair badly written code here?
> > >
> >
> > The problem did happen in a real network. I'm not going to point
> > fingers at the cause of the problem. But we (vendors and providers) did
> > realize from that incident that we need to have a simple mechanism to
> > check LSP's data plane in today's operational networks.
> NH=> Some time ago I was hearing comments such as 'these defects won't
> happen' and 'don't fix software bugs with additional software' (an erroneous
> argument anyway).  So having now established that defects do exist, I fully
> agree we need some way of detecting/diagnosing such defects.  But this seems
> not what is being advocated here, since the mechanism does not help
> operators to detect the defects in the first place.  To address the
> detection aspect some mechanism has to run continuously, and a technique
> that is unidirectional would seem to be simpler/better than one that
> requires a return path.  At some point one needs to raise an alarm and take
> other consequent actions (eg maybe even squelching customer traffic if
> integrity/security could be being compromised, and stop any QoS metric
> aggregation (if being used)) and raise a vertical indication to the NMS...so
> what better place to do this than at the point where the defect is being
> initially detected, ie the LSP sink?
> 
> Given all this, could those advocating the Ping draft please explain what is
> actually wrong with the more comprehensive solution we originally proposed
> (in draft-harrison-mpls-oam-00.txt).....and in particular, the use of the CV
> flow.....which is really nothing more than a keepalive running from source to
> sink with a unique source identifier and deterministic behaviour....surely
> it can't be much simpler than this can it?
> >
> > The whole purpose of the various proposals (fast reroute, graceful
> > restart and LSP-ping) is to make the whole network work nicely, even
> > when we have routers from other vendors that have badly written code.
> >
> > - Ping
> >
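
For readers following the thread, the CV flow described in the quoted message (a keepalive from source to sink carrying a unique source identifier, with defined defect entry and exit criteria) might be sketched roughly as follows. The class name `CvSink`, the one-second period, the three-interval entry threshold, and the way a mismatch is handled are all hypothetical choices made for illustration; none of them is taken from draft-harrison-mpls-oam-00.txt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CvSink:
    """Sink-side monitor for a unidirectional keepalive (CV) flow on one LSP."""
    expected_source: str           # identifier of the LSP's legitimate source
    interval_s: float = 1.0        # assumed CV period
    miss_limit: int = 3            # assumed entry criterion: 3 missed intervals
    last_seen_s: float = 0.0
    defect: Optional[str] = None   # None, "loss", or "mismatch"

    def on_cv(self, source_id: str, now_s: float) -> None:
        """Handle one received CV packet."""
        if source_id != self.expected_source:
            self.defect = "mismatch"   # e.g. traffic from a mismerged LSP
            return
        self.last_seen_s = now_s
        self.defect = None             # assumed exit criterion: a valid CV arrives

    def on_tick(self, now_s: float) -> None:
        """Periodic check; declares a loss defect when CVs stop arriving."""
        silent_for = now_s - self.last_seen_s
        if self.defect is None and silent_for > self.miss_limit * self.interval_s:
            self.defect = "loss"


sink = CvSink(expected_source="lsr-a")
sink.on_cv("lsr-a", now_s=0.0)
sink.on_tick(now_s=3.5)
print(sink.defect)
```

In this sketch detection happens entirely at the sink, needs no return path, and does not depend on customer traffic being present, which is the property the quoted principles 2.4 and 2.7 ask for.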