The MPLS WG Archive


[Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]

  • From: Shahram Davari <Shahram_Davari@pmc-sierra.com>
  • Date: Tue, 7 Aug 2001 03:33:51 -0700
  • Cc: pingpan@juniper.net, Brijesh@coronanetworks.com, mpls@UU.NET

Hi Robert,

Since I got this message first, let me answer some of your concerns. I am sure Neil has his own response too.

See below:

> -----Original Message-----
> From: Robert Raszuk [mailto:raszuk@cisco.com]
> Sent: Saturday, August 04, 2001 5:25 PM
> To: neil.2.harrison@bt.com
> Cc: pingpan@juniper.net; Brijesh@coronanetworks.com; mpls@UU.NET
> Subject: Re: [Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
> 
> 
> Hi Neil,
> 
> > Given all this, could those advocating the Ping draft please explain what is
> > actually wrong with the more comprehensive solution we originally proposed
> > (in draft-harrison-mpls-oam-00.txt).....
> 
> Let me try ...
> 
> First let me say that your draft seems to be architecturally perfect.
> Unfortunately in practice it may meet the following obstacles:
> 
> * It is very CPU intensive. It requires, for each LSP, continuous
> processing of incoming CV packets in the transit nodes of such an LSP.
> Imagine the case that you have 15000 LSPs in a transit node and each
> sends you a CV packet periodically. (Ping's draft does not require any
> CPU work per transit LSP).

1) I think you probably have not read our draft carefully. CV packets are only revealed at the egress LSR of the LSP being tested. Therefore only the ingress and egress LSRs of the LSP under test need to process the CV (or MPLS OAM) packets. The intermediate (transit) LSRs do not even see the OAM label, and therefore forward the CV packet as a normal MPLS packet.

2) In order to detect that a failure has occurred, and to take corrective action, you MUST continuously run some kind of liveness message. This is unavoidable, unless you wait for a customer to complain that he is not receiving his traffic, which is very slow and helps providers lose customers.
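
To make the continuous liveness check in point 2 concrete, here is a minimal sketch of an egress-side monitor that declares a defect once CV packets stop arriving. The class name, the 1-second interval and the 3-miss threshold are assumptions chosen for illustration; neither draft is being quoted here.

```python
from dataclasses import dataclass


@dataclass
class CvSink:
    """Egress-side liveness monitor for one LSP (hypothetical helper)."""
    interval_s: float = 1.0   # assumed CV transmit period
    miss_threshold: int = 3   # assumed number of missed CVs before a defect
    last_seen_s: float = 0.0
    defect: bool = False

    def on_cv(self, now_s: float) -> None:
        # A CV packet arrived: record the time and clear any defect.
        self.last_seen_s = now_s
        self.defect = False

    def poll(self, now_s: float) -> bool:
        # Declare a defect when no CV has arrived for more than
        # miss_threshold transmit intervals.
        if now_s - self.last_seen_s > self.interval_s * self.miss_threshold:
            self.defect = True
        return self.defect
```

The point of the sketch is that detection needs no return path and no customer report: the sink alone decides, from the absence of keepalives, that the LSP has failed.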

> 
> * In the failure detection situation I don't see how it guarantees
> successful propagation of the notification to the head. I can easily think
> of cases where your returning FDI or BDI signals get lost due to some purely
> forwarding bugs. (Ping's draft uses a control channel for the return, which
> is working correctly).

FDI and BDI are not mandatory. They help you better handle a defect, but you could still detect and process a defect without using FDI/BDI. 

> * It requires a software upgrade of 100% of the routers in a given network
> (I am sure you realize this is quite a challenge in large networks).

As I said at the beginning, it only requires an upgrade of the ingress and egress LSRs, not all LSRs. Ping's proposal requires the same upgrade too.
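
As a rough illustration of why transit LSRs need no upgrade, the sketch below models a label stack in which the OAM label sits beneath the LSP label. The function names, the swap table and the OAM_LABEL value are assumptions for illustration only (the value used by the draft is not stated in this thread), and penultimate-hop popping is ignored.

```python
OAM_LABEL = 14  # placeholder reserved value, chosen for illustration only


def transit_forward(stack, swap_table):
    """Transit LSR: swap the top label only; deeper labels are never inspected."""
    top, *rest = stack
    return [swap_table[top], *rest]


def egress_receive(stack):
    """Egress LSR: pop the LSP label; an OAM label underneath marks a CV packet."""
    _, *rest = stack
    if rest and rest[0] == OAM_LABEL:
        return "oam"
    return "user"


# A CV packet on an LSP whose label is swapped from 100 to 200 at a transit hop.
packet = transit_forward([100, OAM_LABEL], {100: 200})
kind = egress_receive(packet)
```

The transit function never looks past the top label, so a CV packet costs it exactly what a user packet costs; only the egress has to recognise the OAM label.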
 
> 
> * It requires simultaneous support from all vendors on a given network.
> (Is this realistic :) ?

This is true for Ping's proposal too. Ping's draft says: "Before initiating the liveliness test, the user must make sure that both ingress and egress LSR can support the LSR-ping."

> 
> * With all of the complications it does not seem to guarantee the
> detection of LDP LSP failures - only TE LSP failures. (Sure, Ping does
> not address this either yet, but it is much simpler).

We have updated our proposal, and now it can support LDP LSPs too.

> 
> * I don't see in Ping's draft any requirement for end user = client
> action - to detect the LSP failure - since you have brought this
> argument up a lot, could you elaborate a bit on where in his draft you
> see such a need?

Let me explain. LSP-ping is only a diagnostic tool, not a failure detection tool. In other words, after a failure has been detected (by other means), LSP-ping can tell you whether the failure is in the forward LSP or the reverse LSP. The question, then, is which method detects that a failure has occurred in the first place. The answer is not very clear from Ping's draft. As far as I understand, there are probably two possibilities:

1) Wait for the customer to report that he is not receiving his traffic, or
2) Run continuous ICMP ping.

The first method is only acceptable to providers who like losing customers. The second method is essentially the same as the CV flow that we proposed, because:

a) It runs continuously
b) In general it requires a reserved label (IPv4 Explicit NULL)

But it has a major problem: it is a "layer violation", since ICMP is an IP-layer mechanism being used to monitor the MPLS layer.


Hope that helps.

Thanks,
-Shahram

> 
> Rgs,
> Robert
> 
> 
> > neil.2.harrison@bt.com wrote:
> > 
> > Ping wrote:
> > > Sent: 26 July 2001 01:22
> > > To: Brijesh Kumar
> > > Cc: mpls-list
> > > Subject: Re: [Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
> > >
> > >
> > > > Brijesh Kumar wrote:
> > > >
> > > > From this draft:
> > > >
> > > >    When the ingress LSR suspects that the LSP may have failed and the
> > > >    RSVP control plane shows the LSP as operational, the ingress LSR MUST
> > > >    send LSP-ping messages to the egress over the LSP, periodically. The
> > > >    value of the time interval should be configurable.
> > > >
> > > > Could you please tell me what do you mean by "LSR suspects"? In other
> > > > words, what is the triggering input for LSR to start "suspecting" the
> > > > LSP may have failed?
> > > >
> > > > It will be better if you replace "suspect" with "detect" in your draft
> > > > as suspect is an imprecise human emotion and no operator would like
> > > > routers that have negative emotions ;-).
> > > >
> > >
> > > It's network operators suspect something may have gone wrong with some
> > > LSP's....
> > NH=> How do they 'suspect' this?....is some crystal-ball going to be
> > supplied with the implementation?  It is not acceptable to have to rely on
> > customers to 1st complain.  So what is the detection mechanism for the
> > defects, you need to explain this more precisely...see also later comments.
> > >
> > > > In addition, I do have issues with introducing a mechanism dependent
> > > > on a particular signalling protocol for a generic problem - testing
> > > > liveness of the label switched data path.
> > >
> > > OK. Do you have problem with the mechanism itself?
> > NH=> I have some problems with it.
> > Coming back from 2 weeks leave and trying to wade through the large volume
> > of discussion this draft has generated tells me there is a fundamental lack
> > of clarity of the problem being solved here.  We really should get the
> > problem statement agreed and also agree a set of OAM arch principles to
> > prevent bad practice. In summary here is what I propose...and I would
> > appreciate some positive feedback on each of these statements if there is
> > disagreement:
> > 
> > 1       Problem statement:
> > 
> > Operators need to determine the operational status of LSPs.  They need
> > simple management tools to do this and operational costs must be driven down
> > (ie we can't afford an army of CCIEs to run networks).  As a minimum, this
> > requires all defects to be identified and specified in terms of entry/exit
> > criteria and the appropriate consequent actions to be taken in order to (i)
> > protect customer traffic and (ii) allow simple diagnostics for operational
> > staff.  This is also required so that the operators/customers can define
> > consistent and measurable availability SLAs and, as a dependent corollary,
> > QoS SLAs (since QoS only has any meaning when the LSP is in the available
> > state).  Customer/operator SLAs are increasing in importance and this needs
> > to be recognised.
> > 
> > 2       Arch principles:
> > Note - although stated against the MPLS case, these can (and should) be
> > generalised to apply to any network...including the GMPLS cases (where
> > perhaps they may be more obvious to some).
> > 
> > 2.1     the data-plane and control-plane must have independent OAM.  Note
> > that the control-plane must use some data-plane (not necessarily the same as
> > that used for customer traffic) and hence needs both its own data-plane OAM
> > and OAM (aka error handling) which is specific to each of the
> > control-plane protocols in use....so that would be specific OAM for RSVP-TE,
> > CR-LDP, LDP, OSPF, etc.
> > 
> > 2.2     cross-layer network data-plane OAM dependencies must be avoided.  In
> > other words, one should not rely on OAM in technology X to detect/diagnose
> > defects in technology Y (whether X is a client or server layer wrt Y).  This
> > requirement is also necessary for evolution and backwards compatibility, ie
> > to allow layer networks/protocols to be added/removed/modified independently
> > and without affecting other layer networks/protocols.
> > 
> > 2.3     cross-plane OAM dependencies must be avoided.  For example, the
> > data-plane OAM should function independent of the signalling/routing
> > protocols being used (including the case of no control-plane).  This has
> > similar evolution and backwards compatibility issues as in 2.2.
> > 
> > 2.4     customers must not be expected to act as the default defect
> > detection devices.  Nor must customer traffic activity be used as part of
> > the defect detection function (since users can be quiescent).  Note that
> > defect detection and defect diagnosis are 2 distinct functions.  This also
> > implies that defect detection needs to be a continuous function (at least on
> > important LSPs).
> > BTW - IMO QoS measurements are 2nd order importance compared to defect
> > handling.  QoS should be tackled as a network design and TE issue, with
> > suitable verification via population or ad hoc sampling as needed, and not
> > overkill measurements which can be both costly and swamp operations people.
> > 
> > 2.5     OAM mechanisms should be designed to correctly function under
> > failure conditions, ie OAM mechanisms must not rely on 'well-behaved'
> > network behaviour after a defect has occurred since, by definition, the fact
> > that the network has failed implies its behaviour cannot be expected to be
> > predictable.
> > 
> > 2.6     A failure in a server layer network should not create alarm storms
> > in client layer networks.  Noting that client layer networks may be owned by
> > different organisations and have a large geographical dispersion.
> > 
> > 2.7     data-plane LSP OAM should not require a return LSP to
> > function....and even with bi-directional LSPs each direction should be
> > monitored independently.
> > 
> > 2.8     failures of the control-plane (either the data-plane aspect or any
> > one of the control-plane protocols that run on this) should not result in
> > failures of the customer traffic data-plane where such trails are permanent
> > or semi-permanent (this is more obvious in the GMPLS case).
> > 
> > The Ping draft either does not address or fails to meet many of these
> > requirements.  More specifically:
> > -       it does not identify/specify any defects;
> > -       it cannot detect any defects unless it is running continuously;
> > -       it could not detect or diagnose all potential defects even if
> > running continuously, eg mismerging defects;
> > -       it relies on both cross-layer (ie ICMP is an IP layer OAM mechanism)
> > and cross-plane (ie it is only defined for the RSVP-TE case) functional
> > dependencies.  Although this expediency may have some near-term benefit(?)
> > it is not good arch practice.
> > -       since defects are not detected/defined it is unclear (i) what
> > consequent actions are to be taken to protect customer traffic (none it
> > seems), (ii) how this will help operators target diagnostic activities or
> > (iii) how it will prevent client layer alarm storms.
> > -       it's all very well having variable behaviour (wrt timers) but this
> > will create complexities elsewhere, ie inconsistent entry/exit criteria for
> > defects and availability states, no clear/consistent datums for when QoS
> > metrics collection (if being measured) are valid, ie when to start/stop QoS
> > metric aggregation.  So this means complexities in SLA derivations.
> > 
> > So, whilst I would certainly not want to stop any vendor implementing this
> > draft, as an operator with responsibilities for the integrity of my
> > customers' traffic and a need to define SLAs for such, I find it
> > professionally very difficult to condone it becoming a 'standard'.
> > >
> > > > Secondly, can we have some
> > > > input/justification with regards to the assumption that there is a
> > > > need to detect such a failure since only cause for such a failure
> > > > given in the draft is "memory Corruption". What causes so called
> > > > memory corruption? Are we trying to repair badly written code here?
> > > >
> > >
> > > The problem did happen in the real network. I'm not going to point
> > > fingers on the cause of the problem. But we (vendors and providers) did
> > > realize from that incident that we need to have a simple mechanism to
> > > check LSP's data plane in today's operational networks.
> > NH=> Some time ago I was hearing comments such as 'these defects won't
> > happen' and 'don't fix software bugs with additional software' (an erroneous
> > argument anyway).  So having now established that defects do exist, I fully
> > agree we need some way of detecting/diagnosing such defects.  But this seems
> > not what is being advocated here, since the mechanism does not help
> > operators to detect the defects in the first place.  To address the
> > detection aspect some mechanism has to run continuously, and a technique
> > that is unidirectional would seem to be simpler/better than one that
> > requires a return path.  At some point one needs to raise an alarm and take
> > other consequent actions (eg maybe even squelching customer traffic if
> > integrity/security could be being compromised, and stop any QoS metric
> > aggregation (if being used)) and raise a vertical indication to the NMS...so
> > what better place to do this than at the point where the defect is being
> > initially detected, ie the LSP sink?
> > 
> > Given all this, could those advocating the Ping draft please explain what is
> > actually wrong with the more comprehensive solution we originally proposed
> > (in draft-harrison-mpls-oam-00.txt).....and in particular, the use of the CV
> > flow.....which is really nothing more than a keepalive running from source to
> > sink with a unique source identifier and deterministic behaviour....surely
> > it can't be much simpler than this can it?
> > >
> > > The whole purpose on various proposals (fast reroute, graceful restart
> > > and LSP-ping) is to make the whole network working nicely,
> > > even when we have routers from other vendors that have badly written code.
> > >
> > > - Ping
> > >
>