The MPLS WG Archive: [Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
Hi Neil,

A few follow-up questions ...

1.

> The node in the layer network, which first detects a defect (sourced
> from within that layer), should apply a well-known 'Forward Defect

> (1) There should be a complementary Backward Defect Indication

If FDIs & BDIs are optional, I think you should s/should/may/. On the
other hand, I think those messages are costly in CPU but very useful
(especially in the diagnostic case). Making them optional really changes
the understanding of your draft.

2. Reg item 5.7 ...

> It is therefore required that a unique trail source identifier be
> periodically transmitted from the trail source to the trail sink to
> detect these types of defect.

> But, why should you always want to tell the head-end? If you want to
> invoke prot-sw in a non 1+1 case from the head end then yes you would
> want some return information, but the absolute minimum requirement is
> (i) to detect all failures at the sink and (ii) to take appropriate
> consequent actions at that point, eg raise a failure indication at the
> sink point, stop QoS aggregation, etc.

Here we seem to disagree quite a bit. IMHO it is the source, not the
sink, that needs to know about a given LSP failure in order to take
corrective action. Raising a failure indication at the sink and counting
on either an offline tool action or, even worse, human action to fix it
is way too slow. If the LSP is broken, the source should immediately
either shut down the given LSP or switch the traffic to another path (if
path protection is applied).

> Further, the Ping draft only seems to address simple breaks in
> connectivity (though it does not even detect these, which is a primary
> requirement, and appears to rely on customers to complain...which I
> would argue is unacceptable).

I will either let Ping address this or we can talk live ....

Rgs,
Robert

PS. It would probably be easier to get together next week and discuss
this face to face. I will send an offline mail reg this.

> neil.2.harrison@bt.com wrote:
>
> Hello Robert...please see comments below:
> Regards, Neil
>
> > Hi Neil,
> >
> > > Given all this, could those advocating the Ping draft please
> > > explain what is actually wrong with the more comprehensive
> > > solution we originally proposed
> > > (in draft-harrison-mpls-oam-00.txt).....
> >
> > Let me try ...
> >
> > First let me say that your draft seems to be architecturally perfect.
> NH=> It is not perfect, but it's the closest and simplest scheme I have
> yet seen. And it is the only one that meets most of our requirements as
> an operator....you seem to have overlooked responding to these, or at
> least acknowledging that the Ping draft fails to satisfy them, ie it's
> no good having a 'simple' (though I disagree the Ping draft is simpler
> than ours) solution if it's not fit for purpose, is it?
>
> > Unfortunately in practice it may meet the following obstacles:
> >
> > * It is very CPU intensive. It requires, for each LSP, continuous
> >   processing of incoming CV packets in the transit nodes of such an
> >   LSP. Imagine the case where you have 15000 LSPs in a transit node
> >   and each sends you a CV packet periodically. (Ping's draft does not
> >   require any CPU work per transit LSP.)
> NH=> I don't think you understand how this works. The whole point of it
> is *not* to put any load on transit LSRs.......these simply pass the OAM
> CV payload transparently (forwarding is on the lower 'normal traffic'
> LSP header since for OAM pkts the label stack becomes 2 deep instead of
> 1). The same would apply to any other OAM pkt types if employed, ie they
> would only ever invoke processing at source and sink points. If you were
> doing some per node loopback (like some route-trace say) then I agree
> this could be CPU intensive since every node would have to check every
> LSP pkt to see if it was an OAM route-trace pkt or not.....but our scheme
> was designed from the outset to avoid this on purpose and hence be
> scalable.
> The only points that have anything to do with CV processing are the
> source and sink. And further, there is a big difference between
> generating the CV at the LSP source (which is trivial since it is always
> invariant) and processing it at the LSP sink.....indeed, on
> non-important LSPs one would not have to process it if you did not want
> (eg to generate availability statistics for such LSPs). The only
> downside to not checking the CV at all LSP sink points is that we might
> miss leaking traffic out of an important LSP onto a non-important
> LSP.....clearly we can detect leaking the other way round (due to seeing
> an unexpected CV flow arriving).
>
> > * In the failure detection situation I don't see how it guarantees
> >   the successful propagation of the notification to the head. I can
> >   easily think of cases when your returning FDI or BDI signals get
> >   lost due to some purely forwarding bugs. (Ping's draft uses a
> >   control channel for return which is working correctly.)
> NH=> FDI and BDI are 'nice to have' features but are not essential in a
> single domain (they do however, become increasingly important when/if we
> ever get round to inter-domain interworking or client/server services,
> eg transit LSP service).....but CV is mandatory since this is the main
> defect detection mechanism. FDI is there to stop alarm storms in higher
> layers and BDI is there if you want to tell the source of outgoing
> defect, ie defect in other direction. But, why should you always want to
> tell the head-end? If you want to invoke prot-sw in a non 1+1 case from
> the head end then yes you would want some return information, but the
> absolute minimum requirement is (i) to detect all failures at the sink
> and (ii) to take appropriate consequent actions at that point, eg raise
> a failure indication at the sink point, stop QoS aggregation, etc. So
> there has to be some form of processing at the sink point in any case.
> Note that the draft does not imply that BDI (if required) be always sent
> on a return LSP....it can go OOB in the management-plane (or even the
> control-plane). Note also that, assuming bi-directional LSPs, we can
> have unidirectional and bi-directional failures (this also has
> implications for QoS processing, if being used, which have to be done on
> a per direction basis). In the former case the CV flow of only the
> broken direction is interrupted (so here one could see incoming CV and
> BDI) whilst in the latter case the CV flow of both directions is
> missing.......and further, if the failure is from a lower layer then:
> - this would appear as incoming AIS if below MPLS fabric (eg SDH) at
>   1st downstream LSR past failure, or
> - this would appear as incoming FDI (+ loss of CV) if lower LSP
>   failure and FDI being employed.
> So, if full OAM set being used one gets comprehensive and simple
> operational diagnostics for our operational staff.
> Note - there are further subtleties in FDI and BDI (if used) to aid
> failure location and failure type that I have not discussed here...again
> they have operational benefits in an inter-domain case.
>
> > * It requires 100% of software upgrade in all routers in a given
> >   network (I am sure you realize this is quite a challenge in large
> >   networks).
> NH=> I agree some changes are needed....but this is true of any new
> functions. The proposals are also backwards compatible. And note that
> the CV processing is constrained to source/sink points as I noted above,
> so it does not necessarily imply wholesale fork-lift upgrade of all
> nodes at once. However, I see MPLS as such a potentially important
> technology (for a whole raft of reasons) that I want to see things like
> defect handling addressed properly from the outset.....because if we
> don't, it is even harder to get things done right once some inferior
> solution is adopted just to suit some expediency.
> Indeed, this is the whole purpose of (i) agreeing the problem statement
> and (ii) setting arch principles that must or should be met, ie before
> someone decides that solution X is what is needed there should be
> general agreement across the operator community that solution X is fit
> for their purpose.
>
> > * It requires simultaneous support from all vendors on a given
> >   network. (Is this realistic :) ?
> NH=> No it does not...it can be introduced gradually as I noted above.
>
> > * With all of the complications it does not seem to guarantee the
> >   detection of LDP LSP failures - only TE LSP failures. (Sure, Ping
> >   does not address this either yet, but is much simpler.)
> NH=> I disagree Ping is simpler.....you can't get simpler than the CV
> flow (it is only a 'hello' by another name). We have been looking at
> extending the support to LDP and have some ideas here. Further, the Ping
> draft only seems to address simple breaks in connectivity (though it
> does not even detect these, which is a primary requirement, and appears
> to rely on customers to complain...which I would argue is unacceptable).
> Not only does CV detect below/within MPLS fabric simple breaks, it also
> detects all mismerging cases and so offers both greater functionality
> and protection of customer traffic integrity.
>
> > * I don't see in Ping's draft any requirement for end user = client
> >   action - to detect the LSP failure - since you have brought up this
> >   argument a lot, could you elaborate a bit on where in his draft you
> >   see such a need?
> NH=> Why not ask Ping how defects are detected in the 1st place? I asked
> this in my original mail and so did Brijesh Kumar. I am awaiting a
> response. All Ping said in his response to Brijesh was:
> "> It's network operators suspect something may have gone wrong with some
> > LSP's...." So I too want to understand how operators are supposed to
> 'suspect' something is wrong with an LSP.
> >
> > Rgs,
> > Robert
> >
> > neil.2.harrison@bt.com wrote:
> > >
> > > Ping wrote:
> > > > Sent: 26 July 2001 01:22
> > > > To: Brijesh Kumar
> > > > Cc: mpls-list
> > > > Subject: Re: [Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
> > > >
> > > > Brijesh Kumar wrote:
> > > > >
> > > > > From this draft:
> > > > >
> > > > >    When the ingress LSR suspects that the LSP may have failed
> > > > >    and the RSVP control plane shows the LSP as operational, the
> > > > >    ingress LSR MUST send LSP-ping messages to the egress over
> > > > >    the LSP, periodically. The value of the time interval should
> > > > >    be configurable.
> > > > >
> > > > > Could you please tell me what do you mean by "LSR suspects"? In
> > > > > other words, what is the triggering input for LSR to start
> > > > > "suspecting" the LSP may have failed?
> > > > >
> > > > > It will be better if you replace "suspect" with "detect" in
> > > > > your draft as suspect is an imprecise human emotion and no
> > > > > operator would like routers that have negative emotions ;-).
> > > > >
> > > >
> > > > It's network operators suspect something may have gone wrong
> > > > with some LSP's....
> > > NH=> How do they 'suspect' this?....is some crystal-ball going to
> > > be supplied with the implementation? It is not acceptable to have
> > > to rely on customers to 1st complain. So what is the detection
> > > mechanism for the defects, you need to explain this more
> > > precisely...see also later comments.
> > >
> > > > > In addition, I do have issues with introducing a mechanism
> > > > > dependent on a particular signalling protocol for a generic
> > > > > problem - testing liveness of the label switched data path.
> > > >
> > > > OK. Do you have problem with the mechanism itself?
> > > NH=> I have some problems with it.
> > > Coming back from 2 weeks' leave and trying to wade through the large
> > > volume of discussion this draft has generated tells me there is a
> > > fundamental lack of clarity of the problem being solved here. We
> > > really should get the problem statement agreed and also agree a set
> > > of OAM arch principles to prevent bad practice. In summary here is
> > > what I propose...and I would appreciate some positive feedback on
> > > each of these statements if there is disagreement:
> > >
> > > 1 Problem statement:
> > >
> > > Operators need to determine the operational status of LSPs. They
> > > need simple management tools to do this and operational costs must
> > > be driven down (ie we can't afford an army of CCIEs to run
> > > networks). As a minimum, this requires all defects to be identified
> > > and specified in terms of entry/exit criteria and the appropriate
> > > consequent actions to be taken in order to (i) protect customer
> > > traffic and (ii) allow simple diagnostics for operational staff.
> > > This is also required so that the operators/customers can define
> > > consistent and measurable availability SLAs and, as a dependent
> > > corollary, QoS SLAs (since QoS only has any meaning when the LSP is
> > > in the available state). Customer/operator SLAs are increasing in
> > > importance and this needs to be recognised.
> > >
> > > 2 Arch principles:
> > > Note - although stated against the MPLS case, these can (and should)
> > > be generalised to apply to any network...including the GMPLS cases
> > > (where perhaps they may be more obvious to some).
> > >
> > > 2.1 the data-plane and control-plane must have independent OAM.
> > > Note that the control-plane must use some data-plane (not
> > > necessarily the same as that used for customer traffic) and hence
> > > needs both its own data-plane OAM and OAM (aka as error handling)
> > > which is specific to each of the control-plane protocols in
> > > use....so that would be specific OAM for RSVP-TE, CR-LDP, LDP, OSPF,
> > > etc.
> > >
> > > 2.2 cross-layer network data-plane OAM dependencies must be avoided.
> > > In other words, one should not rely on OAM in technology X to
> > > detect/diagnose defects in technology Y (whether X is a client or
> > > server layer wrt Y). This requirement is also necessary for
> > > evolution and backwards compatibility, ie to allow layer
> > > networks/protocols to be added/removed/modified independently and
> > > without affecting other layer networks/protocols.
> > >
> > > 2.3 cross-plane OAM dependencies must be avoided. For example, the
> > > data-plane OAM should function independent of the signalling/routing
> > > protocols being used (including the case of no control-plane). This
> > > has similar evolution and backwards compatibility issues as in 2.2.
> > >
> > > 2.4 customers must not be expected to act as the default defect
> > > detection devices. Nor must customer traffic activity be used as
> > > part of the defect detection function (since users can be
> > > quiescent). Note that defect detection and defect diagnosis are 2
> > > distinct functions. This also implies that defect detection needs to
> > > be a continuous function (at least on important LSPs).
> > > BTW - IMO QoS measurements are 2nd order importance compared to
> > > defect handling. QoS should be tackled as a network design and TE
> > > issue, with suitable verification via population or ad hoc sampling
> > > as needed, and not overkill measurements which can be both costly
> > > and swamp operations people.
> > >
> > > 2.5 OAM mechanisms should be designed to correctly function under
> > > failure conditions, ie OAM mechanisms must not rely on
> > > 'well-behaved' network behaviour after a defect has occurred since,
> > > by definition, the fact that the network has failed implies its
> > > behaviour cannot be expected to be predictable.
> > >
> > > 2.6 A failure in a server layer network should not create alarm
> > > storms in client layer networks. Noting that client layer networks
> > > may be owned by different organisations and have a large
> > > geographical dispersion.
> > >
> > > 2.7 data-plane LSP OAM should not require a return LSP to
> > > function....and even with bi-directional LSPs each direction should
> > > be monitored independently.
> > >
> > > 2.8 failures of the control-plane (either the data-plane aspect or
> > > any one of the control-plane protocols that run on this) should not
> > > result in failures of the customer traffic data-plane where such
> > > trails are permanent or semi-permanent (this is more obvious in the
> > > GMPLS case).
> > >
> > > The Ping draft either does not address or fails to meet many of
> > > these requirements. More specifically:
> > > - it does not identify/specify any defects;
> > > - it cannot detect any defects unless it is running continuously;
> > > - it could not detect or diagnose all potential defects even if
> > >   running continuously, eg mismerging defects;
> > > - it relies on both cross-layer (ie ICMP is an IP layer OAM
> > >   mechanism) and cross-plane (ie it is only defined for the RSVP-TE
> > >   case) functional dependencies. Although this expediency may have
> > >   some near-term benefit(?) it is not good arch practice.
> > > - since defects are not detected/defined it is unclear (i) what
> > >   consequent actions are to be taken to protect customer traffic
> > >   (none it seems), (ii) how this will help operators target
> > >   diagnostic activities or (iii) how it will prevent client layer
> > >   alarm storms.
> > > - it's all very well having variable behaviour (wrt timers) but this
> > >   will create complexities elsewhere, ie inconsistent entry/exit
> > >   criteria for defects and availability states, no clear/consistent
> > >   datums for when QoS metrics collection (if being measured) are
> > >   valid, ie when to start/stop QoS metric aggregation. So this means
> > >   complexities in SLA derivations.
> > >
> > > So, whilst I would certainly not want to stop any vendor
> > > implementing this draft, as an operator with responsibilities for
> > > the integrity of my customers' traffic and a need to define SLAs for
> > > such, I find it professionally very difficult to condone it becoming
> > > a 'standard'.
> > >
> > > > > Secondly, can we have some input/justification with regards to
> > > > > the assumption that there is a need to detect such a failure
> > > > > since only cause for such a failure given in the draft is
> > > > > "memory Corruption". What causes so called memory corruption?
> > > > > Are we trying to repair badly written code here?
> > > > >
> > > >
> > > > The problem did happen in the real network. I'm not going to
> > > > point fingers on the cause of the problem. But we (vendors and
> > > > providers) did realize from that incident that we need to have a
> > > > simple mechanism to check LSP's data plane in today's operational
> > > > networks.
> > > NH=> Some time ago I was hearing comments such as 'these defects
> > > won't happen' and 'don't fix software bugs with additional software'
> > > (an erroneous argument anyway).
> > > So having now established that defects do exist, I fully agree we
> > > need some way of detecting/diagnosing such defects. But this seems
> > > not what is being advocated here, since the mechanism does not help
> > > operators to detect the defects in the first place. To address the
> > > detection aspect some mechanism has to run continuously, and a
> > > technique that is unidirectional would seem to be simpler/better
> > > than one that requires a return path. At some point one needs to
> > > raise an alarm and take other consequent actions (eg maybe even
> > > squelching customer traffic if integrity/security could be being
> > > compromised, and stop any QoS metric aggregation (if being used))
> > > and raise a vertical indication to the NMS...so what better place to
> > > do this than at the point where the defect is being initially
> > > detected, ie the LSP sink?
> > >
> > > Given all this, could those advocating the Ping draft please explain
> > > what is actually wrong with the more comprehensive solution we
> > > originally proposed (in draft-harrison-mpls-oam-00.txt).....and in
> > > particular, the use of the CV flow.....which is really nothing more
> > > than a keepalive running from source to sink with a unique source
> > > identifier and deterministic behaviour....surely it can't be much
> > > simpler than this can it?
> > >
> > > > The whole purpose of various proposals (fast reroute, graceful
> > > > restart and LSP-ping) is to make the whole network work nicely,
> > > > even when we have routers from other vendors that have badly
> > > > written code.
> > > >
> > > > - Ping
> > >
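
To make the CV idea discussed in this thread concrete (a periodic keepalive from source to sink carrying a unique source identifier, checked only at the sink), here is a minimal Python sketch of sink-side checking. It is not taken from draft-harrison-mpls-oam-00.txt: the class name, the state names, the 1 second period and the 3-period threshold are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CvSinkMonitor:
    """Hypothetical sink-side check of the CV flow for a single LSP."""

    expected_source_id: str      # unique trail source identifier (assumed format)
    period_s: float = 1.0        # assumed CV transmit interval
    loss_multiplier: int = 3     # assumed: defect after 3 periods without a CV
    last_expected_rx: Optional[float] = None
    last_unexpected_rx: Optional[float] = None

    def on_cv(self, source_id: str, now: float) -> None:
        """Record one received CV packet; only the sink ever runs this."""
        if source_id == self.expected_source_id:
            self.last_expected_rx = now
        else:
            # A CV from some other source means traffic is leaking into this LSP.
            self.last_unexpected_rx = now

    def state(self, now: float) -> str:
        """Return 'MISMERGE', 'LOSS_OF_CV' or 'OK' for the current time."""
        window = self.period_s * self.loss_multiplier
        if (self.last_unexpected_rx is not None
                and now - self.last_unexpected_rx <= window):
            return "MISMERGE"
        if (self.last_expected_rx is None
                or now - self.last_expected_rx > window):
            # Also reported before the first expected CV has ever arrived.
            return "LOSS_OF_CV"
        return "OK"


monitor = CvSinkMonitor(expected_source_id="lsr-a/lsp-7")
monitor.on_cv("lsr-a/lsp-7", now=0.0)
print(monitor.state(now=2.0))   # expected CV seen recently
print(monitor.state(now=5.0))   # CV flow interrupted
monitor.on_cv("lsr-b/lsp-9", now=6.0)
print(monitor.state(now=6.5))   # unexpected source identifier seen
```

Transit LSRs do nothing in this sketch, which matches the claim above that CV processing is confined to the source and sink. Whether the sink then raises an alarm locally or also signals the head-end (the point of disagreement in this thread) is a separate consequent action that the sketch leaves open.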