The MPLS WG Archive
[Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
Ping wrote:
> Sent: 26 July 2001 01:22
> To: Brijesh Kumar
> Cc: mpls-list
> Subject: Re: [Fwd: I-D ACTION:draft-pan-lsp-ping-01.txt]
>
> Brijesh Kumar wrote:
> >
> > From this draft:
> >
> >    When the ingress LSR suspects that the LSP may have failed and the
> >    RSVP control plane shows the LSP as operational, the ingress LSR
> >    MUST send LSP-ping messages to the egress over the LSP, periodically.
> >    The value of the time interval should be configurable.
> >
> > Could you please tell me what do you mean by "LSR suspects"? In other
> > words, what is the triggering input for LSR to start "suspecting" the
> > LSP may have failed?
> >
> > It will be better if you replace "suspect" with "detect" in your draft
> > as suspect is an imprecise human emotion and no operator would like
> > routers that have negative emotions ;-).
>
> It's network operators suspect something may have gone wrong with some
> LSP's....

NH=> How do they 'suspect' this?....is some crystal-ball going to be supplied with the implementation? It is not acceptable to have to rely on customers to 1st complain. So what is the detection mechanism for the defects, you need to explain this more precisely...see also later comments.

> > In addition, I do have issues with introducing a mechanism dependent
> > on a particular signalling protocol for a generic problem - testing
> > liveness of the label switched data path.
>
> OK. Do you have problem with the mechanism itself?

NH=> I have some problems with it. Coming back from 2 weeks leave and trying to wade through the large volume of discussion this draft has generated tells me there is a fundamental lack of clarity of the problem being solved here. We really should get the problem statement agreed and also agree a set of OAM arch principles to prevent bad practice.
In summary here is what I propose...and I would appreciate some positive feedback on each of these statements if there is disagreement:

1 Problem statement:

Operators need to determine the operational status of LSPs. They need simple management tools to do this and operational costs must be driven down (ie we can't afford an army of CCIEs to run networks). As a minimum, this requires all defects to be identified and specified in terms of entry/exit criteria and the appropriate consequent actions to be taken in order to (i) protect customer traffic and (ii) allow simple diagnostics for operational staff. This is also required so that the operators/customers can define consistent and measurable availability SLAs and, as a dependent corollary, QoS SLAs (since QoS only has any meaning when the LSP is in the available state). Customer/operator SLAs are increasing in importance and this needs to be recognised.

2 Arch principles:

Note - although stated against the MPLS case, these can (and should) be generalised to apply to any network...including the GMPLS cases (where perhaps they may be more obvious to some).

2.1 The data-plane and control-plane must have independent OAM. Note that the control-plane must use some data-plane (not necessarily the same as that used for customer traffic) and hence needs both its own data-plane OAM and OAM (aka error handling) which is specific to each of the control-plane protocols in use....so that would be specific OAM for RSVP-TE, CR-LDP, LDP, OSPF, etc.

2.2 Cross-layer network data-plane OAM dependencies must be avoided. In other words, one should not rely on OAM in technology X to detect/diagnose defects in technology Y (whether X is a client or server layer wrt Y). This requirement is also necessary for evolution and backwards compatibility, ie to allow layer networks/protocols to be added/removed/modified independently and without affecting other layer networks/protocols.

2.3 Cross-plane OAM dependencies must be avoided.
For example, the data-plane OAM should function independently of the signalling/routing protocols being used (including the case of no control-plane). This has similar evolution and backwards compatibility issues as in 2.2.

2.4 Customers must not be expected to act as the default defect detection devices. Nor must customer traffic activity be used as part of the defect detection function (since users can be quiescent). Note that defect detection and defect diagnosis are 2 distinct functions. This also implies that defect detection needs to be a continuous function (at least on important LSPs). BTW - IMO QoS measurements are of 2nd order importance compared to defect handling. QoS should be tackled as a network design and TE issue, with suitable verification via population or ad hoc sampling as needed, and not overkill measurements which can be both costly and swamp operations people.

2.5 OAM mechanisms should be designed to function correctly under failure conditions, ie OAM mechanisms must not rely on 'well-behaved' network behaviour after a defect has occurred since, by definition, the fact that the network has failed implies its behaviour cannot be expected to be predictable.

2.6 A failure in a server layer network should not create alarm storms in client layer networks, noting that client layer networks may be owned by different organisations and have a large geographical dispersion.

2.7 Data-plane LSP OAM should not require a return LSP to function....and even with bi-directional LSPs each direction should be monitored independently.

2.8 Failures of the control-plane (either the data-plane aspect or any one of the control-plane protocols that run on this) should not result in failures of the customer traffic data-plane where such trails are permanent or semi-permanent (this is more obvious in the GMPLS case).

The Ping draft either does not address or fails to meet many of these requirements.
More specifically:

- it does not identify/specify any defects;
- it cannot detect any defects unless it is running continuously;
- it could not detect or diagnose all potential defects even if running continuously, eg mismerging defects;
- it relies on both cross-layer (ie ICMP is an IP layer OAM mechanism) and cross-plane (ie it is only defined for the RSVP-TE case) functional dependencies. Although this expediency may have some near-term benefit(?) it is not good arch practice.
- since defects are not detected/defined it is unclear (i) what consequent actions are to be taken to protect customer traffic (none it seems), (ii) how this will help operators target diagnostic activities or (iii) how it will prevent client layer alarm storms.
- it's all very well having variable behaviour (wrt timers) but this will create complexities elsewhere, ie inconsistent entry/exit criteria for defects and availability states, and no clear/consistent datums for when QoS metrics collection (if being measured) is valid, ie when to start/stop QoS metric aggregation. So this means complexities in SLA derivations.

So, whilst I would certainly not want to stop any vendor implementing this draft, as an operator with responsibilities for the integrity of my customers' traffic and a need to define SLAs for such, I find it professionally very difficult to condone it becoming a 'standard'.

> > Secondly, can we have some input/justification with regards to the
> > assumption that there is a need to detect such a failure since only
> > cause for such a failure given in the draft is "memory Corruption".
> > What causes so called memory corruption? Are we trying to repair
> > badly written code here?
>
> The problem did happen in the real network. I'm not going to point
> fingers on the cause of the problem. But we (vendors and providers) did
> realize from that incidence that we need to have a simple mechanism to
> check LSP's data plane in today's operational networks.
NH=> Some time ago I was hearing comments such as 'these defects won't happen' and 'don't fix software bugs with additional software' (an erroneous argument anyway). So having now established that defects do exist, I fully agree we need some way of detecting/diagnosing such defects. But this seems not to be what is being advocated here, since the mechanism does not help operators to detect the defects in the first place. To address the detection aspect some mechanism has to run continuously, and a technique that is unidirectional would seem to be simpler/better than one that requires a return path. At some point one needs to raise an alarm and take other consequent actions (eg maybe even squelching customer traffic if integrity/security could be being compromised, and stopping any QoS metric aggregation (if being used)) and raise a vertical indication to the NMS...so what better place to do this than at the point where the defect is being initially detected, ie the LSP sink?

Given all this, could those advocating the Ping draft please explain what is actually wrong with the more comprehensive solution we originally proposed (in draft-harrison-mpls-oam-00.txt).....and in particular, the use of the CV flow.....which is really nothing more than a keepalive running from source to sink with a unique source identifier and deterministic behaviour....surely it can't be much simpler than this can it?

> The whole purpose on various proposals (fast reroute, graceful restart
> and LSP-ping) is to make the whole network working nicely, even when we
> have routers from other vendors that have badly written code.
>
> - Ping
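
As a rough illustration of the sink-side detection argued for above (a keepalive from source to sink carrying a unique source identifier, with fixed entry/exit criteria for each defect), here is a minimal Python sketch. It is not taken from either draft: the class name, the defect names "LOSS" and "MISMERGE", the identifier "lsr-a" and the 3-period thresholds are all assumptions chosen only for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CvSinkMonitor:
    """Hypothetical sink-side monitor for a unidirectional keepalive flow."""
    expected_source: str          # unique source identifier the sink expects
    entry_periods: int = 3        # assumed: consecutive bad periods to enter a defect
    exit_periods: int = 3         # assumed: consecutive good periods to exit a defect
    defect: Optional[str] = None  # None, "LOSS" or "MISMERGE"
    _bad: int = 0
    _good: int = 0

    def end_of_period(self, sources_seen: List[str]) -> Optional[str]:
        """Evaluate one fixed-length period from the source IDs received in it."""
        unexpected = any(s != self.expected_source for s in sources_seen)
        missing = self.expected_source not in sources_seen
        if unexpected or missing:
            self._bad += 1
            self._good = 0
            if self._bad >= self.entry_periods:
                # Classify by what the current period shows.
                self.defect = "MISMERGE" if unexpected else "LOSS"
        else:
            self._good += 1
            self._bad = 0
            if self.defect and self._good >= self.exit_periods:
                self.defect = None
        return self.defect


# One good period, three silent periods, then three good periods.
mon = CvSinkMonitor(expected_source="lsr-a")
periods = [["lsr-a"], [], [], [], ["lsr-a"], ["lsr-a"], ["lsr-a"]]
states = [mon.end_of_period(p) for p in periods]
print(states)  # -> [None, None, None, 'LOSS', 'LOSS', 'LOSS', None]
```

Because the thresholds are fixed rather than operator-tunable, the points at which the defect is entered and exited are the same on every LSP, which is the property needed for consistent availability and QoS measurement windows. No return path is involved: the sink decides on its own from what it does or does not receive.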