Benchmarking Uptime for Your Business: Methodology and Best Practices

 

Table of Contents:

Introduction: Business Impact of Network Downtime

Causes of Unplanned Downtime of Network Applications

Reliability of the Network and Network Applications

Trading off Reliability and Cost

Conclusion

 

Introduction: Business Impact of Network Downtime

As networked software and computer applications play an ever more significant role in business operations and in employee communications, the old slogan that the "Network is the Computer" has become something of an understatement. In fact, for many organizations, the Ethernet network not only provides the basic Intranet/Internet connectivity for user productivity software, but also is an essential component of the supercomputer cluster, the mainframe, the data storage system, the eCommerce portal and the enterprise telephone system.

Because the network plays an increasingly central business role, and because enterprise software is increasingly delivered as a service, the unavailability of even a portion of the network due to a fault can cause a major disruption of business operations. Disrupted operations inevitably lead to reduced profitability, with the severity of the loss depending on the portions of the network and the particular operations affected by the outage. Some of the potential impacts of network downtime include:

Productivity: Whenever normal workflow patterns are disrupted by network outages, employee productivity measured in person-hours of work output will be compromised. Often the steps taken to compensate for lost productivity, such as overtime, travel, and temporary help, add significantly to the cost of an outage. In an extreme case, where office workers are highly dependent on the network for both email and VoIP communications and for continual access to information from enterprise applications, network unavailability can reduce productivity to essentially zero. In the more general case, the productivity loss will depend on a number of factors including the enterprise IT environment, industry/business type, job responsibility and individual work styles. An individual organization can develop a reasonably accurate estimate of the vulnerability of worker productivity to network downtime by considering workflows for different classes of information workers.
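One way to turn such a per-class review into a number is sketched below. This is only an illustration: the worker classes, headcounts, hourly costs and the "dependence" fraction (the share of output lost while the network is down) are all hypothetical values, not figures from this paper.

```python
from dataclasses import dataclass


@dataclass
class WorkerClass:
    name: str
    headcount: int
    loaded_cost_per_hour: float  # hypothetical fully loaded hourly cost
    dependence: float            # assumed fraction (0..1) of output lost in an outage


def productivity_loss_per_hour(classes):
    """Estimated productivity cost of one hour of network downtime."""
    return sum(c.headcount * c.loaded_cost_per_hour * c.dependence for c in classes)


# Hypothetical workforce, for illustration only.
workforce = [
    WorkerClass("call center", 50, 40.0, 1.0),   # fully network dependent
    WorkerClass("engineering", 100, 80.0, 0.5),
    WorkerClass("field staff", 30, 60.0, 0.2),
]

hourly_loss = productivity_loss_per_hour(workforce)
print(f"Estimated productivity loss per hour of downtime: ${hourly_loss:,.0f}")
```

The estimate is only as good as the dependence fractions, which would have to come from the workflow review described above.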

Revenue: For some organizations the Software as a Service (SaaS) network plays a direct role in producing the goods and services that constitute their sources of revenue. For these enterprises, network downtime translates directly to lost or deferred revenue. For most other organizations the network is an integral part of a number of computerized sales processes, including order entry, eCommerce web sites, preparation and submission of price and delivery quotations, checking inventory, and accessing historical account information. For the latter enterprises, network downtime can lead to pricing errors and orders lost to more responsive competitors.

As with employee productivity, the exposure of revenue to downtime varies significantly among individual companies. Table 1 shows the results of market research conducted by Contingency Planning Research and Dataquest that provides a comparison of the revenue lost per hour of downtime for a variety of business operations.

Table 1: Typical revenue loss per hour of downtime
(Source: Gartner Dataquest, 2007)

Customer Satisfaction/Reputation: Network unavailability can also indirectly affect profitability by reducing responsiveness to customers, suppliers and business partners. In the long run, damaged business relationships can reduce revenue, increase costs and limit the range of business opportunities open to the enterprise.

Financial Performance: Downtime can also have an effect on the timeliness of a number of financial processes, resulting in a negative impact on the bottom line. Downtime can disrupt billing and payment operations, delaying revenue recognition, reducing cash flow and reducing discounts earned through timely payments.

 

Causes of Unplanned Downtime of Network Applications

To minimize downtime, it is necessary to understand all of the possible faults and errors that can lead to unplanned downtime of networked applications. According to research conducted by the Gartner Group, summarized in Figure 1, failures within the network infrastructure contribute less than 20 percent of the downtime experienced by the typical enterprise. Operator errors and application failures account for the remainder, more than 80 percent.

According to this data, initiatives to improve availability and minimize downtime cannot be focused entirely on maximizing the availability of the network and server infrastructures, which are included in the "technology" category shown in Figure 1.

Figure 1: Causes of downtime for network applications (Source: Gartner Dataquest, 2007)

Reduction of operator errors requires a re-examination of the solutions deployed for network and system management with a view toward increasing automation of the interfaces used for configuring and provisioning network devices and servers. In addition to automation, the best way to reduce operator errors is to minimize the overall complexity of the networked computing environment.

Application failures can be minimized by leveraging clustering and virtualization technologies to fully exploit redundant configurations of networked servers. More information on improving automation, reducing complexity and maximizing reliability of networks and networked application servers is provided in the Force10 Networks white paper: Data Center Consolidation and Virtualization.

Blade server technology provides another aspect of physical server consolidation and reduction of complexity. Some of the blade server management systems address the issues of software complexity and operator errors by integrating management of physical servers, virtual servers, network infrastructure and application workload through a single user interface. Further reductions in data center complexity are possible using 10 Gigabit Ethernet as a unified server and data center switching fabric for LAN interconnect, clustered server interconnect and storage interconnect. To read more about Ethernet-based high performance computing clusters, go to: http://www.force10networks.com/whitepapers/ethernetclustering.asp.

Reliability of the Network and Network Applications

From the network perspective, maximizing availability and minimizing downtime can be achieved by considering the network as a system whose uptime is determined by the reliability of its components, its ability to continue operations in spite of component failures, and its manageability and serviceability to minimize recovery time when faults do occur.

Device Resiliency: The components of the network consist of all the switch/routers, appliances, and other devices that data must traverse in going from source to destination. Maximizing the reliability of these devices depends on a resilient hardware/software design that minimizes or eliminates single points of failure (SPOF) within the device. The reliability of network devices is normally specified in terms of mean time between failures (MTBF) expressed in hours or years. From its inception, Force10 Networks has implemented a product strategy of maximizing the resiliency and scalability of switch/routers by using redundant subsystems in both the data plane and control plane to eliminate SPOF and to maximize mission MTBF. Mission MTBF is a key metric for network devices with redundant subsystems because it is a measure of a device’s ability to continue to perform its mission in spite of internal failures among redundant elements. The E-Series and C-Series switch/routers both use the FTOS operating system, which supports numerous resiliency features including a modular design that isolates software faults and minimizes the need for system restarts. Further information on E-Series and C-Series resiliency and FTOS is available in the Force10 Networks white paper: FTOS Modular Operating System.

Network Design: Even with highly reliable and resilient network devices, it is necessary to employ network designs that use redundant devices functioning in parallel as a means of eliminating SPOF within the network topology. SPOF in the network are incompatible with very high levels of availability (>99.9 percent) because mean time to repair or restore (MTTR) after a failure is typically four hours or more, while 99.9 percent availability implies that the network can be down for less than 9 hours per year (0.1 percent of 8,760 hours is about 8.8 hours).
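The arithmetic behind this constraint can be sketched in a few lines. The function names are illustrative, and the year is taken as 8,760 hours (leap years ignored).

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours per year the network may be down at a given availability."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


def max_outages_per_year(availability_percent, mttr_hours):
    """How many outages of length MTTR fit in the yearly downtime budget."""
    return allowed_downtime_hours(availability_percent) / mttr_hours


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    outages = max_outages_per_year(target, mttr_hours=4.0)
    print(f"{target}% -> {hours:.2f} h/year, about {outages:.2f} four-hour outages")
```

With a four-hour MTTR, a 99.9 percent target leaves room for only about two outages per year, and a 99.99 percent target leaves room for none, which is why a single point of failure cannot be tolerated at these levels.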

Software models of the network can be used to estimate network availability based on device MTBFs, MTTR and the network topology. Force10 Networks uses this type of model at the device level to evaluate the impact of design alternatives on the device’s MTBF and availability. A Force10 white paper, Virtual Data Center Architectures, illustrates the use of redundant design in data center networks. This paper also shows how it is possible to achieve a reduced level of data center complexity through the use of virtualization design modularity, with scalable 2-tier switching in place of traditional 3-tier switching designs. The paper also addresses how clustering and virtualization can help improve application availability.
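A minimal sketch of such a model is shown below. It is not the model Force10 Networks uses. It applies the standard steady-state formulas (availability = MTBF / (MTBF + MTTR), series availabilities multiply, parallel unavailabilities multiply) and assumes that failures are independent. The MTBF and MTTR figures and the two-tier topology are hypothetical.

```python
from math import prod


def device_availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single device."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*availabilities):
    """All elements must be up (a chain of single points of failure)."""
    return prod(availabilities)


def parallel(*availabilities):
    """At least one redundant element must be up (independent failures assumed)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical device: 50,000 h MTBF, 4 h MTTR.
switch = device_availability(mtbf_hours=50_000, mttr_hours=4)

# Two-tier path (access + core) without and with redundant devices per tier.
single_path = series(switch, switch)
redundant_path = series(parallel(switch, switch), parallel(switch, switch))

print(f"single path:    {single_path:.6%}")
print(f"redundant path: {redundant_path:.6%}")
```

A model for a real network would also need to account for links, power, software faults and correlated failures, which this sketch leaves out.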

Serviceability: Gaining the maximum benefit from redundant subsystems in network devices and minimizing MTTR can be realized through advanced serviceability features supported by the device operating system. With Force10 Networks FTOS, these features include proactive monitoring of key subsystems to ensure correct operation, as well as automated procedures that can circumvent certain types of faults or gather all the data the operator will require to make a rapid diagnosis of the problem. To read more about next-generation switch/router diagnostics and debugging, go to: http://www.force10networks.com/whitepapers/nextgenswitchrouterdd.asp.

Security: Network intrusions also pose a serious threat to network availability. Attacks may be focused on the network with denial of service (DOS) or on the end systems. A fully protected network uses multi-layered security measures such as:

  • Firewalls at the perimeter of the network and to protect key subnets
  • Nodal protection (e.g., anti-virus software and user authentication/authorization)
  • Attack resistant network infrastructure (e.g., switch/routers and appliances)
  • Intrusion prevention systems (IPS) to screen both Internet and Intranet traffic

The Force10 Networks E-Series switch/routers use separate control plane processors for routing, switching and management, preventing network attacks on one function from affecting the others. The E-Series also uses rate limiting of control plane traffic to isolate the device from DOS attacks. To read more about the Force10 FTOS Modular Operating System, go to:
http://www.force10networks.com/whitepapers/ftos.asp.

 

Trading off Reliability and Cost

From the previous discussion, it is clear that building a reliable, high-availability network may involve considerable additional cost. One approach to trading off reliability and cost has been developed by the Gartner Group and Force10 Networks.

  1. Develop an estimate of the cost of downtime for your business. This can be based on estimates of loss of productivity plus revenue loss. A possible refinement would be to develop different downtime costs for distinct parts of the network (e.g., data center, core, access, website).
  2. Develop a goal for network availability that is appropriate for your type of business, and use this availability to calculate the hours or minutes of downtime that can be tolerated.
  3. The hours of downtime can then be used to compute the downtime risk, which in turn can be used to estimate the additional capital and operating expense that can be applied to improved reliability. An example of such a calculation is shown in Table 2.
  4. Build an availability model of the current network and verify that it produces results in agreement with measured availability. Using the model, produce a modified network design to deliver the availability target of step 2.
  5. Compare the cost of the network enhancements to the downtime risk calculated in step 3.
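The steps above can be sketched as a short calculation. This is one possible reading of the method rather than a reproduction of Table 2: downtime risk is taken here to be the yearly hours of downtime multiplied by the hourly cost of downtime, and every dollar figure and availability value is hypothetical.

```python
HOURS_PER_YEAR = 8760


def downtime_hours(availability_percent):
    """Step 2: yearly downtime implied by an availability figure."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def downtime_risk(availability_percent, cost_per_hour):
    """Step 3 (assumed definition): yearly downtime hours times hourly cost."""
    return downtime_hours(availability_percent) * cost_per_hour


# Step 1: hypothetical hourly cost = productivity loss + revenue loss.
cost_per_hour = 20_000 + 30_000

current_risk = downtime_risk(99.9, cost_per_hour)   # measured availability today
target_risk = downtime_risk(99.99, cost_per_hour)   # availability goal
risk_reduction = current_risk - target_risk

# Step 5: compare against a hypothetical yearly cost of the network enhancements.
enhancement_cost = 250_000
print(f"current risk:   ${current_risk:,.0f} per year")
print(f"target risk:    ${target_risk:,.0f} per year")
print(f"risk reduction: ${risk_reduction:,.0f} per year")
print("enhancement justified:", enhancement_cost < risk_reduction)
```

Step 4, building and validating the availability model, is what supplies the current and target availability figures used here.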

Table 2: Calculating downtime risk (Source: Gartner Dataquest, 2007)
 

Conclusion

A network that is 99.99 percent available clearly has more value than one with 99 percent availability. How much more the additional availability is worth depends on the individual business or enterprise. Modeling the impact of network downtime on the business and modeling the availability provided by various network designs allows tradeoffs between reliability and network investment costs. These calculations also allow the designer of enterprise networks to fully appreciate the additional value of products, such as the Force10 Networks E-Series and C-Series switch/routers, that have been designed with a comprehensive suite of reliability and serviceability features.

 