Feb 21, 2012

Release It! Design and Deploy Production-Ready Software - Nygard M.T.

Stability Antipatterns

Run longevity tests. It’s the only way to catch longevity bugs.

Integration Point

  • Beware this necessary evil (Every integration point will eventually fail in some way, and you need to be prepared for that failure.)
  • Prepare for the many forms of failure
  • Know when to open up abstractions (Debugging integration point failures usually requires peeling back a layer of abstraction. Failures are often difficult to debug at the application layer, because most of them violate the high-level protocols. Packet sniffers and other network diagnostics can help.)
  • Failures propagate quickly
  • Apply patterns to avert Integration Points problems (Defensive programming via Circuit Breaker, Timeouts, Decoupling Middleware, and Handshaking will all help you avoid the dangers of Integration Points.)

Chain Reactions

  • One server down jeopardizes the rest
  • Hunt for resource leaks
  • Hunt for obscure timing bugs
  • Defend with Bulkheads (Partitioning servers, with Bulkheads, can prevent Chain Reactions from taking out the entire service—though they won’t help the callers of whichever partition does go down. Use Circuit Breaker on the calling side for that.)

Cascading Failures

A cascading failure occurs when problems in one layer cause problems in callers.
  • Stop cracks from jumping the gap (A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.)
  • Scrutinize resource pools (A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.)
  • Defend with Timeouts and Circuit Breaker (A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensures that you can come back from a call out to the troubled one.)
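
The "Scrutinize resource pools" advice above can be sketched as a pool whose checkout gives up after a bounded wait. This is only an illustration, not code from the book: BoundedPool, checkout, checkin, and PoolExhausted are hypothetical names, and plain strings stand in for real connections.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no resource becomes free within the wait limit."""


class BoundedPool:
    """A fixed-size pool whose checkout never blocks forever."""

    def __init__(self, resources):
        self._free = queue.Queue()
        for r in resources:
            self._free.put(r)

    def checkout(self, timeout_s):
        # Wait at most timeout_s seconds instead of blocking indefinitely.
        try:
            return self._free.get(timeout=timeout_s)
        except queue.Empty:
            raise PoolExhausted(f"no resource free after {timeout_s}s") from None

    def checkin(self, resource):
        self._free.put(resource)


pool = BoundedPool(["conn-1", "conn-2"])   # stand-ins for real connections
a = pool.checkout(timeout_s=0.1)
b = pool.checkout(timeout_s=0.1)
try:
    pool.checkout(timeout_s=0.1)   # pool is empty: gives up instead of hanging
    exhausted = False
except PoolExhausted:
    exhausted = True
pool.checkin(a)
c = pool.checkout(timeout_s=0.1)   # succeeds again once a resource is returned
```

The caller gets an exception it can handle (or feed into a Circuit Breaker) rather than a thread that never comes back.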

Users


  • Users consume memory (Each user’s session requires some memory. Minimize that memory to improve your capacity. Use a session only for caching so you can purge the session’s contents if memory gets tight.)
  • Users do weird, random things (Users in the real world do things that you won’t predict (or sometimes understand). If there’s a weak spot in your application, they’ll find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing. Hire a bunch of chimpanzees to hammer on keyboards for more realistic testing.)
  • Malicious users are out there
  • Users will gang up on you

Blocked Threads

  • The Blocked Threads antipattern is the proximate cause of most failures (Application failures nearly always relate to Blocked Threads in one way or another, including the ever-popular “gradual slow-down” and “hung server.” The Blocked Threads antipattern leads to Chain Reactions and Cascading Failures.)
  • Scrutinize resource pools
  • Use proven primitives (Any library of concurrency utilities has more testing than your newborn queue.)
  • Defend with Timeouts
  • Beware the code you cannot see (All manner of problems can lurk in the shadows of third-party code. Be very wary. Test it yourself.)

Attacks of Self-Denial

Good marketing can kill you at any time.
  • Keep the lines of communication open
  • Protect shared resources
  • Expect rapid redistribution of any cool or valuable offer 

Scaling Effects



  • Examine production versus QA environments to spot Scaling Effects (You get bitten by Scaling Effects when you move from small one-to-one development and test environments to full-sized production environments. Patterns that work fine in small environments or one-to-one environments might slow down or fail completely when you move to production sizes.)
  • Watch out for point-to-point communication
  • Watch out for shared resources

Unbalanced Capacities

  • Examine server and thread counts (In development and QA, your system probably looks like one or two servers, and so do all the QA versions of the other systems you call. In production, the ratio might be more like ten to one instead of one to one. Check the ratio of front-end to back-end servers, along with the number of threads each side can handle, in production compared to QA.)
  • Observe near Scaling Effects and Users (Unbalanced Capacities is a special case of Scaling Effects: one side of a relationship scales up much more than the other side. A change in traffic patterns, whether seasonal, market-driven, or publicity-driven, can cause a usually benign front-end system to suddenly flood a back-end system, in much the same way as a Slashdot or Digg post causes traffic to suddenly flood websites.)
  • Stress both sides of the interface (If you provide the back-end system, see what happens if it suddenly gets ten times the highest ever demand, hitting the most expensive transaction. Does it fail completely? Does it slow down and recover? If you provide the front-end system, see what happens if calls to the back end stop responding or get very slow.)

Slow Responses

  • Slow Responses triggers Cascading Failures
  • For websites, Slow Responses causes more traffic (Users waiting for pages frequently hit the Reload button, generating even more traffic to your already overloaded system.)
  • Consider Fail Fast
  • Hunt for memory leaks or resource contention

SLA Inversion

  • Don’t make empty promises
  • Examine every dependency
  • Decouple your SLAs
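
A small piece of arithmetic shows why "Examine every dependency" matters. The sketch assumes every dependency is strictly required and that failures are independent, so the best availability the system can reach is the product of its parts. The figures are made up for illustration, and best_case_availability is a hypothetical helper.

```python
import math


def best_case_availability(dependency_availabilities):
    """Upper bound on availability when every dependency is required
    and failures are assumed to be independent: the product of the parts."""
    return math.prod(dependency_availabilities)


# Hypothetical figures: five required dependencies at 99.9% each.
combined = best_case_availability([0.999] * 5)
print(f"best case: {combined:.3%}")
```

Under these assumptions the result is roughly 99.5%, so promising 99.9% on top of those dependencies would be an empty promise unless the SLAs are decoupled.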

Unbounded Result Sets

  • Use realistic data volumes
  • Don’t rely on the data producers
  • Put limits into other application-level protocols

Stability Patterns


Use Timeouts

  • Apply to Integration Points, Blocked Threads, and Slow Responses
  • Apply to recover from unexpected failures
  • Consider delayed retries
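
A minimal sketch of "Consider delayed retries". It assumes the wrapped operation enforces its own timeout and raises TimeoutError when it expires; call_with_retries and its parameters are hypothetical names, and the sleep function is injectable so the demo does not actually wait.

```python
import time


def call_with_retries(operation, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Run operation(); on TimeoutError wait delay_s, then try again.

    The delay gives a struggling remote system time to recover instead of
    hitting it again immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts:
                raise            # out of attempts: surface the failure
            sleep(delay_s)


# Demo with a stand-in operation that times out twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


waits = []   # records the delays instead of really sleeping
result = call_with_retries(flaky, attempts=3, delay_s=0.5, sleep=waits.append)
```

The number of attempts stays small and bounded; unlimited retries would turn one slow dependency into a Blocked Threads problem for the caller.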

Circuit Breaker



  • Don’t do it if it hurts
  • Use together with Timeouts
  • Expose, track, and report state changes
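
The three points above can be sketched as a small state machine. This is one simplified reading of the pattern, not the book's implementation: CircuitBreaker, CircuitOpenError, the thresholds, and the injectable clock are all assumptions made for the example. The wrapped operation is expected to carry its own timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.transitions = []          # state changes, kept for reporting

    def _set_state(self, new_state):
        if new_state != self.state:
            self.transitions.append((self.state, new_state))
            self.state = new_state

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("call refused: circuit is open")
            self._set_state("half-open")   # let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
                self._set_state("open")
            raise
        self.failures = 0
        self._set_state("closed")
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])


def failing():
    raise TimeoutError("simulated timeout")


for _ in range(2):
    try:
        breaker.call(failing)
    except TimeoutError:
        pass

try:
    breaker.call(lambda: "never runs")   # refused without touching the dependency
    refused = False
except CircuitOpenError:
    refused = True

now[0] = 11.0                            # the reset interval has passed
recovered = breaker.call(lambda: "ok")   # trial call succeeds, breaker closes
```

The transitions list is the hook for "Expose, track, and report state changes": in a real system it would feed logging or monitoring rather than sit in memory.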

Bulkheads


  • Save part of the ship
  • Decide whether to accept less efficient use of resources
  • Pick a useful granularity
  • Very important with shared services models
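
Bulkheads can be applied at several granularities (servers, thread pools, processes). The sketch below picks one of them, a cap on concurrent calls per dependency, purely as an illustration. Bulkhead, BulkheadFull, and the "payments" and "search" partitions are hypothetical names.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a partition has no free slot."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot use every thread."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Do not queue behind a saturated partition; reject immediately.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name}: all slots in use")
        try:
            return operation()
        finally:
            self._slots.release()


payments = Bulkhead("payments", max_concurrent=1)
search = Bulkhead("search", max_concurrent=1)


def slow_payment():
    # While this call holds the only payments slot, a second payment is refused...
    try:
        payments.run(lambda: "second payment")
        second = "accepted"
    except BulkheadFull:
        second = "rejected"
    # ...but the search partition is unaffected.
    return second, search.run(lambda: "search ok")


outcome = payments.run(slow_payment)
```

The cost is the one the list names: slots reserved for one partition sit idle even when another partition could use them.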

Steady State

  • Avoid fiddling (Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run at least for a typical deployment cycle without manual disk cleanups or nightly restarts.)
  • Purge data with application logic
  • Limit caching
  • Roll the logs
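
Two of the points above, "Roll the logs" and "Limit caching", have direct counterparts in the Python standard library. The sketch below uses them with deliberately tiny, made-up limits (1 KB per log file, 128 cache entries) so the effect is visible; real values depend on the system.

```python
import logging
import logging.handlers
import os
import tempfile
from functools import lru_cache

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

# Keep at most 1 KB per file and 2 rotated files, so disk use stays bounded.
handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=1024, backupCount=2)
logger = logging.getLogger("steady_state_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

for i in range(500):
    logger.info("request %d handled", i)
handler.close()

log_files = sorted(os.listdir(log_dir))


@lru_cache(maxsize=128)          # bounded cache: least recently used entries are evicted
def lookup(key):
    return key * 2


for k in range(1000):
    lookup(k)
cache_size = lookup.cache_info().currsize
```

Neither limit needs a person to clean up after it, which is the point of Steady State.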

Fail Fast

  • Avoid Slow Responses: Fail Fast instead
  • Reserve resources, verify Integration Points early
  • Use for input validation
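
A sketch of the ordering that Fail Fast implies: cheap checks first, expensive work only when it can complete. The handle_order function, its parameters, and the RejectedRequest exception are hypothetical; the two boolean flags stand in for whatever mechanism (a Circuit Breaker state, a health check) reports on the integration points.

```python
class RejectedRequest(Exception):
    """Raised immediately when a request cannot possibly succeed."""


def handle_order(order, inventory_available, payment_breaker_open):
    """Check everything cheap first; start expensive work only if it can finish."""
    # Input validation before any resources are reserved.
    if order.get("quantity", 0) <= 0:
        raise RejectedRequest("quantity must be positive")
    if not order.get("customer_id"):
        raise RejectedRequest("customer_id is required")
    # Verify the integration points this request will need.
    if payment_breaker_open:
        raise RejectedRequest("payment service unavailable")
    if not inventory_available:
        raise RejectedRequest("inventory service unavailable")
    return f"order accepted for {order['customer_id']}"


good = handle_order({"customer_id": "c-1", "quantity": 2}, True, False)

try:
    handle_order({"customer_id": "c-1", "quantity": 0}, True, False)
    rejected_reason = None
except RejectedRequest as exc:
    rejected_reason = str(exc)

try:
    handle_order({"customer_id": "c-1", "quantity": 2}, True, True)
    unavailable_reason = None
except RejectedRequest as exc:
    unavailable_reason = str(exc)
```

In each rejected case the caller gets an answer at once, and no thread or connection is tied up on work that was going to fail anyway.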

Handshaking

  • Create cooperative demand control
  • Consider health checks
  • Build Handshaking into your own low-level protocols
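
A toy model of cooperative demand control: each server can say "not now" through a health check, and the caller asks before committing work. Server, health_check, submit, and dispatch are hypothetical names, and capacity is modeled as a simple in-flight counter.

```python
class Server:
    """Stand-in for a back-end instance that can decline work."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.in_flight = 0

    def health_check(self):
        # The handshake: report readiness before the caller commits work.
        return self.in_flight < self.capacity

    def submit(self, job):
        if not self.health_check():
            raise RuntimeError(f"{self.name} is at capacity")
        self.in_flight += 1
        return f"{self.name} accepted {job}"


def dispatch(servers, job):
    """Send the job to the first server whose health check passes."""
    for server in servers:
        if server.health_check():
            return server.submit(job)
    return None   # every server declined; the caller can shed load


servers = [Server("a", capacity=1), Server("b", capacity=1)]
r1 = dispatch(servers, "job-1")
r2 = dispatch(servers, "job-2")
r3 = dispatch(servers, "job-3")   # both servers are busy
```

The extra round trip of a health check is a cost; whether it pays off depends on how expensive a failed call is compared with the check.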

Test Harness

  • Emulate out-of-spec failures
  • Stress the caller
  • Leverage shared harnesses for common failures
  • Supplement, don’t replace, other testing methods
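
One out-of-spec failure that is easy to emulate is a remote end that accepts the connection and then never answers. The sketch below builds such a harness from a listening socket on the loopback interface that is never read from. The fetch_status function stands in for the caller under test and is a hypothetical name.

```python
import socket

# Harness: listen on an ephemeral local port but never accept or reply.
# The OS completes the TCP handshake from the listen backlog, so the client
# connects successfully and then waits for data that never arrives.
harness = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
harness.bind(("127.0.0.1", 0))
harness.listen(1)
port = harness.getsockname()[1]


def fetch_status(host, port, timeout_s):
    """Caller under test: must give up after timeout_s instead of hanging."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(b"PING\n")
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None


reply = fetch_status("127.0.0.1", port, timeout_s=0.2)
harness.close()
```

A caller without a timeout would hang here indefinitely, which is exactly the behavior a harness like this is meant to expose before production does.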

Decoupling Middleware

  • Decide at the last responsible moment
  • Avoid many failure modes through total decoupling
  • Learn many architectures, and choose among them

Capacity Antipatterns

Horizontal scaling

Vertical scaling


Resource Pool Contention

  • Eliminate contention under normal loads
  • If possible, size resource pools to the request thread pool
  • Prevent vicious cycles
  • Watch for the Blocked Threads pattern

Excessive JSP Fragments

  • Don’t use code for content (Loading JSP classes into memory is a kind of caching.)

AJAX Overkill

  • Avoid needless requests
  • Respect your session architecture
  • Minimize the size of replies
  • Increase the size of your web tier

Overstaying Sessions

  • Curtail session retention
  • Remember that users don’t understand sessions
  • Keep keys, not whole objects
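
A sketch of a session store that follows the first and last points above: sessions expire after a short idle period and hold only small keys. SessionStore, SESSION_TTL_S, and the ten-minute value are assumptions for the example, and the clock is injectable so the demo can move time forward.

```python
import time

SESSION_TTL_S = 600   # hypothetical 10-minute idle timeout


class SessionStore:
    """Sessions hold only small keys and expire after an idle period."""

    def __init__(self, ttl_s=SESSION_TTL_S, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._sessions = {}    # session_id -> (last_seen, data)

    def touch(self, session_id, **keys):
        _, data = self._sessions.get(session_id, (None, {}))
        data.update(keys)      # e.g. user_id=42, never the whole user object
        self._sessions[session_id] = (self.clock(), data)

    def get(self, session_id):
        entry = self._sessions.get(session_id)
        return entry[1] if entry else None

    def purge_expired(self):
        cutoff = self.clock() - self.ttl_s
        expired = [sid for sid, (seen, _) in self._sessions.items() if seen < cutoff]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)


now = [0.0]
store = SessionStore(clock=lambda: now[0])
store.touch("s1", user_id=42)
now[0] = 700.0                    # s1 has been idle longer than the TTL
store.touch("s2", cart_id=7)
purged = store.purge_expired()
```

Because the session holds only user_id or cart_id, losing it costs the user a lookup, not their data.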

Wasted Space in HTML

The Reload Button

Fast sites don’t provoke the user into hitting the Reload button.

Handcrafted SQL

Minimize handcrafted SQL

Database Eutrophication

  • Create indexes
  • Purge sludge
  • Keep reports out of production
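
"Create indexes" and "Purge sludge" can both be demonstrated on a throwaway SQLite database. The orders table, the idx_orders_customer index, the integer created_day column, and the 90-day retention window are all made up for the example; plan is a hypothetical helper that returns the text of SQLite's query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, created_day INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, created_day) VALUES (?, ?)",
    [(i % 50, i % 365) for i in range(5000)],
)


def plan(sql, params=()):
    """Return SQLite's query-plan detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)


query = "SELECT * FROM orders WHERE customer_id = ?"
before = plan(query, (7,))      # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query, (7,))       # with the index: a search using idx_orders_customer

# Purge sludge: delete rows outside a hypothetical 90-day retention window
# (created_day is a day number here; higher means newer).
deleted = conn.execute(
    "DELETE FROM orders WHERE created_day < ?", (365 - 90,)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Comparing the plan before and after is a cheap check that an index is actually being used, and it is worth repeating as data volumes grow.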

Integration Point Latency

Integration point latency is like the house advantage in blackjack. The more often you play, the more often it works against you. Avoid chatty remote protocols. They take longer to execute, and they tie up those precious request-handling threads.

Cookie Monsters

Use cookies for identifiers, not entire objects.

Capacity Patterns

  • Pool Connections 
  • Use Caching Carefully 
  • Precompute Content 
  • Tune the Garbage Collector

Transparency

Adaptation

Versioning API
