Python / Django: Release It! Design and Deploy Production-Ready Software

Stability Antipatterns

Run longevity tests. It’s the only way to catch longevity bugs.

Integration Point

Beware this necessary evil (Every integration point will eventually fail in some way, and you need to be prepared for that failure.)
Prepare for the many forms of failure
Know when to open up abstractions (Debugging integration point failures usually requires peeling back a layer of abstraction. Failures are often difficult to debug at the application layer, because most of them violate the high-level protocols. Packet sniffers and other network diagnostics can help.)
Failures propagate quickly
Apply patterns to avert Integration Points problems (Defensive programming via Circuit Breaker, Timeouts, Decoupling Middleware, and Handshaking will all help you avoid the dangers of Integration Points.)

Chain Reactions

One server down jeopardizes the rest
Hunt for resource leaks
Hunt for obscure timing bugs
Defend with Bulkheads (Partitioning servers, with Bulkheads, can prevent Chain Reactions from taking out the entire service—though they won’t help the callers of whichever partition does go down. Use Circuit Breaker on the calling side for that.)

Cascading Failures

A cascading failure occurs when problems in one layer cause problems in callers.

Stop cracks from jumping the gap (A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.)
Scrutinize resource pools (A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.)
Defend with Timeouts and Circuit Breaker (A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensures that you can come back from a call out to the troubled one.)
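
A minimal sketch of the "safe resource pool" idea above, assuming a fixed set of pre-built resources. The names BoundedPool, PoolTimeout, and wait_seconds are illustrative and do not come from the book or from any library.

```python
import queue
from contextlib import contextmanager


class PoolTimeout(Exception):
    """Raised when no resource becomes free within the allowed wait."""


class BoundedPool:
    """Hypothetical fixed-size pool whose checkout never waits forever."""

    def __init__(self, resources, wait_seconds=0.5):
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)
        self._wait_seconds = wait_seconds

    @contextmanager
    def checkout(self):
        try:
            # Wait a bounded time for a free resource instead of blocking forever.
            resource = self._free.get(timeout=self._wait_seconds)
        except queue.Empty:
            raise PoolTimeout(
                "no resource free within %.2fs" % self._wait_seconds
            ) from None
        try:
            yield resource
        finally:
            # Always hand the resource back, even if the caller raised.
            self._free.put(resource)
```

A caller that gets PoolTimeout can return an error right away rather than joining a queue of blocked threads.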

Users

Users consume memory (Each user’s session requires some memory. Minimize that memory to improve your capacity. Use a session only for caching so you can purge the session’s contents if memory gets tight.)
Users do weird, random things (Users in the real world do things that you won’t predict (or sometimes understand). If there’s a weak spot in your application, they’ll find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing. Hire a bunch of chimpanzees to hammer on keyboards for more realistic testing.)
Malicious users are out there
Users will gang up on you

Blocked Threads

The Blocked Threads antipattern is the proximate cause of most failures (Application failures nearly always relate to Blocked Threads in one way or another, including the ever-popular “gradual slow-down” and “hung server.” The Blocked Threads antipattern leads to Chain Reactions and Cascading Failures.)
Scrutinize resource pools
Use proven primitives (Any library of concurrency utilities has more testing than your newborn queue.)
Defend with Timeouts
Beware the code you cannot see (All manner of problems can lurk in the shadows of third-party code. Be very wary. Test it yourself.)
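
"Use proven primitives" and "Defend with Timeouts" can be combined in Python by using the standard threading.Lock with a bounded wait. This is only a sketch: inventory_lock and update_inventory are hypothetical names.

```python
import threading

inventory_lock = threading.Lock()  # hypothetical lock guarding shared state


def update_inventory(apply_change, wait_seconds=0.2):
    """Run apply_change under the lock, or give up after wait_seconds.

    Returns True if the change ran, False if the lock could not be acquired.
    """
    if not inventory_lock.acquire(timeout=wait_seconds):
        return False  # the caller can fail fast or retry later
    try:
        apply_change()
        return True
    finally:
        inventory_lock.release()
```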

Attacks of Self-Denial

Good marketing can kill you at any time.

Keep the lines of communication open
Protect shared resources
Expect rapid redistribution of any cool or valuable offer

Scaling Effects

Examine production versus QA environments to spot Scaling Effects (You get bitten by Scaling Effects when you move from small one-to-one development and test environments to full-sized production environments. Patterns that work fine in small environments or one-to-one environments might slow down or fail completely when you move to production sizes.)
Watch out for point-to-point communication
Watch out for shared resources

Unbalanced Capacities

Examine server and thread counts (In development and QA, your system probably looks like one or two servers, and so do all the QA versions of the other systems you call. In production, the ratio might be more like ten to one instead of one to one. Check the ratio of front-end to back-end servers, along with the number of threads each side can handle, in production compared to QA.)
Observe near scaling effects and users (Unbalanced Capacities is a special case of Scaling Effects: one side of a relationship scales up much more than the other side. A change in traffic patterns—seasonal, market-driven, or publicity-driven—can cause a usually benign front-end system to suddenly flood a back-end system, in much the same way as a Slashdot or Digg post causes traffic to suddenly flood websites.)
Stress both sides of the interface (If you provide the back-end system, see what happens if it suddenly gets ten times the highest ever demand, hitting the most expensive transaction. Does it fail completely? Does it slow down and recover? If you provide the front-end system, see what happens if calls to the back end stop responding or get very slow.)

Slow Responses

Slow Responses triggers Cascading Failures
For websites, Slow Responses causes more traffic (Users waiting for pages frequently hit the Reload button, generating even more traffic to your already overloaded system.)
Consider Fail Fast
Hunt for memory leaks or resource contention

SLA Inversion

Don’t make empty promises
Examine every dependency
Decouple your SLAs

Unbounded Result Sets

Use realistic data volumes
Don’t rely on the data producers
Put limits into other application-level protocols
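
One way to avoid relying on the data producer is to cap what the consumer reads. The sketch below assumes the rows arrive as any Python iterable; fetch_limited and MAX_ROWS are made-up names.

```python
from itertools import islice

MAX_ROWS = 100  # assumed cap; pick one that fits realistic data volumes


def fetch_limited(rows, limit=MAX_ROWS):
    """Return (page, truncated): at most `limit` rows, plus a flag that
    says whether the producer had more to give."""
    # Read one extra row so truncation can be detected without
    # consuming the whole (possibly huge) result set.
    page = list(islice(rows, limit + 1))
    return page[:limit], len(page) > limit
```

In Django, slicing a queryset (for example `queryset[:100]`) pushes a similar limit down into the SQL query.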

Stability Patterns

Use Timeouts

Apply to Integration Points, Blocked Threads, and Slow Responses
Apply to recover from unexpected failures
Consider delayed retries
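
A rough sketch of delayed retries, assuming the wrapped operation signals a timeout by raising the built-in TimeoutError. The function name and parameters are illustrative; the sleep argument exists only so the delay can be replaced in tests.

```python
import time


def call_with_retries(operation, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call operation(); on TimeoutError wait delay_seconds and try again.

    Re-raises the last TimeoutError once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError as error:
            last_error = error
            if attempt < attempts - 1:
                # Pause before retrying: an immediate retry usually hits
                # the same problem that caused the timeout.
                sleep(delay_seconds)
    raise last_error
```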

Circuit Breaker

Don’t do it if it hurts
Use together with Timeouts
Expose, track, and report state changes
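
The following is a simplified, single-threaded sketch of a circuit breaker, not the book's implementation. The class name, thresholds, and the transitions list are assumptions made for illustration; a production version would also need locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0        # consecutive failures while closed
        self._opened_at = None    # None means the circuit is closed
        self.transitions = []     # state changes, kept so they can be reported

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_seconds:
            return "half-open"    # cool-down elapsed: allow a trial call
        return "open"

    def call(self, operation):
        before = self.state
        if before == "open":
            raise CircuitOpenError("circuit open; call skipped")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if before == "half-open" or self._failures >= self.failure_threshold:
                # Trip (or re-trip) the breaker and restart the cool-down.
                self._opened_at = self._clock()
                self.transitions.append(before + "->open")
            raise
        if before == "half-open":
            self.transitions.append("half-open->closed")
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each call in its own timeout (as the list above suggests) means a hung dependency counts as a failure instead of blocking the caller.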

Bulkheads

Save part of the ship
Decide whether to accept less efficient use of resources
Pick a useful granularity
Very important with shared services models
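
Inside a single process, one possible bulkhead is a separate concurrency limit per kind of work, so that one partition filling up cannot starve the others. The sketch below uses hypothetical partition names and limits.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a partition has no free slots."""


class Bulkhead:
    """Hypothetical per-partition concurrency limiter."""

    def __init__(self, limits):
        # One independent semaphore per partition, e.g. {"reports": 2}.
        self._slots = {
            name: threading.BoundedSemaphore(size) for name, size in limits.items()
        }

    def run(self, partition, operation):
        slot = self._slots[partition]
        # Do not wait: a full partition rejects work instead of queueing it.
        if not slot.acquire(blocking=False):
            raise BulkheadFull(partition)
        try:
            return operation()
        finally:
            slot.release()
```

The cost is the one named above: slots reserved for an idle partition cannot be borrowed by a busy one.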

Steady State

Avoid fiddling (Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run at least for a typical deployment cycle without manual disk cleanups or nightly restarts.)
Purge data with application logic
Limit caching
Roll the logs
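
For "Limit caching", Python's functools.lru_cache takes a maxsize so the cache cannot grow without bound. The function and the size below are made up for illustration.

```python
from functools import lru_cache

CACHE_SIZE = 256  # assumed limit; tune it against available memory


@lru_cache(maxsize=CACHE_SIZE)
def shipping_zone(postal_code):
    """Hypothetical expensive lookup. It returns an immutable string,
    so sharing the cached value between callers is safe."""
    return "zone-%s" % postal_code[:2]
```

Once the cache holds CACHE_SIZE entries, the least recently used entry is evicted for each new key; shipping_zone.cache_info() reports hits, misses, and the current size.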

Fail Fast

Avoid Slow Responses and Fail Fast
Reserve resources, verify Integration Points early
Use for input validation
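
A small sketch of Fail Fast applied to input validation: check the request before touching any expensive resource. The order layout, InvalidOrder, and reserve_stock are hypothetical.

```python
class InvalidOrder(ValueError):
    """Raised before any resource is reserved."""


def place_order(order, reserve_stock):
    """Validate first, then call the expensive reserve_stock(order)."""
    items = order.get("items")
    if not items:
        raise InvalidOrder("order has no items")
    for item in items:
        if item.get("quantity", 0) <= 0:
            raise InvalidOrder("quantity must be positive: %r" % (item,))
    # Only a valid order reaches the costly step.
    return reserve_stock(order)
```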

Handshaking

Create cooperative demand control
Consider health checks
Build Handshaking into your own low-level protocols

Test Harness

Emulate out-of-spec failures
Stress the caller
Leverage shared harnesses for common failures
Supplement, don’t replace, other testing methods
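
One out-of-spec failure that is easy to emulate is a server that accepts the connection and then never sends a byte. The sketch below is a bare-bones harness built on the standard socket module; start_silent_server is a made-up name, and the server binds to an ephemeral port on localhost.

```python
import socket
import threading


def start_silent_server():
    """Start a TCP server that accepts connections but never replies.

    Returns (server_socket, port).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(5)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                connection, _ = server.accept()
            except OSError:
                return  # the server socket was closed
            held.append(connection)  # accept, then stay silent

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]
```

Pointing the code under test at this port shows whether its read timeout actually fires or whether the calling thread simply hangs.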

Decoupling Middleware

Decide at the last responsible moment
Avoid many failure modes through total decoupling
Learn many architectures, and choose among them

Capacity Antipatterns

Horizontal scaling

Vertical scaling

Resource Pool Contention

Eliminate contention under normal loads
If possible, size resource pools to the request thread pool
Prevent vicious cycles
Watch for the Blocked Threads pattern

Excessive JSP Fragments

Don’t use code for content (Loading JSP classes into memory is a kind of caching.)

AJAX Overkill

Avoid needless requests
Respect your session architecture
Minimize the size of replies
Increase the size of your web tier

Overstaying Sessions

Curtail session retention
Remember that users don’t understand sessions
Keep keys, not whole objects
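
"Keep keys, not whole objects" might look like the sketch below, where the session is assumed to be a plain dict-like object and load_user is a hypothetical lookup supplied by the caller (for example, a database query).

```python
def remember_user(session, user):
    """Store only the identifier, not the whole user object."""
    session["user_id"] = user["id"]


def current_user(session, load_user):
    """Reload the user on demand from its key; None if nobody is logged in."""
    user_id = session.get("user_id")
    if user_id is None:
        return None
    return load_user(user_id)
```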

Wasted Space in HTML

The Reload Button

Fast sites don’t provoke the user into hitting the Reload button.

Handcrafted SQL

Minimize handcrafted SQL

Database Eutrophication

Create indexes
Purge sludge
Keep reports out of production

Integration Point Latency

Integration point latency is like the house advantage in blackjack. The more often you play, the more often it works against you. Avoid chatty remote protocols. They take longer to execute, and they tie up those precious request-handling threads.

Cookie Monsters

Use cookies for identifiers, not entire objects.

Capacity Patterns

Pool Connections
Use Caching Carefully
Precompute Content
Tune the Garbage Collector

Transparency

Adaptation

Versioning API

Feb 21, 2012

Release It! Design and Deploy Production-Ready Software - Nygard M.T.