A process in Linux can allocate memory without actually using it. This can create situations where you have far more memory allocated than is physically present in the machine. We had one process that kept allocating memory without using it, until it ran into a barrier. Look at this munin graph to see what happened:
Now I was puzzled by the 250GB limit. A 2.6 kernel should be able to allocate 1TB of memory on a machine if it's available, so why would it run out of allocatable space at 250GB? It took me a little while, but after a tip from jtopper on IRC I looked at the vm sysctl documentation and found the overcommit_ratio setting, which happened to be 50. And the machine just happened to have about 5GB of RAM. Well, look at that: 5 times 50 is 250GB. We found the reason why the graph stops increasing at about 250GB!
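If you want to check these knobs on your own box, they all live under /proc. For reference, the kernel's overcommit-accounting documentation says that in strict mode (vm.overcommit_memory = 2) the commit limit is swap plus overcommit_ratio percent of physical RAM, and the current limit and usage are reported in /proc/meminfo. A quick look (values are machine-dependent, of course):

```shell
# Overcommit policy: 0 = heuristic (default), 1 = always allow, 2 = strict
cat /proc/sys/vm/overcommit_memory

# Percentage of physical RAM counted toward the commit limit in strict mode
cat /proc/sys/vm/overcommit_ratio

# The kernel's current commit limit and how much is committed right now
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```

Comparing Committed_AS against CommitLimit is a good way to spot a runaway allocator like ours before it slams into the ceiling.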