“Why ps Sucks” or “Counting Memory Consumption”

Recently, I held a course on select topics around embedded Linux, at a company in Zürich. The audience was pretty cool - they had ported their appliance from a hardcore embedded OS to Linux a couple of years ago. They are doing quite well nowadays. A bit too much realtime attitude maybe (SCHED_RR threads all over), but things appear to work.

On day 1, when we began to dive into multithreading (an inevitable topic nowadays), an interesting question came up. “We have 70 threads running, give each thread a stack size of 1 MB, and thus consume 70 MB for the stacks alone. Add heap and program, and a couple of other programs. Given a total memory of 128 MB, we’re soon dead.”

“Can’t be!” was my first attempt to clear up the situation. The attempt was rejected. ps output sure didn’t help a lot either. An explanation of virtual memory (part of the course anyway) was the second attempt, but still not bullet proof. More evidence needed. Fortunately day 1 was over at this point, and I was left with some overnight homework. During night I was able to come up with a plausible screenplay in example form, to give a basic understanding of how Linux does memory management. And that screenplay even backs my instinctive “Can’t be!” defense. It’s these late night experiments that I’m trying to share in this post.

Process Stack Management

First off, lets keep out multithreading and examine the stack behavior of a plain old process. The following program grows the stack up to a user supplied limit. Normally stack growth is done by calling functions on top of other functions on top of … . This is a bit cumbersome to program when you want to grow the stack up to a given size, so I use a handy little tricky function, alloca, to allocate stack space. It does essentially the same - grow stack -, and I don’t have to count stack addresses. Additionally, to be sure that the stack is actually used (“dirty”), I set the allocated bytes to zero, explicitly.

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <alloca.h>
#include <stdlib.h>

static size_t stack_growth; /* stack-allocated bytes */

int main(int argc, char** argv)
{
    void* mem;
    int error;

    if (argc != 2) {
        fprintf(stderr, "%s stack-growth\n", argv[0]);
        exit(1);
    }

    stack_growth = atoi(argv[1]);

    printf("PID: %d\n", getpid());

    mem = alloca(stack_growth);
    memset(mem, 0, stack_growth);

    printf("done\n");
    pause();
    return 0;
}

Compile like so,

$ gcc -o process-stack process-stack.c

So, lets start with a small stack,

$ ./process-stack 10
PID: 24299
done

Examine the various size attributes of the process, using the cool -o option to ps:

$ ps -o vsz,sz,size,rss -p 24299
   VSZ    SZ  SIZE   RSS
  3944   986   188   320

Ok, that’s really small. What do the columns mean? I sure don’t know - man ps is not very exact in its descriptions. Here’s my interpretation.

  • VSZ is the entire “virtual size”, whatever this means, in K. We sure can’t attribute read-only mappings of shared libraries like glibc to the process’s memory consumption - glibc’s code is shared between all processes that use it, and is resident in memory only once for all processes. Virtual memory basic usage, so to say. The VSZ column tells us nothing about memory usage, I presume.

  • SZ is the size of the “core image” of the process, in pages. Whatever that is. man ps tells me something about code, stack, data. The page size on my system is 4K, which leads me to assume that SZ roughly equals VSZ. I’m not interested in code, so forget about this one either.

  • SIZE looks promising, from what man ps tells me. “Amount of swap that would be required if the process were to dirty all writable pages and then be swapped out”. Allocated stack is dirtied by definition, so this appears to be a good measure of stack consumption - at least for our little stack-eater program. I assume that the size unit is 1K because SIZE is a little less than RSS (described below).

  • RSS, “Resident set size”, in 1K units. This is the amount of non-swapped memory the process is currently using. This does include in-core code pages as well, so this value is of limited use. Furthermore, I consider swapped memory relevant as well, and RSS doesn’t count that.

Conclusion: according to the SIZE column, allocating 10 bytes on the stack leads me to a program that consumes 188K of main memory. I suspect that this is the size of a minimal program anyway, even if it does not consume anything.

Anyway, let’s proceed with our tests and eat a million bytes stack.

$ ./process-stack 1000000
PID: 24908
done
$ ps -o vsz,sz,size,rss -p 24908
   VSZ    SZ  SIZE   RSS
  4800  1200  1044  1376

Ok, the columns have grown within reason and reflect what we did. Next, we become a bit greedy and want ten million bytes

$ ./process-stack 10000000
PID: 24960
Segmentation fault

We’ve hit the stack size limit 8MB which places a barrier against greedy people,

$ ulimit -s
8192

Eight million bytes is ok, and ps gives no surprise.

$ ./process-stack 8000000
PID: 25018
done
$ ps -o vsz,sz,size,rss -p 25018
   VSZ    SZ  SIZE   RSS
 11632  2908  7876  8236

Conclusion

The stack of a process starts small and grows on demand, magically, up to a limit. The logic is built in to the OS, which makes sense because it does not make sense to have a process without a stack. The operating system takes care of extending the stack by allocating memory under the hood, and we don’t want to bother.

Thread (pthread) Stack Management

Now for thread stacks. The story is a bit different here - POSIX threads have an attribute “stack size”. It can be explicitly set using pthread_attr_setstacksize(), or left default which is 2MB or the value of the RLIMIT_STACK resource limit if that is set (see man pthread_create). A test program similar to the one above, but with threads instead, would thus have the following parameters:

  • nthreads, the number of threads to create

  • stack-limit, the stack size attribute of each thread. We call it “limit” and not “size” as it will turn out that it is exactly that.

  • stack-growth, the number of bytes to allocate on the stack. This is done using alloca(), just like the process test program does.

The program creates nthreads threads. Each thread acts like the process example program above - allocate stack using alloca() and then shut up and sit. It looks as follows.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>

static size_t nthreads;
static size_t stack_limit;
static size_t stack_growth; /* stack-allocated bytes */

static void* thread_func(void* arg)
{
    void* mem = alloca(stack_growth);
    memset(mem, 0, stack_growth);
    pause();
}

int main(int argc, char** argv)
{
    int i;
    pthread_attr_t attr;

    if (argc != 4) {
        fprintf(stderr, "%s nthreads stack-limit stack-growth\n", argv[0]);
        exit(1);
    }

    nthreads = atoi(argv[1]);
    stack_limit = atoi(argv[2]);
    stack_growth = atoi(argv[3]);

    printf("PID: %d\n", getpid());

    pthread_attr_init(&attr);
    if (stack_limit > 0) {
        int error = pthread_attr_setstacksize(&attr, stack_limit);
        if (error) {
            fprintf(stderr, "set stack size to %ld: %s (%d)\n",
                            stack_limit, strerror(error), error);
            exit(1);
        }
    }

    pthread_attr_t* p_attr = NULL;
    if (stack_limit > 0)
        p_attr = &attr;

    for (i=0; i&lt;nthreads; i++) {
        pthread_t id;
        int rv = pthread_create(&id, p_attr, thread_func, NULL);
        if (rv != 0) {
            fprintf(stderr, "failed after %d threads\n", i);
            exit(1);
        }
    }

    pause();
    return 0;
}

Compile like so,

$ gcc -pthread -o thread-stack thread-stack.c

Experiment #1: 100 default threads, eating no stack

Let’s create a hundred threads with default stack size, each eating 100 bytes of stack.

$ ./thread-stack 100 0 100
PID: 31524
$ ps -o vsz,sz,size,rss -p 31524
   VSZ    SZ  SIZE   RSS
825840 206460 819936 1404

So what? SIZE reports the process as consuming over 800MB of memory. According to ps’s description, “if it were to dirty all writeable pages”, then this would be the amount of swap required. A little calculation shows that SIZE is approximately 100 times 8MB. 8MB is the RLIMIT_STACK resource limit that is configured on my machine (check with ulimit -s), and we started 100 threads. So it appears that the process has allocated 800MB worth of physical memory pages, although only 100 bytes of each stack have been eaten.

“Can’t be!” is what I said.

Of course the RSS field reports much less - but RSS does not report swapped memory, so we cannot count on it very much.

But anyway - let’s accept the alleged waste of memory for a moment and carry on with the experiments.

Experiment #2: 100 default threads, eating up stack

The first experiment created 100 threads with default stack size 8MB, and consumed almost nothing of the stacks. Lets eat up the stacks and see what ps reports this time.

$ ./thread-stack 100 0 8000000
PID: 771
$ ps -o vsz,sz,size,rss -p 771
   VSZ    SZ  SIZE   RSS
825840 206460 819936 766604

Aha. SIZE hasn’t changed, but RSS reports much more than the last time around. Apparently RSS does have value - at least on my system where no swap is configured.

Experiment #3: 100 threads with limited stack

See what effect a stack limit has.

$ ./thread-stack 100 4096 10
PID: 1026
set stack size to 4096: Invalid argument (22)

Ok, we cannot limit the stack to only a single page. We don’t insist (PTHREAD_STACK_MIN is 4 pages anyway), so lets increase stack size and see what ps tells us.

$ ./thread-stack 100 16384 10
PID: 1125
$ ps -o vsz,sz,size,rss -p 1125
   VSZ    SZ  SIZE   RSS
  7840  1960  1936  1404

Well. 100 minimal threads lead to a process that consumes minimal resources. Fine.

Conclusion: Provided that we carefully limit our threads’ stacks, we don’t eat up too much memory.

Can’t be! Do I really have to fine-tune my stacks and risk stack overflows and hard to find bugs?

Experiment #4: more threads than my system could take (eat no stack)

Now a definitive take: I have 64 bit address space, 4G of physical RAM, and no swap configured. So, I could create no more than 512 threads with 8MB stack size each - 512*8MB == 4G. Let’s try that out and create 513 threads. Each of the threads eats only 10 bytes of its stack.

$ ./thread-stack 513 0 10
PID: 2212
$ ps -o vsz,sz,size,rss -p 2212
   VSZ    SZ  SIZE   RSS
4210920 1052730 4205016 4576

Works! ps reports more SIZE than my system can take. What did they say about SIZE, “if it were to dirty all writeable pages”? This suggests that pages totalling 4205016 bytes have been allocated to the process. I don’t have that many pages, so it seems like I misunderstand. RSS seems to be definitive about the size.

Experiment #5: more threads than my system could take (eat stack)

Obviously the system permits its processes to “overcommit” memory. Others still get their share. Nobody complained during experiment #4, music kept playing without noticeable stutter. Now lets actually use the stack.

$ ./thread-stack 513 0 8000000
PID: 4353
Killed

Ok, that’s what I’d expect. Until the process was killed, the Red Hot Chili Peppers had become overly funky (audio glitches all over), and the Adobe Flash Plugin had crashed (Good Riddance). Less threads …

$ ./thread-stack 400 0 8000000
PID: 8462
$ ps -o vsz,sz,size,rss -p 8462
   VSZ    SZ  SIZE   RSS
3284640 821160 3278736 3064580

It looks like I can create a bit more than 400 threads which eat up their 8MB stacks. Not bad, as these numbers lie well within the physical constraints of my machine.

So, when I am able to create 400 threads which eat up their 8MB (default) stacks, then I should be able to create about 800 threads which eat up half of their 8MB stacks, right?

$ ./thread-stack 800 0 4000000
PID: 11338

That was ok, try 900 threads …

$ ./thread-stack 900 0 4000000
PID: 12156
Killed

Conclusion: We don’t have to fine-tune stacks! Just as with the process example, thread stacks are allocated on demand, up to a limit. A valid reason to decrease the stack size limit to a lower value than the default is to keep it from eating up more memory than expected. Stacks don’t shrink, so if I inadvertently - only once - call a function that uses a 3MB automatic variable, I have a memory leak.

How does this work?

First, have a look at the way the Pthread library sets up a thread. This is best done with strace. The system call to watch out for is clone(). clone() is used to create both processes (fork() is implemented in terms of clone()) and threads, just with different kinds of flags.

$ strace -f ./thread-stack 30 0 10

The output is rather long, I have tried to keep out the noise and show only the interesting stuff. We have told the program to create 30 threads with default stack size 8MB. Hence we see 30 blocks like this one,

[pid 14386] clone(child_stack=0x7f5813f22ff0,
          flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHA...) = 14413
[pid 14386] mmap(NULL, 8392704, PROT_READ|PROT_WRITE,
          MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fd14f9af000
[pid 14386] mprotect(0x7fd14f1ae000, 4096, PROT_NONE) = 0
[pid 14413] pause( &lt;unfinished ...

What we see here is,

  • The main thread, 14386, creates a thread 14413 using clone() with the CLONE_VM flag and a few other flags. The kernel creates a new “process” which shares the parent’s address space - which is basically the definition of a thread.

  • The main thread allocates the requested stack using mmap(). This creates a memory mapping - only a placeholder for memory, to be allocated with pages on demand, as memory is accessed. The memory is accessible in the caller’s address space at address 0x7fd14f9af000, extending for 8392704 bytes. Note that this is 4096 bytes more than the 8MB stack size.

  • The main thread protects 4096 bytes at the top of the stack (which it has allocated in addition to what was requested) with PROT_NONE. Meaning that access to this part of the mapping will lead to a segmentation fault. This is cheap and easy stack overflow protection.

  • The created thread 14413 then calls pause(), which is what the threads in our test program do after they have eaten their stack.

Once mappings have been created, we can inspect them in the process’s directory in the /proc filesystem:

$ cat /proc/14386/maps
...
7fd14f1af000-7fd14f9af000 rw-p 00000000 00:00 0
7fd14f9af000-7fd14f9b0000 ---p 00000000 00:00 0
...

These two lines are the result of mmap (PROT_READ|PROT_WRITE), followed by mprotect (PROT_NONE) of the topmost page. The first line is the 8MB stack which has read/write access, the second line is the “red” stack overflow protection page, without any access bits set. Still this doesn’t show any details of the mapping; these can be seen from another pseudo-file in the process’s /proc directory. (I can imagine that the presence of a second file with redundant information has historical reasons.)

$ cat /proc/14386/smaps
...
7fd14f1ae000-7fd14f1af000 ---p 00000000 00:00 0
Size:                  4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
7fd14f1af000-7fd14f9af000 rw-p 00000000 00:00 0
Size:               8192 kB
Rss:                   8 kB
Pss:                   8 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         8 kB
Referenced:            8 kB
Anonymous:             8 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
...

Here we see the same two mappings, but with additional information. It is exactly this information that we are missing from ps.

The first mapping represents the red page. Its size is 4K. No RSS, no nothing else. Pretty shallow, not backed by any physical memory.

The second mapping is the stack itself, with the following information:

  • The mapping’s extent (Size) is 8MB which is no surprise.

  • 8K is currently resident. Again, RSS does not help much as the number is swamped by swap.

  • The most important information is Private_Dirty - the number of bytes that are “dirty” and thus have to be allocated and attributed to the process. “Private” means that the memory is not shared with any other process (stacks are not shared of course), and thus the memory is attributed only to the process. Here we can see that, although the size of the mapping is 8MB, only 8K are actually used. As it happens the same amount is also resident, but again, this need not be.

Conclusion

There’s no reason to panic when ps reports large numbers. It’s just not easy to find out how much memory a process actually consumes. By understanding the information the /proc filesystem provides, you at least have the chance to find out what you need.

What is most important to understand is the on demand nature of memory allocation. That the size of a memory mapping is definitely meaningless, and that mappings are “filled” with memory pages as memory is actually accessed. Stacks are actually nothing but mappings as we saw above. The same principle applies to the heap (/proc/PID/maps and /proc/PID/smaps actually report a mapping named “heap”), program code (a mapping which is shared between many processes and which is read-only), global read-only and read-write data (the latter is copied on-demand and only then attributed to the modifying process). There are many other usages of memory mappings - dig through the /proc filesystem to find out. Documentation/filesystems/proc.txt from the Linux kernel source code gives a thorough explanation of the smaps entries, and much more.

Realtime is different

On demand memory allocation is counter productive in a realtime scenario as it can delay execution substantially. To overcome this situation, one needs to make sure memory is actually available beforehand. No way having to wait for stack memory to become available, for example.

This is what the mlock() and mlockall() system calls are there for - make sure that memory is available when it is needed. When locked into memory, mappings actually become populated with physical memory. Thread stacks, for example, are physically eaten up as they are created. Yes, realtime often brings contradictory requirements - this is one. In such a scenario, as only one example, it does absolutely make sense to pre-allocate limited stacks for each thread.

But as always, you decide based upon what you know and, most of all, upon your feeling. I wrote this rather lengthy post because I felt so lucky that my feeling was right. “Can’t be!”. It cannot be that an OS can be so stupid and eat up memory for nothing. I didn’t know 100% sure, so I could have been wrong just as well. If you have read up to this point at the end of kilometers of characters, then I hope you agree with me about my conclusions. If not, please comment! One can never be 100% sure, and I’d be glad to learn.