Investigated iMac Troubles: not a faulty scheduler but something related to memory

April 5th, 2006

Some time ago I claimed that multitasking on Mac OS X for Intel was buggy. Under some conditions (which I hadn't pinned down at the time) the GUI hung (something I had never seen on Mac OS before) and the whole system slowed down terribly.

Computations

Let me anticipate the result: the problem I found is real. Intel Macs seem to have trouble when large quantities of RAM are allocated. It is not a problem with the scheduler: the system simply slows down as a whole.

The first thing I did was to write a simple program that stressed the CPU and did a lot of I/O while allocating and deallocating small quantities of memory in a rather inefficient way. However, the system was not slowed down in any perceptible manner.

This is the post where I talked about that program.

Here I add some benchmarks. First I have to describe the machines involved. Of course this is not a PPC vs. Intel benchmark. Unfortunately the most powerful PPC machine I have is a notebook, and we can't expect it to compete with the iMac. What I want to show are the relative values between the two.

Machines

Model          CPU                Clock     RAM      Bus       Hard Disk
PowerBook G4   G4 (Single Core)   1.5 GHz   512 MB   167 MHz   5400 rpm
iMac CoreDuo   Intel CoreDuo      2.0 GHz   1.5 GB   667 MHz   7200 rpm

big_matrix

This is the test I described here.

I compiled the test with no optimizations. This was probably a mistake.

The full test on the iMac took more than twenty minutes (matrix 500×500). The Mac was usable and had no slowdowns:

time ./big_matrix
    real    20m39.110s
    user    12m10.943s
    sys     7m46.112s

Reducing the matrix size to 100×100 with no optimization the result is

time ./big_matrix
    real    0m9.683s
    user    0m5.805s
    sys     0m3.688s

Compiling with the -fast option did not change things much, nor did -O3 or -Os. As I said, the code was intended to be quite inefficient, so I'm not surprised the compiler wasn't really able to optimize it. Explicitly enabling -mmmx -msse -msse2 -msse3 gave a small improvement (about 5%, which could even be statistical variation).

As I said before, the most important thing is achieved anyway: the Mac remains perfectly usable.

For those interested in this sort of thing, the PowerBook took about an hour and a half. There, optimizations improved speed by a full 10% (which is quite acceptable). Still, I'm sad it performed so badly. I should investigate why AltiVec did not work properly (if it had, I suppose the PowerBook would have done better than being more than four times slower than the Intel machine).

Keep in mind that my software wasn't designed to run on multiple threads (this could be an interesting addition, though). However, the system kept moving it between the two cores, which prevents many possible optimizations.

Wonderings…

Now only very large allocations remained to be tested. So I wrote this small (idiotic) program.

Basically it takes a filename as a command-line argument, finds out the size of the file with a stat syscall, allocates enough space to hold it and then fills the buffer. If the file is big enough this (apart from being terribly inefficient) allocates a lot of RAM.

I ran it on a 985 MB file (which means the program used more than 900 MB of real memory, since the buffer is not only allocated, but filled too).

$ ls -lh ../../Desktop/Ubuntu_510.vpc7.sit
-rw-r--r--   1 riko  staff        985M 12 Feb 03:13 ../../Desktop/Ubuntu_510.vpc7.sit

The file is loaded correctly and this is the time bench.

$ time ./load_file ../../Desktop/Ubuntu_510.vpc7.sit
    real    3m31.010s
    user    0m0.001s
    sys     0m4.062s

This value varies a lot. Another time it took only 1m42s.

And… the Mac slowed down. I know that such a program is idiotic. However, it was one of the quickest ways to understand how the iMac behaves when something needs a lot of RAM (this could be a memory leak, for example).

In fact, in some cases the Mac stays slow for a while, until the RAM is actually released and other processes are paged back in.

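#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Ask read() for at most 8 MB at a time */
#define BUFFER (2<<22)

int main(int argc, char *argv[]){
        char *mem;
        int fd;
        size_t pos = 0;
        ssize_t res = 0;
        off_t sz;
        struct stat st;

        /* Find out how big the file is */
        stat(argv[1], &st);
        sz = st.st_size;

        /* One buffer as large as the whole file (deliberately unchecked) */
        mem = (char*)malloc(sz);

        /* Fill the buffer, so that the memory is really used */
        fd = open(argv[1], O_RDONLY, 0);
        while( (res = read(fd, mem + pos, BUFFER) ) > 0){
                pos+=res;
        }
        close(fd);

        free(mem);

        return 0;
}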

As you may notice, this performs no sanity check on the buffer returned by malloc. Don't use it on a 4 GB file: it will probably crash.

When I ran this very test on the PowerBook I expected the results to be terrible. The PowerBook does not have 1 GB of free RAM. It does not even have 1 GB of RAM: it has only 512 MB. That means that allocating and filling 1 GB relies heavily on paging (and causes a lot of disk accesses to swap pages of memory in and out).
Keeping this in mind, the results were quite good (and more stable: sometimes the iMac performs worse than the PowerBook, which has one third of the RAM). I would like someone with 1.5 or 2 GB of RAM to try this.

$ time ./load_file ../aks_old/nr.bkp
    real    3m31.526s
    user    0m0.002s
    sys     0m7.728s

Moreover, the file used was slightly bigger. So it took about twice the time (taking the best iMac run) or about the same time (taking the worst), but with a very big hardware handicap. Astonishing. This can also be read as a sign that something slowed the iMac down considerably.

I didn't mention it before: although slightly slowed down, the PowerBook was quite responsive and usable during the test, while the iMac was not.

I/O Only

I rewrote the program above to read the file through a smaller memory buffer instead of keeping it all in memory. This is the source code:

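#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Size of the fixed buffer: 8 MB */
#define BUFFER (2<<22)

int main(int argc, char *argv[]){
	char *mem;
	int fd;
	size_t pos = 0;
	ssize_t res = 0;
	off_t sz;
	struct stat st;

	/* Kept from the previous version; the size is not used here */
	stat(argv[1], &st);
	sz = st.st_size;

	/* Only one small buffer, reused for every read */
	mem = (char*)malloc(BUFFER);

	fd = open(argv[1], O_RDONLY, 0);
	while( (res = read(fd, mem, BUFFER) ) > 0);
	close(fd);

	free(mem);

	return 0;
}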

The speedup is amazing.

$ time ./read_file ../../Desktop/Ubuntu_510.vpc7.sit
    real    0m28.007s
    user    0m0.001s
    sys     0m1.472s

Other times I got about 17s. I should investigate this variance. However, the system did not slow down at all and remained perfectly usable. That makes me think the problem does not concern I/O, but memory.

The PowerBook performed like this:

$ time ./read_file ../aks_old/nr.bkp
    real    0m47.194s
    user    0m0.002s
    sys     0m3.833s

Memory only…

The last step is writing a stupid program that only allocates large chunks of memory. I made it allocate (and release) progressively larger chunks. First of all, this demonstrates that the issue is not limited to memory leaks.

Applications that allocate big quantities of RAM in large chunks are slowed down. You can also see that the Mac slows down (and the allocation time increases) as the block gets bigger.

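#include <stdio.h>
#include <stdlib.h>

/* NOTE: the published listing is cut off after "for(i=0; i". The loop
   condition and everything from the fill loop onwards are reconstructed
   from the output shown below; they are an assumption, not the original. */
int main (int argc, const char * argv[]) {
    unsigned long size = 2;
    unsigned long i;
    int *mem;

    /* With a 32-bit unsigned long the product wraps to 0 after the 2 GB step */
    while(size * sizeof(int) > 0){
        mem = (int*) malloc(size * sizeof(int));
        if (mem==NULL) break;
        printf("Allocated %lu bytes\n", (unsigned long)(size * sizeof(int)));
        for(i=0; i<size; ++i)
            mem[i] = (int)i;        /* touch every element */
        free(mem);
        printf("Deallocated %lu bytes\n", (unsigned long)(size * sizeof(int)));
        size *= 2;                  /* next chunk is twice as large */
    }
    return 0;
}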

I also wrote a version that only cycles through the variables without allocating. It took less than half a second to run, so it's not the looping that affects performance in the program. The first time I ran it with not-so-large chunks, and the computer remained quite responsive. Then I ran it with the full-size chunks, and it was hell. During the 1 GB allocation the computer was plainly unusable, not to mention the 2 GB one.

However the machine was much more usable than in the I/O + memory test.

time ./memory_allocator
Allocated 2 bytes
Deallocated 2 bytes
[SNIP]
Allocated 536870912 bytes
Deallocated 536870912 bytes

real    0m43.940s
user    0m9.196s
sys     0m9.137s
time ./memory_allocator
Allocated 8 bytes
Deallocated 8 bytes
[SNIP]
Allocated 1073741824 bytes
Deallocated 1073741824 bytes
Allocated 2147483648 bytes
Deallocated 2147483648 bytes

real    0m36.538s
user    0m9.181s
sys     0m8.851s

Small allocations

At this point I wrote a program that did smaller allocations. You can see that what matters is the quantity of RAM allocated. The very same task is significantly slower once the process has allocated more than 1 GB.

    [Starting software]
    utime: 566              stime: 4198

    [Allocated first chunk]
    utime: 20               stime: 30

    [Populated first chunk]
    utime: 117010           stime: 558634

    [Allocated second chunk]
    utime: 27               stime: 50

    [Populated second chunk]
    utime: 132365           stime: 12

    [Allocated third chunk]
    utime: 38               stime: 487

    [Populated third chunk]
    utime: 229719           stime: 10

    [Allocated fourth chunk]
    utime: 22               stime: 41

    [Populated fourth chunk]
    utime: 228182           stime: 880172

    * Freed first chunk.
    * Freed second chunk.
    * Freed third chunk.
    * Freed fourth chunk.

    utime: 79               stime: 2

and the program was:

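#include <stdio.h>
#include <stdlib.h>

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Prints user and system time consumed since the previous call */
void puts_rusage(){
	struct rusage ru;
	static struct timeval slast = {0, 0};
	struct timeval scurrent;
	static struct timeval ulast = {0, 0};
	struct timeval ucurrent;
	getrusage(RUSAGE_SELF, &ru);
	ucurrent = ru.ru_utime;
	scurrent = ru.ru_stime;
	printf("utime: %ld\t\tstime: %ld\n",
			(long)(ucurrent.tv_sec - ulast.tv_sec),
			(long)(scurrent.tv_sec - slast.tv_sec)
			);
	ulast = ucurrent;
	slast = scurrent;
}

int main (int argc, const char * argv[]) {
	unsigned long size = 2<<26;	/* 512 MB per chunk with 4-byte ints */
	unsigned long i;
	int *mem1;
	int *mem2;
	int *mem3;
	int *mem4;

	puts("[Starting software]");
	puts_rusage();
	mem1 = (int*) malloc(size*sizeof(int));
	if (mem1 == NULL) return 1;
	puts("\n[Allocated first chunk]");
	puts_rusage();
	/* NOTE: the published listing is cut off after "for(i=0; i". The rest
	   is reconstructed from the output above; it is an assumption. */
	for(i=0; i<size; ++i) mem1[i] = (int)i;
	puts("\n[Populated first chunk]");
	puts_rusage();

	mem2 = (int*) malloc(size*sizeof(int));
	if (mem2 == NULL) return 1;
	puts("\n[Allocated second chunk]");
	puts_rusage();
	for(i=0; i<size; ++i) mem2[i] = (int)i;
	puts("\n[Populated second chunk]");
	puts_rusage();

	mem3 = (int*) malloc(size*sizeof(int));
	if (mem3 == NULL) return 1;
	puts("\n[Allocated third chunk]");
	puts_rusage();
	for(i=0; i<size; ++i) mem3[i] = (int)i;
	puts("\n[Populated third chunk]");
	puts_rusage();

	mem4 = (int*) malloc(size*sizeof(int));
	if (mem4 == NULL) return 1;
	puts("\n[Allocated fourth chunk]");
	puts_rusage();
	for(i=0; i<size; ++i) mem4[i] = (int)i;
	puts("\n[Populated fourth chunk]");
	puts_rusage();

	free(mem1); puts("\n* Freed first chunk.");
	free(mem2); puts("* Freed second chunk.");
	free(mem3); puts("* Freed third chunk.");
	free(mem4); puts("* Freed fourth chunk.\n");
	puts_rusage();
	return 0;
}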

The last test should be launching several processes that each allocate a fairly large chunk of memory, to see how they slow the system down (if they do: I suppose that if you don't keep them busy, they will be paged out).

Conclusion

I definitely think something is not in order with the memory management.
The scheduler seems OK. The same tests left the PowerBook usable, while the iMac wasn't (although the iMac took significantly less time in almost every task).

3 Responses to “Investigated iMac Troubles: not a faulty scheduler but something related to memory”

  1. nda
    April 6th, 2006 @ 12:38

    >something did not work with multitasking of Mac OS X Intel

    More correct (and easier for English speakers to understand) like this:
    something was not in order with the multitasking of Mac OS X on Intel

    nda

  2. Enrico Franchi
    April 6th, 2006 @ 13:13

    Thanks.

    Indeed, I didn't have time to proofread. I used your suggestion for the conclusion (a similar construct) and changed the one at the beginning in a different way. Phew :)

  3. malcom’s blog » Blog Archive » Parallels vs Fusion (beta3)
    April 13th, 2007 @ 08:52

    [...] However, although there is a lot of good to count on, Parallels still suffers from several flaws, some of them fairly serious: first of all, CPU usage with Parallels open, even in the background, is heavy to say the least (more often than not, when switching back to the Mac, browsing and general use become frustrating. Added to this is the bug that still seems to afflict the Intel version of Mac OS X, which becomes almost unmanageable under high RAM consumption). Looking at the problem more closely, one finds that Parallels seems to use many Quartz windows for no apparent reason (the same happens with Google Earth and Qt), which in the long run greatly slows down the interface itself (sometimes it is impossible to use even just TextEdit). [...]


Enrico Franchi graduated in Maths and Computer Science and is now studying for a Computer Science MSc (though because of Italian bureaucracy that very course is to be cancelled).
