You are currently browsing the category archive for the 'C and C++' category.
Queste sono le FAQ del newsgroup italiano dedicato a C++.
Mi sono reso conto che in effetti non è facilissimo trovarle,
ache se contengono molte informazioni che risparmierebbero
domande banali sul newsgroup.
Per cui, nel caso, le linko.
Link.
Queste sono le FAQ del newsgroup italiano dedicato a C++.
Mi sono reso conto che in effetti non è facilissimo trovarle,
ache se contengono molte informazioni che risparmierebbero
domande banali sul newsgroup.
Per cui, nel caso, le linko.
Link.
In tempi di Duck Typing ridiventa certamente di moda la distinzione fra classi e tipi che si trova per esempio nel capitolo introduttivo di Design Patterns di Gamma, Helm, Johnson e Vlissides.
La maggior parte delle persone sono oggi abituate a linguaggi come Java o C++ dove classi e tipi in buona sostanza coincidono (con la possibile eccezione della programmazione generica).
Un tipo è il nome usato per denotare una particolare interfaccia. Questo non ha direttamente a che vedere con le Interfaces del Java, per inciso. L’interfaccia è l’insieme di metodi (o di messaggi, per dirla con Ruby, Smalltalk o ObjectiveC) cui un oggetto risponde.
Tuttavia un oggetto può implementare diversi tipi, e tipi diversi possono essere condivisi da oggetti molto diversi fra loro. In questo i tipi si comportano come le interfacce Java. Qualunque programmatore Java sa che ci sono Interfacce (uso il maiuscolo per indicare che intendo la parola nel senso “Javesco”) supportate da diversi oggetti (leggi Cloneable, per esempio) e che un qualunque oggetto può implementare un numero grande a piacere di Interfacce.
La differenza è che le Interfacce di Java sono statiche e “legate” alle classi. Una data classe dice di implementare un certo numero di interfacce. Le istanze di quella classe avranno quindi per tipo le varie interfacce.
In Ruby tuttavia questo non accade. Un oggetto ha un dato tipo perché risponde a determinati messaggi, ma non specifichiamo da nessuna parte quale sia il tipo. Per il solo fatto di rispondere a certi messaggi un oggetto è di un dato tipo (a prescindere dalla sua classe). I tipi sono “informali” (usando la stessa differenza fra protocolli formali e informali di ObjectiveC — i protocolli formali si comportano sostanzialmente in modo simile alle Interfacce di Java, quelli informali sono appunti “non staticamente tipizzati”).
La classe è invece legata direttamente all’implementazione. Citando DP del GOF
“È importante capire la differenza fra la classe di un oggetto e il suo tipo. La classe definisce come è stato implementato. Definisce il suo stato interno e l’implementazione delle sue operazioni. D’altra parte il tipo di un oggetto si riferisce solo alla sua interfaccia, all’insieme di richieste a cui può rispondere. Un oggetto può avere molti tipi e oggetti di classi differenti possono avere lo stesso tipo
Naturalmente c’è una una stretta parentela fra classe e tipo. Poiché una classe definisce le operazioni che un oggetto è in grado di eseguire, definisce anche il suo tipo. Quando diciamo che un oggetto è un’istanza di una classe, diciamo implicitamente che l’oggetto supporta l’interfaccia definita dalla classe.
Linguaggi come C++ e Eifell (e Java ndt) usano le classi per specificare
sia il tipo di un oggetto che la sua implementazione”
D’altra parte linguaggi dinamici come Ruby o Python (o Smalltalk) non dichiarano i tipi delle variabili. Mandano messaggi (o chiamano metodi, per dirla con Python) agli oggetti denotati dalle variabili e se tali oggetti supportano il messaggio, tutto funzionerà.
La differenza fra ereditarietà di classe e ereditarietà di tipo è importante. L’ereditarietà di classe riguarda definire l’implementazione di un oggetto attraverso l’implementazione di un altro oggetto. È senza dubbio un concetto “DRY”. Se ho già definito delle cose (e gli oggetti sono sufficientemente vicini) posso mantenerle uguali. In Ruby a questo si associano i mixin che permettono di condividere codice *senza* andare in ereditarietà (ma questa è un’altra storia), in Python si può usare l’ereditarietà multipla per emulare i mixin.
L’ereditarietà di tipo è invece relativa all’utilizzare un oggetto al posto di un altro. In questo senso siamo più vicini al Principio di Sostituzione di Barbara Liskov. Un ottimo articolo relativo al LSP si trova qui . In buona sostanza possiamo enunciare il Principio di Sostituzione di Liskov come:
T1 è un sottotipo di T2 se per ogni oggetto O1 di tipo T1 esiste un oggetto O2 di tipo T2 tale che per ogni programma definito in termini di T1 il comportamento del programma è invariato sostituendo O2 al posto di O1.
Confondere ereditarietà di tipo e di classe è facile. Molti linguaggi non hanno alcuna distinzione fra le due. Anzi, vengono usate le classi per definire i tipi. Molti programmatori per esempio vedono il LSP solo in relazione alle sottoclassi (in effetti in C++ è importante fare in modo tale che gli oggetti si comportino “bene” in relazione al LSP, in quanto è sempre possibile fare puntare un puntatore o una reference di un supertipo ad un oggetto del suo sottotipo).
In realtà il principio di Liskov è una cosa *molto* severa. Due oggetti sono sostituibili secondo Liskov se davvero (limitatamente al tipo comune) si comportano allo stesso modo, e viene naturale pensarli come classe - sottoclasse (anche se effettivamente questo *non* è necessario).
Il DuckTyping è meno severo: non chiede che il programma abbia lo stesso comportamento. Due oggetti sono appunto sostituibili se rispondono alle stesse cose, anche se rispondono in modo diverso. Non ha *nulla* a che fare con il Design By Contract: è molto più vicino a quello che in C++ si fa con i templates (che alcuni in effetti vedono come il corrispettivo statico del Duck Typing).
Tornando a noi, se un oggetto si comporta secondo un certo tipo (ovvero risponde ai metodi propri di quel tipo), allora trattiamolo come tale. Da cui: se si comporta come una papera, trattiamolo come una papera (Duck Typing, appunto).
Un linguaggio come Ruby rende *molto* problematico considerare come stretta la relazione fra tipi e classi. In ogni punto del programma (anche se non è sempre buona pratica farlo) possiamo cambiare il tipo di tutti gli oggetti di una data classe (aggiungendo o togliendo metodi alla stessa), o singolarmente ad un singolo oggetto. Sintomatico è in questo senso avere deprecato il metodo Object#type in favore di Object#class.
The quick-sort algorithm
Quick-sort is a sorting algorithm developed by Hoare that in average does O(n log n). The O() notation means that if the list length is n, the number of operations used to sort the list is in the order of magnitudo of n log n (in this case).
To be more formal, we shall say that there exist two constants c1 and c2 such that the number of operations m is
c1 * n * log n <= m <= c2 * n * log n. Notice that c1 and c2 are not specified in the O notation
We stop with math here. If you want more precise informations about the O notation read the wikipedia.
Now I’m going to briefly explain the quick-sort algorithm, and then we will use that algorithm to exploit the expressive power of some programming languages. Keep in mind that more expressive does not necessarily mean faster to run. It means faster to write. But being faster to write means that we can try more efficient algorithms, thus improving performance in a more effective way.
Expressive languages are usually high-level. In fact we can also profile the code, track the bottleneck and rewrite only a couple of functions in C or in ASM. This lead to fast programs with fast development cycles.
Quick-sort strategy
Quick-sort uses a divide and conquer approach. We divide a list in two sublists using a pivot element. All the elements less than the pivot come before it, those greater after. Then we recursively apply quick-sort to the two sublists.
Notice that the pivot is chosen arbitrarily. Some implementations chose the first element of the list. Some chose it randomly, some use other criteria.
Moreover there are implementations of quick-sort that sort the list in place (that is to say memory used is bounded), some need extra storage. Guess which are the more efficient.
On wikipedia you can find more informations about the quick-sort algorithm. You can find also some implementations, and a link to a page full of implementations in different languages.
I almost did not write code. I took code from the wikipedia or around the web and I’m going to comment it. This is the “new” work I did. Comments, explanations, not code. I also wrote a couple of implementations, however I don’t remember which ones.
Python
2.4.2 (#1, Mar 23 2006, 22:00:44)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)]
Now lets examine a Python high level implementation.
def qsort(L):
if L == []: return []
return qsort([x for x in L[1:] if x< L[0]]) + L[0:1] + \
qsort([x for x in L[1:] if x>=L[0]])
This is the exact description of the quick-sort algorithm, if you know what Python list comprehension are. When I write [x for x in L if p(x)], Python computer the list of elements of the L container that make p(x) true.
L[a:b] is the sublist of L [L[a], L[a+1], …, L[b-1], L[b]]. L[a:]==L[a:len(L)], L[:b]==L[0:b]
[x for x in L[1:] if x< L[0]] is the sublist of the elements of L a part from the first one (that is the pivot) such that x is less than the pivot (L[0]).
[x for x in L[1:] if x>=L[0]] is the sublist of the elements of L a part from the first one (that is the pivot) such that x is greater or equal than the pivot (L[0]).
Then we pass those sublists to qsort that recursively sorts them, and eventually we return a append all those sublists and return them.
This implementation, although beautiful to read and understand is horrible from a performance point of view. It creates, concatenates and destroys a large number of lists. Moreover it is purely recursive. For large lists, we can even exhaust the stack space.
A version of quick-sort in Python that has no need for more memory is this
def partition(array, begin, end, cmp):
while begin < end:
while begin < end:
if cmp(array[begin], array[end]):
(array[begin], array[end]) = (array[end], array[begin])
break
end -= 1
while begin < end:
if cmp(array[begin], array[end]):
(array[begin], array[end]) = (array[end], array[begin])
break
begin += 1
return begin
def sort(array, cmp=lambda x, y: x > y, begin=None, end=None):
if begin is None: begin = 0
if end is None: end = len(array)
if begin < end:
i = partition(array, begin, end-1, cmp)
sort(array, cmp, begin, i)
sort(array, cmp, i+1, end)
If you want to run this code, remember that in Python indentation matters. If you just copy and paste the code, you may end with unusable code.
Anyway, I benchmarked all of them. “Elegant quicksort” is the first implementation. “Standard quicksort” is the latter. “Bultin sort” is standard python sort algorithm (that is not a quicksort, but a dedicated sorting algorithm).
Elegant qsort: [21.20, 19.75, 20.08, 20.46]
Builtin qsort: [1.61, 2.29, 2.29, 2.29]
Standard qsort: [31.15, 30.60, 30.68, 30.38]
Probably due to heavy optimization performed in list comprehension the “elegant” algorithm is faster than the standard one. However both of them are more than ten times slower than the python built-in sort algorithm.
We have learnt:
- High level languages are great to express algorithms in a readable way (and we can use them to test them and see how they work and all)
- You should not implement “basic” algorithms in high level languages: use their built-ins or write them in C (most built-ins are written in C).
C
Now it is time for a couple of pure C quick-sort.
void qsort1(int a[], int lo, int hi ){
int h, l, p, t;
if (lo < hi) {
l = lo;
h = hi;
p = a[hi];
do {
while ((l < h) && (a[l] <= p))
l = l+1;
while ((h > l) && (a[h] >= p))
h = h-1;
if (l < h) {
t = a[l];
a[l] = a[h];
a[h] = t;
}
} while (l < h);
t = a[l];
a[l] = a[hi];
a[hi] = t;
qsort( a, lo, l-1 );
qsort( a, l+1, hi );
}
}
Another implementation is
void quicksort(int x[], int first, int last) {
int pivIndex = 0;
if(first < last) {
pivIndex = partition(x,first, last);
quicksort(x,first,(pivIndex-1));
quicksort(x,(pivIndex+1),last);
}
}
int partition(int y[], int f, int l) {
int up,down,temp;
int piv = y[f];
up = f;
down = l;
goto partLS;
do {
temp = y[up];
y[up] = y[down];
y[down] = temp;
partLS:
while (y[up] <= piv && up < l) {
up++;
}
while (y[down] > piv && down > f ) {
down–;
}
} while (down > up);
y[f] = y[down];
y[down] = piv;
return down;
}
C also have a builtin quick-sort, that is much more advanced than this ones, since it allows to specify a comparison function, thus working with all kinds of data types.
Benchmarking these was hard. They too fast to be benchmarked with a list of a million elements and the standard posix clock function. However if you want to run the bench yourself, I could not use MacOS X specific bench functions.
So I used a list of 100000000 elements.
This is the (ugly) code I used for benchmarking. The test is not really interesting. Of course since C integers are machine integers it is dramatically faster than other solutions. It could have been more interesting if we sorted other data structures, where the gap could be reduced.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "qsort1.h"
#include "qsort2.h"
#define LEN 100000000
int intcmp(const void *a, const void *b){
return *(int*)a - *(int*)b;
}
int main(){
size_t index;
int *l = malloc(sizeof(int)*LEN);
clock_t last, current;
srand(time(NULL));
for(index = 0; index < LEN; ++index){
l[index] = rand();
}
last = clock();
qsort(l,LEN, sizeof(int), intcmp);
current = clock();
printf("C qsort: %f\n", (current-last)/(float)CLOCKS_PER_SEC);
for(index = 0; index < LEN; ++index){
l[index] = rand();
}
last = clock();
qsort1(l,0, LEN-1);
current = clock();
printf("int qsort: %f\n", (current-last)/(float)CLOCKS_PER_SEC);
for(index = 0; index < LEN; ++index){
l[index] = rand();
}
last = clock();
quicksort(l,0, LEN-1);
current = clock();
printf("qsort 2: %f\n", (current-last)/(float)CLOCKS_PER_SEC);
free(l);
return 0;
}
I have just finished my first autotools project on the MacOS. Until today I deployed software only in Python or Java. I already worked on C and C++ projects that used libtool and automake and autoconf, but I had not to write the configure scripts myself or I developed them on GNU/Linux.
In fact I’m afraid most of the problems I encountered was MacOS X fault + my ignorance. That is to say libtoolize does not exist: it is called glibtoolize. So if I do not rename it, autoreconf simply does not work.
Probably I just had to set the LIBTOOLIZE environment variable to glibtoolize. However my solution works. The error I got was
Can't exec "libtoolize": No such file or directory at /opt/local/share/autoconf/Autom4te/FileUtils.pm line 288, <GEN3> line 3.
autoreconf: failed to run libtoolize: No such file or directory
Well… I’ve to fine tune it. I go.
Some time ago I stated multitasking of Mac OS X Intel was bugged. Under some conditions (which at the time I hadn’t discovered) the GUI hung (something I never saw on tre MacOS before) and all the system slowed down terribly.
Computations
I anticipate here: the problem I found is real. MacIntel seem to have problems when large quantities of RAM are allocated. It is not a problem with the scheduler. In fact the system simply slows down as a whole.
The fist thing I did was to write a simple program trat stressed CPU and made a lot of I/O and at tre same time allocated and deallocated small quantities of memory in a quite inefficient way. However tre system was not slowed in any perceptible manner.
this is tre post where I spoke about trat program.
Here I add some benchmarking. Now I have to describe tre machines involved. Of course tris not a PPC vs. Intel bench. Unfortunately tre most powerful PPC machine is a notebook, and we can’t expect to compete with the iMac. What I want to show are tre relative values between them.
Machines
| Model | CPU | Clock | RAM | Bus | Hard Disk |
|---|---|---|---|---|---|
| PowerBook G4 | G4 (Single Core) | 1.5 GHz | 512 MB | 167 MHz | 5400 rpm |
| iMac CoreDuo | Intel CoreDuo | 2.0 GHz | 1.5 GB | 667 MHz | 7200 rpm |
big_matrix
This the test I described here
I compiled the test with no optimizations. This is probably a mistake.
The full test on the iMac took more than twenty minutes (matrix 500×500). The Mac was usable and had no slowdowns:
time ./big_matrix
real 20m39.110s
user 12m10.943s
sys 7m46.112s
Reducing the matrix size to 100×100 with no optimization the result is
time ./big_matrix
real 0m9.683s
user 0m5.805s
sys 0m3.688s
Compiling with the -fast option did not change things much, nor did -O3 or -Os (as I said the code was intended to be quite inefficient, I’m not surprised compilers weren’t really able to optimize). However explicitly activating -mmmx -msse -msse2 -msse3 gave a little improvement (about 5%, that could even be a statistical variation).
As I said before the most important thing is however achieved: the mac remains perfectly usable.
For those who are interested in this sort of things, the powerbook took about an hour and an half. However optimizations improved speed by a full 10% (which is quite acceptable, indeed). However I’m sad it performed so badly. I should investigate why altivec did not work properly (If it did, I suppose it should do something more that 4 times and more slower than the Intel).
Keep in mind that my software wasn’t designed to work on multiple threads (This could be an interesting addition, thought). However the system kept on swapping it between the two cores, avoiding many possible optimizations.
Wonderings…
Now only very large allocations remained to do. So I wrote this small (idiotic) software.
Basically it takes a filename as a command line argument, finds out the dimension of the file with a stat syscall, allocates enough space to hold it and then fills the buffer. If the file is big enough this (a part from being terribly inefficient) allocates a lot of RAM.
I called it on a 985 MB file (that means the software allocated 900 MB of real memory, since it is not only allocated, but filled too).
$ ls -lh ../../Desktop/Ubuntu_510.vpc7.sit -rw-r--r-- 1 riko staff 985M 12 Feb 03:13 ../../Desktop/Ubuntu_510.vpc7.sit
The file is loaded correctly and this is the time bench.
$ time ./load_file ../../Desktop/Ubuntu_510.vpc7.sit
real 3m31.010s
user 0m0.001s
sys 0m4.062s
This value is really variable. Another time it took only 1m42s.
And… the Mac slowed down. I know that such a program is idiotic. However it was one of the quickest way to understand how behaves the iMac when someone needs a lot of RAM (this could be a memory leak, for example).
In fact in some cases the mac remains slowed down for a while, until RAM is truly released and other processes are paged in.
#include#include #include #include #include #include #define BUFFER 2<<22 int main(int argc, char *argv[]){ char *mem; int fd; size_t pos = 0, res=0; off_t sz; struct stat st; stat(argv[1], &st); sz = st.st_size; mem = (char*)malloc(sz); fd = open(argv[1], O_RDONLY, 0); while( (res = read(fd, mem + pos, BUFFER) ) != 0){ pos+=res; } close(fd); free(mem); return 0; }
As you may notice, this makes no check on sanity of the buffer allocated by malloc. Don’t use it on a 4 GB file, it will probably crash.
When I run this very test on the Powerbook I was prepared that the results would have been terrible. In fact the powerbook does not have 1 GB free ram. It does not even have 1 GB RAM. It has only 512 MB. That means that allocating and filling 1 GB relies heavily on paging (and makes a lot of disk accesses to swap in and out pages of memory).
Keeping this in mind, the results have been quite good (and more stable, in fact sometimes the iMac performs worse than the pb, that has 1/3 the RAM.). I would like that someone with 1.5 GB or 2 of RAM would try this.
$ time ./load_file ../aks_old/nr.bkp
real 3m31.526s
user 0m0.002s
sys 0m7.728s
Moreover the file used was slightly bigger. So it took about the double of the time (keeping the best iMac performance) or quite the same time (keeping the worst), but with a very big hardware handicap. Astonishing. This can also be interpreted saying that something slowed down the iMac considerably.
I didn’t mention it before. Although slightly slowed, the PowerBook was quite responsive and usable during the test, while the iMac was not.
I/O Only
I rewrote the software above to read the file in a smaller buffer of memory instead of keeping it all in memory. This is the source code:
#include#include #include #include #include #include #define BUFFER 2<<22 int main(int argc, char *argv[]){ char *mem; int fd; size_t pos = 0, res=0; off_t sz; struct stat st; stat(argv[1], &st); sz = st.st_size; mem = (char*)malloc(BUFFER); fd = open(argv[1], O_RDONLY, 0); while( (res = read(fd, mem, BUFFER) ) != 0); close(fd); free(mem); return 0; }
The speedup is amazing.
$ time ./read_file ../../Desktop/Ubuntu_510.vpc7.sit
real 0m28.007s
user 0m0.001s
sys 0m1.472s
Some other times I got about 17s. I should investigate this variance. However, the system did not slow down at all and remained perfectly usable. That makes me thing the problem does not concern I/O, but memory.
The powerbook performed like this:
$ time ./read_file ../aks_old/nr.bkp
real 0m47.194s
user 0m0.002s
sys 0m3.833s
Memory only…
The last step is writing a stupid software that only allocates large chunks of memory. I made it allocate (and release) progressively larger chunks. First of all this demonstrates the issue does not regard memory leaks only.
Applications that allocate big quantities of RAM in large chunks are slowed. You can also see that the mac slows down (and the allocation time increases) the more the block gets bigger.
#include#include #include int main (int argc, const char * argv[]) { unsigned long size = 2; unsigned long i; int *mem; while(size * sizeof(int) 0){ mem = (int*) malloc(size * sizeof(int)); if (mem==NULL) break; printf(”Allocated %u bytes\n”, size * sizeof(int)); for(i=0; i I also wrote a version that only cycles through variables without allocating. It took less than half second to run, so it’s not cycling that
affects performance in the software. The first time I run it with not so
large chunks. The computer remained quite responsive. Then I run it with full chunks. And it was a hell. In the 1 GB allocation the computer was plainly unusable, not to speak about the 2 GB.However the machine was much more usable than in the I/O + memory test.
time ./memory_allocator Allocated 2 bytes Deallocated 2 bytes [SNIP] Allocated 536870912 bytes Deallocated 536870912 bytes real 0m43.940s user 0m9.196s sys 0m9.137stime ./memory_allocator Allocated 8 bytes Deallocated 8 bytes [SNIP] Allocated 1073741824 bytes Deallocated 1073741824 bytes Allocated 2147483648 bytes Deallocated 2147483648 bytes real 0m36.538s user 0m9.181s sys 0m8.851sSmall allocations
At this point I wrote a program that did smaller allocations. You can see that what matters is the quantity of ram allocated. The very same task, when the process has allocated more than 1 GB is significantly slower.
[Starting software] utime: 566 stime: 4198 [Allocated first chunk] utime: 20 stime: 30 [Populated first chunk] utime: 117010 stime: 558634 [Allocated second chunk] utime: 27 stime: 50 [Populated second chunk] utime: 132365 stime: 12 [Allocated third chunk] utime: 38 stime: 487 [Populated third chunk] utime: 229719 stime: 10 [Allocated fourth chunk] utime: 22 stime: 41 [Populated fourth chunk] utime: 228182 stime: 880172 * Freed first chunk. * Freed second chunk. * Freed third chunk. * Freed fourth chunk. utime: 79 stime: 2and the software was
#include#include #include #include #include #include void puts_rusage(){ struct rusage ru; static struct timeval slast = {0, 0}; struct timeval scurrent; static struct timeval ulast = {0, 0}; struct timeval ucurrent; getrusage(RUSAGE_SELF, &ru); ucurrent = ru.ru_utime; scurrent = ru.ru_stime; printf(”utime: %ld\t\tstime: %ld\n”, ucurrent.tv_sec - ulast.tv_sec, scurrent.tv_sec - slast.tv_sec ); ulast = ucurrent; slast = scurrent; } int main (int argc, const char * argv[]) { unsigned long size = 2<<26; unsigned long i; int *mem1; int *mem2; int *mem3; int *mem4; puts("[Starting software]"); puts_rusage(); mem1 = (int*) malloc(size*sizeof(int)); puts("\n[Allocated first chunk]"); puts_rusage(); for(i=0; i The last test should be throwing different processes that allocate a quite large chunk of memory and see how they slow the system (if they do — I suppose if you don’t keep them doing something, they will be paged out).
Conclusion
Definitely I think there is something is not in order with the memory management.
The scheduler seems ok. The same tests left the PowerBook usable, while the iMac wasn’t (however it took significantly less time in almost every task).
/*! \class WindowsNT
* \brief Windows Nice Try.
* \author Bill Gates
* \author Several species of small furry animals gathered together
* in a cave and grooving with a pict.
* \version 4.0
* \date 1996-1998
* \bug It crashes a lot and requires huge amounts of memory.
* \bug The class introduces the more bugs, the longer it is used.
* \warning This class may explode in your face.
* \warning If you inherit anything from this class, you're doomed.
*/
class WindowsNT {};
Ci sono alcuni tipi di ottimizzazione che per un
umano sono “evidenti”, per una macchina no. Trovo molto interessanti i problemi nel campo della correttezza (ovvero cercatr di insegnare ad un computer a dimostrare se un sorgente è o meno corretto, se presenta certi errori, in generale capire cosa fa). Il problema gemello è
quello dell’ottimizzazione (si tratta per il computer di capire cosa fa
il tuo codice e di modificarlo in uno equivalente più veloce)
Per esempio prendiamo questa funzione:
int func(){
int i;
int x = 2<<10;
for ( i=0; i < MAX; ++i){
if (i>= 0 && sqrt(i) >= 0)
x/=i;
}
}
Ad occhio si vede benissimo che il check dell’if è oltremodo pesante.
*sai* che i è sempre maggiore o uguale di 0. Si può anche farne una
dimostrazione formale abbastanza agevole.
In pratica la guardia dell’if è inutile. Il check sul >= 0 non serve e
meno ancora sqrt. Nessuno dei due ha effetti collaterali (per il primo
si vede, è facile, ma per il secondo bisogna sapere come è fatta sqrt, a
sboccio non si modo di sapere che non setta un globale ad i, che stampa qualcosa o che ha un altro effetto collaterale, per cui
effettivamente non si può semplicemente eliminarla senza cambiare la
semantica del programma.
Dicevo. Senza ottimizzare, questo compila tenendo tutta la guardia
dell’if:
L3:
cmpl $0, -16(%ebp)
js L4
cvtsi2sd -16(%ebp), %xmm0
sqrtsd %xmm0, %xmm0
movapd %xmm0, %xmm1
leal LC0-"L00000000001$pb"(%ebx), %eax
movsd (%eax), %xmm0
ucomisd %xmm0, %xmm1
jae L7
jmp L4
C'è il primo confronto i>= 0
cmpl $0, -16(%ebp)
js L4
e c’è pure il confronto con la radice (sta usando dei registri mmx).
Bene… tutta la sezione L3 in effetti si può semplicemente togliere.
Ora in questo caso è semplicemente banale fare l’ottimizzazione. E in
effetti tutto il codice completo per la funzione di cui sopra
ottimizzato è:
.text
.globl _func
_func:
pushl %ebp
movl %esp, %ebp
xorl %eax, %eax
L2:
addl $1, %eax
cmpl $10, %eax
jne L2
popl %ebp
ret
.subsections_via_symbols
Lungo da solo quanto il controllo. E questo è quasi esattamente quello che
avrebbe scritto un programmatore:
Piccolo preludio, azzera il primo registro. Poi nel loop, incrementa il
primo registro (addl $1, %eax), confrontalo con il valore guardia
(nel sorgente MAX era una define a 10 ; cmpl $10, %eax), se è minore o uguale torna a L2 (jne L2). Esci.
Ha ottimizzato perfino x/=i. Siccome vede che con x non ci faccio nulla,
allora dice: che lo calcolo a fare? Fa solo il ciclo. Fosse stato più
furbo non avrebbe fatto manco quello. Stupido io a prendere l’esempio fatto male.
Modifichiamo quindi la funzione in modo che ritorni x, e che quindi x vada calcolato.
Ecco… vediamo che a questo punto non è più tanto furbo:
.text
.globl _func
_func:
pushl %ebp
movl %esp, %ebp
pushl %ebx
xorl %ecx, %ecx
movl $2048, %eax
pxor %xmm1, %xmm1
jmp L2
L3:
testl %ecx, %ecx
js L11
L2:
cvtsi2sd %ecx, %xmm0
sqrtsd %xmm0, %xmm0
ucomisd %xmm1, %xmm0
jb L11
cltd
idivl %ecx
L11:
addl $1, %ecx
cmpl $9, %ecx
jle L3
popl %ebx
popl %ebp
ret
.subsections_via_symbols
Ricompare la radice quadrata ( sqrtsd %xmm0, %xmm0 ). Il
programmatore che avesse scritto l’assembly a mano avrebbe scritto [
nel caso in cui questo loop sia molto ricorrente ] un codice
significativamente più veloce.
Ammesso che questo mio esempio è del tutto cretino, è vero che il
compilatore in genere sa su quale architettura certe istruzioni sono più
sgamate, ma è vero ancora che le tecniche per le dimostrazioni formali
sono ancora relativamente poco usate (anche perchè suddette
dimostrazioni dilatano di moltissimo il tempo di compilazione, sono
*molto* difficili da scrivere, e rischiano di portare bachi su bachi con
errori minimi). Questo è parte di quasi tutti gli ambienti, non solo del
gcc.
Come dire… l’umano sulle cose “furbe” resta più furbo, anche se sulla forza bruta non ha speranza.
This is the first test to understand what makes my MacIntel GUI terribly slow when some software run.
I wrote a small C++ program on that purpose. It makes some operations on matrices, but it does them in a quite inefficient way. When we multiply two matrices, it creates two vector objects, allocates memory for them, multiplies them, stores the result, and destroys them.
Of course this is not what you would be doing when implementing a matrix library, but we are trying to understand which software bugs are more problematic on the MacIntel and aren’t on the PPC (in my previous post I told running some buggy sw on the Powerbook wasn’t an issue, but made the MacIntel almost unusable).
This software does a lot of computations (it multiplies two quite large matrices of “points” — structures of three doubles). Points use vectorial product. It does a lot of allocation and deallocation of small area of memory and I added I/O making it log the result each time it multiplies two points (that is a lot of I/O, since when we multiply two 500 element matrices it makes 500*500*500 point multiplications).
The verdict is positive. It is running right now (it hasn’t yet finished) but the computer is perfectly usable.
Now I’m going to write a software that allocates and deallocates large areas of memory and then one that leaks a lot of memory. I do think, memory leaks are the cause of the “GUI slowness”. However I still don’t understand why the powerbook (that has 1/3 the RAM my iMac has) had no troubles.
Here I post the source of this simple software. Keep in mind that this was designed to be inefficient (don’t use it, it sucks).
#include <cstdlib>
#include <iostream>
#include <ostream>
#include <fstream>
std::ofstream LOG;
struct Point{
double x;
double y;
double z;
Point() :
x(0.0),
y(0.0),
z(0.0)
{}
Point(double l_x, double l_y, double l_z) :
x(l_x),
y(l_y),
z(l_z)
{}
static Point random(){
Point p(
rand()/(double)RAND_MAX,
rand()/(double)RAND_MAX,
rand()/(double)RAND_MAX
);
return p;
}
};
std::ostream& operator<<(std::ostream& out, const Point& p){
out << "(" << p.x << "," << p.y << "," << p.z << ")" ;
return out;
}
Point operator*(const Point& op1, const Point& op2){
Point res(
op1.y * op2.z - op1.z * op2.y,
op1.z * op2.x - op1.x * op2.z,
op1.x * op2.y - op1.y * op2.x
);
LOG << res << std::endl;
return res;
}
Point operator+(const Point& op1, const Point& op2){
Point res(
op1.x + op2.x,
op1.y + op2.y,
op1.z + op2.z
);
return res;
}
template<typename T, size_t size>
class Vector{
public:
Vector(){
mem_ = new T[size];
}
Vector(const Vector<T, size>& o){
mem_ = new T[size];
for(size_t i=0; i<size; ++i){
mem_[i]=o.mem_[i];
}
}
~Vector(){
delete [] mem_;
}
T& operator[](size_t i){
return mem_[i];
}
const T& operator[](size_t i) const{
return mem_[i];
}
private:
T* mem_;
Vector<T, size>& operator=(const Vector<T, size>&);
};
template<typename T, size_t size>
T operator*(const Vector<T, size>& op1, const Vector<T, size> &op2){
T acc;
for(size_t i=0; i<size; ++i){
acc = acc + op1[i]*op2[i];
}
return acc;
}
template<typename T, size_t size>
class Matrix{
public:
Matrix(){
mem_ = new T[size*size];
for(size_t i=0; i<size; ++i){
for(size_t j=0; j<size; ++j){
get(i,j)=T::random();
}
}
}
Matrix(const Matrix<T, size>& o){
mem_ = new T[size*size];
for(size_t i=0; i<size*size; ++i){
mem_[i] = o.mem_[i];
}
}
~Matrix(){
delete [] mem_;
}
Matrix<T, size>& operator=(const Matrix<T, size>& o){
for(size_t i=0; i<size*size; ++i){
mem_[i] = o.mem_[i];
}
return (*this);
}
T& get(size_t i, size_t j){
return mem_[j*size + i];
}
const T& get(size_t i, size_t j) const{
return mem_[j*size + i];
}
Vector<T, size> col(size_t j) const{
Vector<T, size> v;
for(size_t i=0; i<size; ++i){
v[i] = get(i,j);
}
return v;
}
Vector<T, size> row(size_t i) const{
Vector<T, size> v;
for(size_t j=0; j<size; ++j){
v[j] = get(i,j);
}
return v;
}
private:
T* mem_;
};
template<typename T, size_t size>
Matrix<T, size> operator*(Matrix<T, size>& op1,
Matrix<T, size>& op2)
{
Matrix<T, size> res;
for(size_t i=0; i<size; ++i){
for(size_t j=0; j<size; ++j){
res.get(i,j) = op1.row(i) * op2.col(j);
}
}
return res;
}
int main(){
const size_t SIZE = 500;
LOG.open("log.txt");
Matrix<Point, SIZE> m1;
Matrix<Point, SIZE> m2;
Matrix<Point, SIZE> res;
res= m1*m2;
LOG.close();
}
Good news. With gmp 4.2 assembly optimization works. That means that you can get decent performances. For values of decent that are *below* those of an old Prescott and just a bit better than those of a plain Pentium M with the same clock.
The problem is that (for example) you can’t run make check. This makes me thing something is broken. However, I can’t understand what. However I’ve been told that “MacIntels” are not supported by gmp-4.2 . So consider twice before buying a MacIntel if you need to work with gmp.
And you can’t use c++ too. For some reason there is an error with the generation of an assembly optimization. The answer from the developers has been “gmp-4.2 is not supported on MacIntels” (however I am not really able to consider this a solution to the problem, but unfortunately I’m not skilled enought to fix things by myself).
In fact the second core is not used at all, so this result is quite predictable. Moreover I used shared libraries instead of static ones (for the very good reasons that the guys at apple don’t ship the gcc with the static version of libgcc and of crt0.o, so there is no easy way to do it).
In the end I assume MacOS X on Intel is young and probably not as optimized as a FreeBSD (just to name one). These are the results:
iMac 2 GHz 2 GB
***** GMPbench version 0.1 ***** Using default CFLAGS = "-O3 -fomit-frame-pointer -I../gmp-4.2" Using default CC = "gcc" Using default LIBS = "-lgmp -L../gmp-4.2/.libs" Using compilation command: gcc -O3 -fomit-frame-pointer -I../gmp-4.2 foo.c -o foo -lgmp -L../gmp-4.2/.libs You may want to override CC, CFLAGS, and LIBS Using gmp version: 4.2 Compiling benchmarks Running benchmarks Category base Program multiply multiply 128 128 GMPbench.base.multiply.128,128 result: 9530908 multiply 512 512 GMPbench.base.multiply.512,512 result: 1150785 multiply 8192 8192 GMPbench.base.multiply.8192,8192 result: 12500 multiply 131072 131072 GMPbench.base.multiply.131072,131072 result: 228 multiply 2097152 2097152 GMPbench.base.multiply.2097152,2097152 result: 9.62 GMPbench.base.multiply result: 12463 Program divide divide 8192 32 GMPbench.base.divide.8192,32 result: 306090 divide 8192 64 GMPbench.base.divide.8192,64 result: 104119 divide 8192 128 GMPbench.base.divide.8192,128 result: 66800 divide 8192 4096 GMPbench.base.divide.8192,4096 result: 20668 divide 8192 8064 GMPbench.base.divide.8192,8064 result: 268859 divide 131072 8192 GMPbench.base.divide.131072,8192 result: 435 divide 131072 65536 GMPbench.base.divide.131072,65536 result: 242 divide 8388608 4194304 GMPbench.base.divide.8388608,4194304 result: 0.796 GMPbench.base.divide result: 5617.3 GMPbench.base result: 8367.1 Category app Program rsa rsa 512 GMPbench.app.rsa.512 result: 2755 rsa 1024 GMPbench.app.rsa.1024 result: 478 rsa 2048 GMPbench.app.rsa.2048 result: 72.6 GMPbench.app.rsa result: 457.26 GMPbench.app result: 457.26 GMPbench result: 1956
I had problems with almost all scientific libraries I tried.
gmp builds only if you use –host=none-apple-darwin. That means you are disabling assembly optimizations (and you probably wouldn’t want to . And that option was correctly set for plain gmp, but not for gmp-cxx-wrappers (Gregory Wrigh has no access to a Mac intel, now he should have fixed it).
cln is plainly broken. That means you can’t use GiNaC too. I think someone should buy cln’s developers a MacMini Intel.
If you want to use OCaml, you have to use a cvs special version, since the stable does not yet compile. The solution is here.
In fact I had lots of problems. The guys at darwin ports have been really nice (using dp is just the quickest way to install this kind of software), but there are *lots* of troubles. The same installation on my powerbook G4 went just fine (of course the took a lot to finish, but they did finish).
Portfile for ntl on MacIntel was broken too (now it’s fixed).
how much I do love C?

I have to admit I’m a libtool noob. However I know gcc and g++ 4.0.1 available with MacOS Tiger are able to compile c++ dynamic libraries. In fact this is done with flag -dynamiclib
However, gmp if configured with –enable-cxx and without disabling dyamic libraries fails. In fact it passes g++ the -shared option (that works for linux). If you compile plain gmp with no C++ support, the problem does not exist. I’m afraid it’s a bug with gmp libtool (I suppose you could fix it with autoreconf, but I did not try it).
Unfortunately I need c++ support since I have to work with ppl. The solution is to modify all occurrences of “-shared” in the configure file with “-dynamiclib“. And appears not to work.
The same hack worked with readline (but I’m afraid that’s because readline was written in C and somehow support for C dynamic libraries seems better on mac os x: I’d say “older”, so it’s more likely developers made it work.
However darwin ports is able to build it the correct way. You only have to specify somewhere that gmp-cxx-wrappers should build with –host=none-apple-darwin