i386 PDA patches use of %gs

Ingo Molnar
Posted: Wed Nov 15, 2006 7:30 pm    Subject: Re: i386 PDA patches use of %gs
Archived from groups: linux>kernel

* Jeremy Fitzhardinge <jeremy.DeleteThis@goop.org> wrote:

> Ingo Molnar wrote:
> >well, the most important thing i believe you didnt test: the effect of
> >mixing two descriptors on the _same_ selector: one %gs selector value
> >loaded and used by glibc, and another %gs selector value loaded and used
> >by the kernel, intermixed. It's the mixing that causes the descriptor
> >cache reload. (unless i missed some detail about your testcase)
>
> But it doesn't mix different descriptors on the same selector; the GDT
> is initialized when the CPU is brought up, and is unchanged from then
> on. The PDA descriptor is GDT entry 27 and the userspace TLS entries
> are 6-8, so in the typical case %gs will alternate between 0x33 and
> 0xd8 as it enters and leaves the kernel.
>
> My test program does the same thing, except using GDT entries 6 and 7
> (selectors 0x33 and 0x3b).

no, that's not what it does. It measures 50000000 switches of the _same_
selector value, without using any of the selectors in the loop itself.
I.e. no mixing at all! But when the kernel and userspace uses %gs, it's
the cost of switching between two selector values of %gs that has to be
measured. Your code does not measure that at all, AFAICS.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo.DeleteThis@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Ingo Molnar
Posted: Wed Nov 15, 2006 7:40 pm    Subject: Re: i386 PDA patches use of %gs

* Ingo Molnar <mingo.TakeThisOut@elte.hu> wrote:

> > My test program does the same thing, except using GDT entries 6 and
> > 7 (selectors 0x33 and 0x3b).
>
> no, that's not what it does. It measures 50000000 switches of the
> _same_ selector value, without using any of the selectors in the loop
> itself. I.e. no mixing at all! But when the kernel and userspace uses
> %gs, it's the cost of switching between two selector values of %gs
> that has to be measured. Your code does not measure that at all,
> AFAICS.

for example, your test_fs() code does:

for(i = 0; i < COUNT; i++) {
asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
: "+m" (*offset): "r" (seg) : "memory");
sync();
}

that loads (and uses) a single selector value for %fs, and doesnt do any
mixed use as far as i can see.

Ingo

Ingo Molnar
Posted: Wed Nov 15, 2006 7:40 pm    Subject: Re: [PATCH] i386-pda UP optimization

* Jeremy Fitzhardinge <jeremy.DeleteThis@goop.org> wrote:

> [...] However, when I measured segment register use timings, I didn't
> see any dramatic costs associated with segment register use which
> would account for a 5% hit in your benchmark.

if by that measurement you mean time-segops.c, i dont think it correctly
measures 'mixed' use of different selector values for the same %gs
segment selector. And that's what i suggested for you to measure in
September, and that's what Eric's testcase triggers too.

Ingo

Jeremy Fitzhardinge
Posted: Wed Nov 15, 2006 7:50 pm    Subject: Re: i386 PDA patches use of %gs

Ingo Molnar wrote:
> for example, your test_fs() code does:
>
> for(i = 0; i < COUNT; i++) {
> asm volatile("push %%fs; mov %1, %%fs; addl $1, %%fs:%0; popl %%fs"
> : "+m" (*offset): "r" (seg) : "memory");
> sync();
> }
>
> that loads (and uses) a single selector value for %fs, and doesnt do any
> mixed use as far as i can see.

I'm not sure what you're getting at. Each loop iteration is analogous
to a user->kernel->user transition with respect to the
save/reload/use/restore pattern on the segment register. In this case,
%fs starts as a null selector, gets reloaded with a non NULL selector,
and then is restored to null. Do you mean some other mixing?

J

Ingo Molnar
Posted: Wed Nov 15, 2006 7:50 pm    Subject: Re: i386 PDA patches use of %gs

* Jeremy Fitzhardinge <jeremy.TakeThisOut@goop.org> wrote:

> > that loads (and uses) a single selector value for %fs, and doesnt do
> > any mixed use as far as i can see.
>
> I'm not sure what you're getting at. Each loop iteration is analogous
> to a user->kernel->user transition with respect to the
> save/reload/use/restore pattern on the segment register. In this
> case, %fs starts as a null selector, gets reloaded with a non NULL
> selector, and then is restored to null. Do you mean some other
> mixing?

yeah, mixed use: i.e. set up /two/ selector values and load them into
%gs and read+write memory through them. It might not change the results,
but that's what i meant under 'mixed use'.

Ingo

Ingo Molnar
Posted: Wed Nov 15, 2006 7:50 pm    Subject: Re: i386 PDA patches use of %gs

* Jeremy Fitzhardinge <jeremy RemoveThis @goop.org> wrote:

> Ingo Molnar wrote:
> > no, that's not what it does. It measures 50000000 switches of the _same_
> > selector value, without using any of the selectors in the loop itself.
> > I.e. no mixing at all! But when the kernel and userspace uses %gs, it's
> > the cost of switching between two selector values of %gs that has to be
> > measured. Your code does not measure that at all, AFAICS.
> >
> I think you're misreading it. This is the inner loop:
>
> for(i = 0; i < COUNT; i++) {
> asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
> : "+m" (*offset): "r" (seg) : "memory");
> sync();
> }
> return "gs";
>
> On entry, %gs will contain the normal usermode TLS selector. "seg" is
> another selector allocated with set_thread_area(). The asm pushes the
> old %gs, loads the new one, uses a memory address via the new segment,
> then restores the previous %gs.

but it does not actually use the 'normal usermode TLS selector' - it
only loads it.

a meaningful test would be to allocate two selector values and load and
read+write memory through both of them.

Ingo

Jeremy Fitzhardinge
Posted: Wed Nov 15, 2006 7:50 pm    Subject: Re: i386 PDA patches use of %gs

Ingo Molnar wrote:
> no, that's not what it does. It measures 50000000 switches of the _same_
> selector value, without using any of the selectors in the loop itself.
> I.e. no mixing at all! But when the kernel and userspace uses %gs, it's
> the cost of switching between two selector values of %gs that has to be
> measured. Your code does not measure that at all, AFAICS.
>
I think you're misreading it. This is the inner loop:

for(i = 0; i < COUNT; i++) {
asm volatile("push %%gs; mov %1, %%gs; addl $1, %%gs:%0; popl %%gs"
: "+m" (*offset): "r" (seg) : "memory");
sync();
}
return "gs";

On entry, %gs will contain the normal usermode TLS selector. "seg" is
another selector allocated with set_thread_area(). The asm pushes the
old %gs, loads the new one, uses a memory address via the new segment,
then restores the previous %gs.

So given this output:

"Genuine Intel(R) CPU T2400 @ 1.83GHz" @1000Mhz (6,14,8):
ds=7b fs=0 gs=33 ldt=f gdt=3b CPUTIME
[...]

The initial %fs and %gs are 0 and 0x33 respectively, and it is using
0x3b as the other GDT selector (and 0xf as the other LDT selector).

J
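
The selector values quoted in this exchange (0x33, 0x3b, 0xd8, 0xf) follow from the standard x86 selector layout: descriptor index in the upper bits, one table-indicator bit, and a two-bit privilege level. A minimal sketch of that arithmetic is below; the helper names `make_selector` and `selector_index` are made up for illustration and are not part of time-segops.c or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* An x86 segment selector packs three fields:
 *   bits 15..3  descriptor index
 *   bit  2      table indicator (0 = GDT, 1 = LDT)
 *   bits 1..0   requested privilege level (0 = kernel, 3 = user)
 */
enum table { TABLE_GDT = 0, TABLE_LDT = 1 };

/* Hypothetical helper: build a selector from its three fields. */
static uint16_t make_selector(unsigned index, enum table ti, unsigned rpl)
{
    assert(index < 8192 && rpl < 4);
    return (uint16_t)((index << 3) | ((unsigned)ti << 2) | rpl);
}

/* Hypothetical helper: recover the descriptor index from a selector. */
static unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}
```

With these, `make_selector(6, TABLE_GDT, 3)` gives 0x33 (the first TLS entry as seen from user mode), `make_selector(7, TABLE_GDT, 3)` gives 0x3b, `make_selector(27, TABLE_GDT, 0)` gives 0xd8 (the PDA entry as loaded by the kernel), and `make_selector(1, TABLE_LDT, 3)` gives 0xf.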

Ingo Molnar
Posted: Wed Nov 15, 2006 8:00 pm    Subject: Re: i386 PDA patches use of %gs

* Jeremy Fitzhardinge <jeremy.TakeThisOut@goop.org> wrote:

> Ingo Molnar wrote:
> > but it does not actually use the 'normal usermode TLS selector' - it
> > only loads it.
> >
> > a meaningful test would be to allocate two selector values and load and
> > read+write memory through both of them.
> >
>
> Well, obviously in one case it would need to switch between
> null/non-null/null. But yes, good point about using the "usermode"
> %gs each iteration. I'll do some more tests.

i'd not even use glibc's %gs but set up two separate selectors. (that's
a more controlled experiment - someone might run a non-TLS glibc, etc.)

Ingo

Jeremy Fitzhardinge
Posted: Wed Nov 15, 2006 8:00 pm    Subject: Re: i386 PDA patches use of %gs

Ingo Molnar wrote:
> but it does not actually use the 'normal usermode TLS selector' - it
> only loads it.
>
> a meaningful test would be to allocate two selector values and load and
> read+write memory through both of them.
>

Well, obviously in one case it would need to switch between
null/non-null/null. But yes, good point about using the "usermode" %gs
each iteration. I'll do some more tests.

J

Ingo Molnar
Posted: Wed Nov 15, 2006 8:10 pm    Subject: Re: [PATCH] i386-pda UP optimization

* Jeremy Fitzhardinge <jeremy DeleteThis @goop.org> wrote:

> Arjan van de Ven wrote:
> > segment register accesses really are not cheap.
> > Also really it'll be better to use the register userspace is not using,
> > but we had that discussion before; could you remind me why you picked
> > %gs in the first place?
> >
>
> To leave open the possibility of using the compiler's TLS support in
> the kernel for percpu. I also measured the cost of reloading %gs vs
> %fs, and found no difference between reloading a null selector vs a
> non-null selector.

what point would there be in using it? It's not like the kernel could
make use of the thread keyword anytime soon (it would need /all/
architectures to support it) ... and the kernel doesnt mind how the
current per_cpu() primitives are implemented, via assembly or via C. In
any case, it very much matters to see the precise cost of having the pda
selector value in %gs versus %fs.

Ingo

Jeremy Fitzhardinge
Posted: Wed Nov 15, 2006 8:10 pm    Subject: Re: i386 PDA patches use of %gs

Ingo Molnar wrote:
> i'd not even use glibc's %gs but set up two separate selectors. (that's
> a more controlled experiment - someone might run a non-TLS glibc, etc.)
>

Well, in that case they probably don't care whether the kernel uses %fs
or %gs ;)

But either way, this doesn't have much bearing on Eric's test; we'd be
only talking about a few ns per kernel exit, rather than 5% for read/write.

J

Ingo Molnar
Posted: Wed Nov 15, 2006 8:10 pm    Subject: Re: i386 PDA patches use of %gs

* Jeremy Fitzhardinge <jeremy DeleteThis @goop.org> wrote:

> Ingo Molnar wrote:
> > i'd not even use glibc's %gs but set up two separate selectors.
> > (that's a more controlled experiment - someone might run a non-TLS
> > glibc, etc.)
> >
>
> Well, in that case they probably don't care whether the kernel uses
> %fs or %gs ;)
>
> But either way, this doesn't have much bearing on Eric's test; we'd be
> only talking about a few ns per kernel exit, rather than 5% for
> read/write.

if the timings are different then it very much has bearing on the
argument that i made against the current i386 PDA patchset, that mixed
use segments are suboptimal.

So i'm NAK-ing the i386 PDA patchset until this has been properly
measured (and fixed if needed).

Ingo

Eric Dumazet
Posted: Tue Nov 21, 2006 12:40 pm    Subject: Re: [PATCH] i386-pda UP optimization

On Wednesday 15 November 2006 18:46, Eric Dumazet wrote:
> On Wednesday 15 November 2006 18:24, Andi Kleen wrote:
> > On Wednesday 15 November 2006 18:20, Ingo Molnar wrote:
> > > * Andi Kleen <ak.RemoveThis@suse.de> wrote:
> > > > On Wednesday 15 November 2006 12:27, Eric Dumazet wrote:
> > > > > Seeing %gs prefixes used now by i386 port, I recalled seeing
> > > > > strange oprofile results on Opteron machines.
> > > > >
> > > > > I really think %gs prefixes can be expensive in some (most ?)
> > > > > cases, even if the Intel/AMD docs say they are free.
> > > >
> > > > They aren't free, just very cheap.
> > >
> > > Eric's test shows a 5% slowdown. That's far from cheap.
> >
> > I have my doubts about the accuracy of his test results. That is why I
> > asked him to double check.
>
> Fair enough :)
>
> I plan doing *lot* of tests as soon as possible (not possible during
> daytime unfortunately, I miss a dev machine)
>

I did a *lot* of reboots of my Dell D610 machine, with some trivial benchmarks
using pipe/write()/read(), umask(), or getppid(), with and without oprofile.

I managed to avoid reloading %gs in sysenter_entry
(avoiding the two instructions: movl $(__KERNEL_PDA), %edx; movl %edx, %gs).

I could not avoid reloading %gs in system_call, I don't know why, but modern
glibc uses sysenter so I don't care :)

I confirm I got better results with my patched kernel in all tests I've done.

umask : 12.64 s instead of 12.90 s
getppid : 13.37 s instead of 13.72 s
pipe/read/write : 9.10 s instead of 9.52 s

(I got very different results in the umask() bench after patching it not to use
xchg(), since this instruction is expensive on x86 and really changes oprofile
results. I will submit a patch for this.)

Eric

Jeremy Fitzhardinge
Posted: Tue Nov 21, 2006 10:50 pm    Subject: Re: [PATCH] i386-pda UP optimization

Eric Dumazet wrote:
> I did *lot* of reboots of my Dell D610 machine, with some trivial benchmarks
> using : pipe/write()/read, umask(), or getppid(), using or not oprofile.
>
> I managed to avoid reloading %gs in sysenter_entry .
> (avoiding the two instructions : movl $(__KERNEL_PDA), %edx; movl %edx, %gs
>
> I could not avoid reloading %gs in system_call, I dont know why, but modern
> glibc use sysenter so I dont care :)
>
> I confirm I got better results with my patched kernel in all tests I've done.
>
> umask : 12.64 s instead of 12.90 s
> getppid : 13.37 s instead of 13.72 s
> pipe/read/write : 9.10 s instead of 9.52 s
>
> (I got very different results in umask() bench, patching it not to use xchg(),
> since this instruction is expensive on x86 and really change oprofile
> results. I will submit a patch for this.
>

Could you go into more detail about what you're actually measuring
here? Is it 10,000,000 loops of the single syscall? pipe/read/write
suggests that you're doing at least 2 syscalls per loop, but it takes
the smallest elapsed time.

What are you using as your time reference? Real time? Process time?

For umask/getppid, assuming you're just running 1e7 iterations, you're
seeing differences of about 25 and 35ns per iteration. I wonder
why it would be different for different syscalls; I would expect it to
be a constant overhead either way. Certainly these numbers are much
larger than I saw when I benchmarked pda-vs-nopda using lmbench's null
syscall (getppid) test; I saw an overall 9ns difference in null syscall
time on my Core Duo run at 1GHz. What's your CPU and speed?

One possibility is a cache miss on the gdt while reloading %gs. I've
been planning on a patch to rearrange the gdt in order to pack all the
commonly used segment descriptors into one or two cache lines so that
all the segment register reloads can be done with a minimum of cache
misses. It would be interesting for you to replace the:

movl $(__KERNEL_PDA), %edx; movl %edx, %gs

with an appropriate read of the gdt entry, hm, which is a bit complex to
find.

J
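
To make the cache-line argument above concrete: each i386 segment descriptor is 8 bytes, so the line a descriptor falls on depends only on its index and the line size. The sketch below assumes a 64-byte cache line and a GDT that starts on a line boundary (both are assumptions, not stated in the thread), and the function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GDT_ENTRY_SIZE 8u   /* each i386 segment descriptor is 8 bytes */

/* Cache line, counted from the start of the GDT, that holds a given entry.
 * Assumes the GDT itself begins on a cache-line boundary. */
static size_t gdt_entry_cache_line(unsigned index, size_t line_size)
{
    assert(line_size >= GDT_ENTRY_SIZE && line_size % GDT_ENTRY_SIZE == 0);
    return ((size_t)index * GDT_ENTRY_SIZE) / line_size;
}

/* Number of distinct cache lines touched when every GDT selector in sels[]
 * is loaded once. Uses a 64-bit bitmap, so it handles the first 64 lines. */
static size_t gdt_lines_touched(const uint16_t *sels, size_t n, size_t line_size)
{
    unsigned long long seen = 0;
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        size_t line = gdt_entry_cache_line(sels[i] >> 3, line_size);
        assert(line < 64);
        if (!(seen & (1ULL << line))) {
            seen |= 1ULL << line;
            count++;
        }
    }
    return count;
}
```

Under those assumptions, the three selectors mentioned in this thread (0x33 for user TLS, 0x7b for %ds, 0xd8 for the PDA) sit on lines 0, 1 and 3, so `gdt_lines_touched` reports three lines for that set. Packing the hot descriptors together, as proposed above, would reduce that count.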

Andi Kleen
Posted: Tue Nov 21, 2006 11:00 pm    Subject: Re: [PATCH] i386-pda UP optimization

> For umask/getppid, assuming you're just running 1e7 iterations, you're
> seeing a difference of 25 and 35ns per iteration difference. I wonder
> why it would be different for different syscalls; I would expect it to
> be a constant overhead either way.

They got different numbers of current references?

> Certainly these numbers are much
> larger than I saw when I benchmarked pda-vs-nopda using lmbench's null
> syscall (getppid) test; I saw an overall 9ns difference in null syscall
> time on my Core Duo run at 1GHz. What's your CPU and speed?
>
> One possibility is a cache miss on the gdt while reloading %gs. I've

On such micro benchmarks everything should be cache hot in theory
(unless it's a system with really small cache)

> been planning on a patch to rearrange the gdt in order to pack all the
> commonly used segment descriptors into one or two cache lines so that
> all the segment register reloads can be done with a minimum of cache
> misses. It would be interesting for you to replace the:
>
> movl $(__KERNEL_PDA), %edx; movl %edx, %gs
>
> with an appropriate read of the gdt entry, hm, which is a bit complex to
> find.

On UP it could be hardcoded. And oprofile can be used to profile for cache misses.

-Andi

Jeremy Fitzhardinge
Posted: Tue Nov 21, 2006 11:20 pm    Subject: Re: [PATCH] i386-pda UP optimization

Andi Kleen wrote:
>> For umask/getppid, assuming you're just running 1e7 iterations, you're
>> seeing a difference of 25 and 35ns per iteration difference. I wonder
>> why it would be different for different syscalls; I would expect it to
>> be a constant overhead either way.
>>
>
> They got different numbers of current references?
>

My understanding is that Eric has changed UP current (and other PDA ops)
to not touch %gs at all, and the difference in reported times is due to
omitting the %gs load in entry.S (though %gs is still saved/restored on
the stack).

> On such micro benchmarks everything should be cache hot in theory
> (unless it's a system with really small cache)
>

Yes, that would be my thought too. Maybe there's excessive aliasing on one
of the ways, though I think he's using a Pentium M, which has an 8-way L1.

>> been planning on a patch to rearrange the gdt in order to pack all the
>> commonly used segment descriptors into one or two cache lines so that
>> all the segment register reloads can be done with a minimum of cache
>> misses. It would be interesting for you to replace the:
>>
>> movl $(__KERNEL_PDA), %edx; movl %edx, %gs
>>
>> with an appropriate read of the gdt entry, hm, which is a bit complex to
>> find.
>>
>
> On UP it could be hardcoded. And oprofile can be used to profile for cache misses.
>

Yes, assuming oprofile doesn't interfere with things too much.
Actually, just counting cache miss events during the course of a syscall
would be most interesting (ie, no need to sample).


J
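
The UP change described above (PDA accesses that skip the segment register entirely) can be pictured with a user-space analogy. This is only a sketch: `read_pda`/`write_pda` mirror the kernel's accessor names, but the bodies, `cpu_pda_table`, `current_cpu` and `NR_CPUS` are hypothetical stand-ins, and plain array indexing takes the place of the real %gs/%fs-relative addressing.

```c
#include <assert.h>

#define NR_CPUS 4   /* hypothetical CPU count for the sketch */
static_assert(NR_CPUS > 0, "need at least one CPU");

/* Cut-down stand-in for struct i386_pda; field names follow the patch. */
struct pda {
    int   cpu_number;
    void *pcurrent;     /* stand-in for the current task pointer */
};

/* The boot CPU's PDA; on UP it is the only one that ever exists. */
static struct pda boot_pda = { .cpu_number = 0, .pcurrent = 0 };

#ifdef CONFIG_SMP
/* SMP: every access is indirect. In the kernel the indirection is the
 * segment base; here it is modelled as a table lookup. */
static struct pda *cpu_pda_table[NR_CPUS] = { &boot_pda };
static int current_cpu;
#define read_pda(field)        (cpu_pda_table[current_cpu]->field)
#define write_pda(field, val)  (cpu_pda_table[current_cpu]->field = (val))
#else
/* UP: the indirection disappears and accesses hit boot_pda directly. */
#define read_pda(field)        (boot_pda.field)
#define write_pda(field, val)  (boot_pda.field = (val))
#endif
```

Built without `CONFIG_SMP`, `read_pda(cpu_number)` compiles to a plain load from `boot_pda`; built with it, the same source goes through the per-CPU lookup. The callers do not change, which is the property the UP optimization relies on.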

Jeremy Fitzhardinge
Posted: Wed Nov 22, 2006 12:20 am    Subject: Re: [PATCH] i386-pda UP optimization

Eric Dumazet wrote:
> for umask/getppid(), its a basic loop with 100.000.000 iterations

Ah, OK, so there's about 2.5-3.5ns difference due to the instructions
you removed. That's very much in line with what I saw in my measurements.

> for read/write(), loop with 10.000.000 iterations

2 syscalls/iteration? It's interesting you measured about the same
absolute time difference (.42s) even though you're doing 1/5th the
number of syscalls.

> elapsed time (/usr/bin/time ./prog)
> 10 runs, and the minimum time is taken.

Hm, but "time" measures user, system and real time. You used real time?

> Hum... Do you mean a cache miss every time we do a syscall ? What
> could invalidate this cache exactly ?

Well, there might be a miss simply because the line got evicted. But as
Andi pointed out, a hot benchmark like this is very unlikely to get any
cache misses unless there's something very unfortunate happening.

J
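
The per-iteration figures in this post are simple arithmetic on Eric's totals and iteration counts. A minimal sketch, with a made-up helper name, assuming the reported times are wall-clock seconds for the whole loop:

```c
#include <assert.h>

/* Saving per loop iteration, in nanoseconds, given the total run time in
 * seconds before and after a change and the number of loop iterations. */
static double saving_ns_per_iter(double before_s, double after_s,
                                 double iterations)
{
    assert(iterations > 0);
    return (before_s - after_s) / iterations * 1e9;
}
```

Plugging in the numbers from the thread: `saving_ns_per_iter(12.90, 12.64, 1e8)` is about 2.6 ns for umask, `saving_ns_per_iter(13.72, 13.37, 1e8)` is about 3.5 ns for getppid, and `saving_ns_per_iter(9.52, 9.10, 1e7)` is about 42 ns per pipe iteration, or roughly 21 ns per syscall if each iteration really makes two syscalls. That last figure is what makes the pipe result stand out from the other two.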

Eric Dumazet
Posted: Wed Nov 22, 2006 2:40 am    Subject: Re: [PATCH] i386-pda UP optimization

Jeremy Fitzhardinge a écrit :
> Eric Dumazet wrote:
>> I did *lot* of reboots of my Dell D610 machine, with some trivial benchmarks
>> using : pipe/write()/read, umask(), or getppid(), using or not oprofile.
>>
>> I managed to avoid reloading %gs in sysenter_entry .
>> (avoiding the two instructions : movl $(__KERNEL_PDA), %edx; movl %edx, %gs
>>
>> I could not avoid reloading %gs in system_call, I dont know why, but modern
>> glibc use sysenter so I dont care :)
>>
>> I confirm I got better results with my patched kernel in all tests I've done.
>>
>> umask : 12.64 s instead of 12.90 s
>> getppid : 13.37 s instead of 13.72 s
>> pipe/read/write : 9.10 s instead of 9.52 s
>>
>> (I got very different results in umask() bench, patching it not to use xchg(),
>> since this instruction is expensive on x86 and really change oprofile
>> results. I will submit a patch for this.
>>
>
> Could you go into more detail about what you're actually measuring
> here? Is it 10,000,000 loops of the single syscall? pipe/read/write
> suggests that you're doing at least 2 syscalls per loop, but it takes
> the smallest elapsed time.

for umask/getppid(), it's a basic loop with 100.000.000 iterations
for read/write(), a loop with 10.000.000 iterations
>
> What are you using as your time reference? Real time? Process time?
>

elapsed time (/usr/bin/time ./prog)
10 runs, and the minimum time is taken.

> For umask/getppid, assuming you're just running 1e7 iterations, you're
> seeing a difference of 25 and 35ns per iteration difference. I wonder
> why it would be different for different syscalls; I would expect it to
> be a constant overhead either way. Certainly these numbers are much
> larger than I saw when I benchmarked pda-vs-nopda using lmbench's null
> syscall (getppid) test; I saw an overall 9ns difference in null syscall
> time on my Core Duo run at 1GHz. What's your CPU and speed?

It's a 1.6GHz Pentium-M CPU (Dell D610)

>
> One possibility is a cache miss on the gdt while reloading %gs. I've
> been planning on a patch to rearrange the gdt in order to pack all the
> commonly used segment descriptors into one or two cache lines so that
> all the segment register reloads can be done with a minimum of cache
> misses. It would be interesting for you to replace the:
>
> movl $(__KERNEL_PDA), %edx; movl %edx, %gs
>
> with an appropriate read of the gdt entry, hm, which is a bit complex to
> find.
>

Hum... Do you mean a cache miss every time we do a syscall ? What could
invalidate this cache exactly ?



Jeremy Fitzhardinge
Posted: Wed Nov 29, 2006 12:20 am    Subject: Re: [PATCH] i386-pda UP optimization

Eric Dumazet wrote:
> Seeing %gs prefixes used now by i386 port, I recalled seeing strange oprofile
> results on Opteron machines.

Hi Eric,

Could you try this patch out and see if it makes much performance
difference for you. You should apply this on top of the %fs patch I
posted earlier (and use the %fs patch as the baseline for your comparisons).

Thanks,
J

Don't bother with segment references for UP PDA

When compiled for UP, don't bother prefixing PDA references with a
segment override. Also doesn't bother reloading the PDA segment
register (though it still gets saved and restored, because the value
is used elsewhere in the kernel, and the restore is necessary for
correct context switches).

I'm not very keen on the extra #ifdefs this adds, though I've tried to
keep them minimal. Eric Dumazet reports small performance gains from a
similar patch, however.

Signed-off-by: Jeremy Fitzhardinge <jeremy DeleteThis @xensource.com>
Cc: Ingo Molnar <mingo DeleteThis @elte.hu>
Cc: Andi Kleen <andi DeleteThis @muc.de>
Cc: Eric Dumazet <dada1 DeleteThis @cosmosbay.com>

diff -r 022c29ea754e arch/i386/kernel/cpu/common.c
--- a/arch/i386/kernel/cpu/common.c Tue Nov 21 18:54:56 2006 -0800
+++ b/arch/i386/kernel/cpu/common.c Wed Nov 22 01:54:02 2006 -0800
@@ -628,7 +628,11 @@ static __cpuinit int alloc_gdt(int cpu)
BUG_ON(gdt != NULL || pda != NULL);

gdt = alloc_bootmem_pages(PAGE_SIZE);
+#ifndef CONFIG_SMP
+ pda = &boot_pda;
+#else
pda = alloc_bootmem(sizeof(*pda));
+#endif
/* alloc_bootmem(_pages) panics on failure, so no check */

memset(gdt, 0, PAGE_SIZE);
@@ -661,6 +665,10 @@ struct i386_pda boot_pda = {
.cpu_number = 0,
.pcurrent = &init_task,
};
+#ifndef CONFIG_SMP
+/* boot_pda is used for all PDA access in UP */
+EXPORT_SYMBOL(boot_pda);
+#endif

static inline void set_kernel_fs(void)
{
diff -r 022c29ea754e arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S Tue Nov 21 18:54:56 2006 -0800
+++ b/arch/i386/kernel/entry.S Wed Nov 22 13:38:56 2006 -0800
@@ -97,6 +97,16 @@ 1:
#define resume_userspace_sig resume_userspace
#endif

+#ifdef CONFIG_SMP
+#define LOAD_PDA_SEG(reg) \
+ movl $(__KERNEL_PDA), reg; \
+ movl reg, %fs
+#define CUR_CPU(reg) movl %fs:PDA_cpu, reg
+#else
+#define LOAD_PDA_SEG(reg)
+#define CUR_CPU(reg) movl boot_pda+PDA_cpu, reg
+#endif
+
#define SAVE_ALL \
cld; \
pushl %fs; \
@@ -132,8 +142,7 @@ 1:
movl $(__USER_DS), %edx; \
movl %edx, %ds; \
movl %edx, %es; \
- movl $(__KERNEL_PDA), %edx; \
- movl %edx, %fs
+ LOAD_PDA_SEG(%edx)

#define RESTORE_INT_REGS \
popl %ebx; \
@@ -546,7 +555,7 @@ syscall_badsys:

#define FIXUP_ESPFIX_STACK \
/* since we are on a wrong stack, we cant make it a C code :( */ \
- movl %fs:PDA_cpu, %ebx; \
+ CUR_CPU(%ebx); \
PER_CPU(cpu_gdt_descr, %ebx); \
movl GDS_address(%ebx), %ebx; \
GET_DESC_BASE(GDT_ENTRY_ESPFIX_SS, %ebx, %eax, %ax, %al, %ah); \
diff -r 022c29ea754e include/asm-i386/pda.h
--- a/include/asm-i386/pda.h Tue Nov 21 18:54:56 2006 -0800
+++ b/include/asm-i386/pda.h Wed Nov 22 02:35:24 2006 -0800
@@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[];

#define cpu_pda(i) (_cpu_pda[i])

+/* Use boot-time PDA for UP. For SMP we still need to declare it, but
+ it isn't used. */
+extern struct i386_pda boot_pda;
+
+#ifdef CONFIG_SMP
+#define PDA_REF "%%fs:%c[off]"
+#else
+#define PDA_REF "%[mem]"
+#endif
+
#define pda_offset(field) offsetof(struct i386_pda, field)

extern void __bad_pda_field(void);
@@ -33,28 +43,31 @@ extern void __bad_pda_field(void);
clobbers, so gcc can readily analyse them. */
extern struct i386_pda _proxy_pda;

-#define pda_to_op(op,field,val) \
+#define pda_to_op(op,field,_val) \
do { \
typedef typeof(_proxy_pda.field) T__; \
- if (0) { T__ tmp__; tmp__ = (val); } \
+ if (0) { T__ tmp__; tmp__ = (_val); } \
switch (sizeof(_proxy_pda.field)) { \
case 1: \
- asm(op "b %1,%%fs:%c2" \
- : "+m" (_proxy_pda.field) \
- :"ri" ((T__)val), \
- "i"(pda_offset(field))); \
+ asm(op "b %[val]," PDA_REF \
+ : "+m" (_proxy_pda.field), \
+ [mem] "+m" (boot_pda.field) \
+ : [val] "ri" ((T__)_val), \
+ [off] "i" (pda_offset(field))); \
break; \
case 2: \
- asm(op "w %1,%%fs:%c2" \
- : "+m" (_proxy_pda.field) \
- :"ri" ((T__)val), \
- "i"(pda_offset(field))); \
+ asm(op "w %[val]," PDA_REF \
+ : "+m" (_proxy_pda.field), \
+ [mem] "+m" (boot_pda.field) \
+ : [val] "ri" ((T__)_val), \
+ [off] "i" (pda_offset(field))); \
break; \
case 4: \
- asm(op "l %1,%%fs:%c2" \
- : "+m" (_proxy_pda.field) \
- :"ri" ((T__)val), \
- "i"(pda_offset(field))); \
+ asm(op "l %[val]," PDA_REF \
+ : "+m" (_proxy_pda.field), \
+ [mem] "+m" (boot_pda.field) \
+ : [val] "ri" ((T__)_val), \
+ [off] "i" (pda_offset(field))); \
break; \
default: __bad_pda_field(); \
} \
@@ -65,22 +78,25 @@ extern struct i386_pda _proxy_pda;
typeof(_proxy_pda.field) ret__; \
switch (sizeof(_proxy_pda.field)) { \
case 1: \
- asm(op "b %%fs:%c1,%0" \
- : "=r" (ret__) \
- : "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
+ asm(op "b " PDA_REF ",%[ret]" \
+ : [ret] "=r" (ret__) \
+ : [off] "i" (pda_offset(field)), \
+ "m" (_proxy_pda.field), \
+ [mem] "m" (boot_pda.field)); \
break; \
case 2: \
- asm(op "w %%fs:%c1,%0" \
- : "=r" (ret__) \
- : "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
+ asm(op "w " PDA_REF ",%[ret]" \
+ : [ret] "=r" (ret__) \
+ : [off] "i" (pda_offset(field)), \
+ "m" (_proxy_pda.field), \
+ [mem] "m" (boot_pda.field)); \
break; \
case 4: \
- asm(op "l %%fs:%c1,%0" \
- : "=r" (ret__) \
- : "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
+ asm(op "l " PDA_REF ",%[ret]" \
+ : [ret] "=r" (ret__) \
+ : [off] "i" (pda_offset(field)), \
+ "m" (_proxy_pda.field), \
+ [mem] "m" (boot_pda.field)); \
break; \
default: __bad_pda_field(); \
} \


Eric Dumazet
Posted: Wed Nov 29, 2006 10:40 am    Post subject: Re: [PATCH] i386-pda UP optimization

On Wednesday 29 November 2006 00:12, Jeremy Fitzhardinge wrote:

> Hi Eric,
>
> Could you try this patch out and see if it makes much performance
> difference for you. You should apply this on top of the %fs patch I
> posted earlier (and use the %fs patch as the baseline for your
> comparisons).

Hi Jeremy

I will try this as soon as possible, thank you.

However, I have some remarks from browsing your patch.


> +#ifdef CONFIG_SMP
> +#define LOAD_PDA_SEG(reg) \
> + movl $(__KERNEL_PDA), reg; \
> + movl reg, %fs
> +#define CUR_CPU(reg) movl %fs:PDA_cpu, reg
> +#else
> +#define LOAD_PDA_SEG(reg)
> +#define CUR_CPU(reg) movl boot_pda+PDA_cpu, reg

If !CONFIG_SMP, why even dereference boot_pda+PDA_cpu to get 0? And since
PER_CPU(cpu_gdt_descr, %ebx) in the !CONFIG_SMP case doesn't need a value in
%ebx, you can just do:

#define CUR_CPU(reg) /* nothing */
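
To illustrate why the CPU number is dead weight on UP, here is a minimal
user-space C sketch (not kernel code, and not taken from either patch). The
names NR_CPUS_SKETCH, cpu_gdt_descr_sketch, per_cpu_ptr_sketch and
cur_cpu_sketch are hypothetical stand-ins. The only point is that the UP
variant of a per-CPU lookup never expands its CPU argument, so whatever
computes that argument can be empty.

```c
#include <assert.h>

/* Hypothetical stand-in for a per-CPU variable such as cpu_gdt_descr. */
#ifdef CONFIG_SMP
#define NR_CPUS_SKETCH 4
#else
#define NR_CPUS_SKETCH 1
#endif

int cpu_gdt_descr_sketch[NR_CPUS_SKETCH];

/* Counts how often the "current CPU" lookup actually runs. */
int cur_cpu_calls;

int cur_cpu_sketch(void)
{
	cur_cpu_calls++;
	return 0;	/* this sketch always reports CPU 0 */
}

#ifdef CONFIG_SMP
/* SMP: the CPU index selects the per-CPU slot. */
#define per_cpu_ptr_sketch(var, cpu)	(&(var)[(cpu)])
#else
/* UP: there is only one slot, so the cpu argument is never expanded. */
#define per_cpu_ptr_sketch(var, cpu)	(&(var)[0])
#endif

/* Returns the address of the current CPU's descriptor slot. */
int *current_gdt_descr_sketch(void)
{
	return per_cpu_ptr_sketch(cpu_gdt_descr_sketch, cur_cpu_sketch());
}

/* Usage: on a UP build the lookup works without ever asking for the CPU. */
void per_cpu_sketch_demo(void)
{
	int before = cur_cpu_calls;
	int *slot = current_gdt_descr_sketch();

	assert(slot == &cpu_gdt_descr_sketch[0]);
#ifndef CONFIG_SMP
	assert(cur_cpu_calls == before);
#else
	assert(cur_cpu_calls == before + 1);
#endif
}
```

Built without CONFIG_SMP defined, the call to cur_cpu_sketch() disappears at
preprocessing time, which is the C-level analogue of an empty CUR_CPU(reg).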


> --- a/include/asm-i386/pda.h Tue Nov 21 18:54:56 2006 -0800
> +++ b/include/asm-i386/pda.h Wed Nov 22 02:35:24 2006 -0800
> @@ -22,6 +22,16 @@ extern struct i386_pda *_cpu_pda[];
>

My patch was better IMHO: we don't need to force asm() instructions to
perform regular C variable reads and writes in the !CONFIG_SMP case.

Using plain C allows the compiler to generate better code.
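
As a rough illustration of the plain-C approach (a sketch under assumptions,
not the actual patch referred to above, which is not shown here), UP-only PDA
accessors could be ordinary expressions on the single boot-time PDA. The
struct pda_sketch layout, its irq_count field, and every *_sketch name are
hypothetical; an SMP build would keep the %fs-relative asm accessors, which
are omitted.

```c
#include <assert.h>

/* Hypothetical, cut-down stand-in for struct i386_pda. */
struct pda_sketch {
	int cpu_number;
	unsigned long irq_count;	/* illustrative field, not from the patch */
};

/* Plays the role of boot_pda: the single PDA on a UP build. */
struct pda_sketch boot_pda_sketch = { .cpu_number = 0, .irq_count = 0 };

/*
 * UP-only accessors written as ordinary C expressions, so the compiler can
 * combine, reorder, or constant-fold them like any other global access.
 */
#define read_pda_sketch(field)		(boot_pda_sketch.field)
#define write_pda_sketch(field, val)	((void)(boot_pda_sketch.field = (val)))
#define add_pda_sketch(field, val)	((void)(boot_pda_sketch.field += (val)))

/* Example user: the compiler sees a normal global load, add and store. */
unsigned long bump_irq_count_sketch(void)
{
	add_pda_sketch(irq_count, 1);
	return read_pda_sketch(irq_count);
}

/* Usage: write a field, bump it, and read it back. */
void pda_sketch_demo(void)
{
	write_pda_sketch(irq_count, 41);
	assert(bump_irq_count_sketch() == 42);
	assert(read_pda_sketch(cpu_number) == 0);
}
```

Because these are plain lvalue accesses rather than opaque asm statements,
the optimizer is free to merge adjacent accesses or keep a value in a
register across them.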

Eric