Faster calls (again)

Faster calls (again)

Arne Goedeke
Ok, I'm not sure yet if this is going anywhere this time, but I decided
to make a second attempt at refactoring the current function call code
into something which is more readable, more flexible and hopefully (at
some point) faster. The branch is called 'arne/faster_calls_again'.

The API is similar to the previous version's, the main difference being
that it allocates pike frames only when needed (not for efuns, casts,
apply_array, etc.). The changes in that branch start by removing code
which had been duplicated at some point to increase performance for
certain types of calls. I then added a series of new functions to
initialize, execute and deinitialize a callsite. Finally, I started to
use the new functions in some of the existing function call APIs and
also implemented the optimized version of f_map.
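
To make the intended call sequence a bit more concrete, here is a small,
self-contained toy in C. It is not code from the branch; the struct
layout, the signatures and the callsite_free() name are made up for
illustration, and only the init/execute/deinitialize shape and the
"frame only when needed" rule follow the description above.

#include <stdio.h>
#include <stdlib.h>

enum callee_kind { CALLEE_BUILTIN, CALLEE_INTERPRETED };

struct frame { int num_locals; };          /* stand-in for a pike_frame */

struct callsite {
    enum callee_kind kind;
    int num_args;
    struct frame *frame;                   /* NULL unless really needed */
};

static void callsite_init(struct callsite *c, enum callee_kind kind, int args)
{
    c->kind = kind;
    c->num_args = args;
    /* Builtins (efuns, casts, ...) run directly on the value stack and
     * never get a frame; only interpreted functions allocate one. */
    c->frame = (kind == CALLEE_INTERPRETED) ? calloc(1, sizeof *c->frame) : NULL;
}

static void callsite_execute(struct callsite *c)
{
    if (c->kind == CALLEE_BUILTIN)
        printf("builtin call, %d args, no frame\n", c->num_args);
    else
        printf("interpreted call, %d args, frame at %p\n",
               c->num_args, (void *)c->frame);
}

static void callsite_free(struct callsite *c)
{
    free(c->frame);                        /* no-op for builtins */
    c->frame = NULL;
}

int main(void)
{
    struct callsite c;

    callsite_init(&c, CALLEE_BUILTIN, 2);     /* e.g. an efun: no frame */
    callsite_execute(&c);
    callsite_free(&c);

    callsite_init(&c, CALLEE_INTERPRETED, 2); /* a pike-level function */
    callsite_execute(&c);
    callsite_free(&c);
    return 0;
}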

The code is probably broken in interesting ways, since I have not tested
it very well yet. It compiles and installs.

Otherwise on my list right now:

* cleanup
* reduce API granularity somewhat
* re-add tracing and dtrace code (could all be in one place now)
* optimize automap
* try to optimize apply_array (e.g. when calling the same function in
  different instances of the same class)
* make tail recursion re-use the current frame instead of allocating a
  new one (sketched below)
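
As a minimal, self-contained illustration of the tail recursion item
(again not Pike code, just the general idea of overwriting the current
frame's locals and jumping back instead of pushing a new frame):

#include <stdio.h>

struct frame { long n; long acc; };    /* pretend these are the locals */

/* factorial written as if the interpreter had rewritten the self tail
 * call into "reset the frame and loop" */
static long tail_call_factorial(struct frame *f)
{
restart:
    if (f->n <= 1)
        return f->acc;
    f->acc *= f->n;                    /* "tail call": reuse the frame, */
    f->n -= 1;                         /* just update its locals ...    */
    goto restart;                      /* ... and start over            */
}

int main(void)
{
    struct frame f = { 10, 1 };
    printf("10! = %ld\n", tail_call_factorial(&f));    /* 3628800 */
    return 0;
}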

Feedback and help welcome,

Arne

Re: Faster calls (again)

Peter Bortas

I took this as an opportunity to upgrade and modernize the Coverity
Scan machine. It's now running Ubuntu 16.04 instead of 12.04 and using
Scan 8.7.0, up from 7.7.0.4. Most importantly, it's now much easier to
automate and to switch which branch is built in the Pike-experiment
project, and it's now running against the arne/faster_calls_again
branch.

There have been two builds in that project today. The first one was
against 8.1 git to get a baseline, and the second against your branch.
The only new complaint was CID 1400860.

As with the rest of the projects, it will build and analyze once per
day if something has changed. Like the other projects, though, Scan
starts to reject our builds after about three builds per week, and I
just used up two.

Regards,
--
Peter Bortas

Re: Faster calls (again)

Martin Karlgren
Hi Arne,

That’s awesome!

I’d love to help (with the limited spare time I have). I guess low_mega_apply should be refactored to make use of the new API too?

Speaking of faster calls, I’ve incidentally been poking around a bit with machine code function calling conventions lately. For profiling purposes (specifically, Linux perf) I’ve added minimal call frame information to Pike functions in the amd64 machine code generator. I’ve gotten to the point where I can start Roxen and get proper stack traces from perf, but the testsuite still fails – it seems related to decoding of dumped bytecode, and I haven’t been able to sort out why.
Anyway, the good thing is that ready-made visualisation tools built on perf output can be used to profile Pike code, and the interaction between Pike code and C functions becomes more apparent.
Examples from a very simple Roxen site being hit by apachebench:
http://marty.se/dotgraph.png (nodes with a “perf-17628.map” header represent Pike functions)
http://marty.se/flamegraph.svg (time on the horizontal axis, stack depth on the vertical axis).
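
For reference, the map files perf reads for JIT-generated code are plain text in /tmp/perf-<pid>.map, one “START SIZE name” line per function with START and SIZE in hex. Below is a rough, self-contained C sketch of a writer for that format; it is not the actual Debug.generate_perf_map() implementation, and the address, size and symbol name are just placeholders.

#include <stdio.h>
#include <unistd.h>

/* append one symbol entry in the format perf expects */
static void write_perf_map_entry(FILE *map, unsigned long start,
                                 unsigned long size, const char *name)
{
    fprintf(map, "%lx %lx %s\n", start, size, name);
}

int main(void)
{
    char path[64];
    FILE *map;

    snprintf(path, sizeof path, "/tmp/perf-%d.map", (int)getpid());
    map = fopen(path, "w");
    if (!map)
        return 1;

    /* In a real interpreter these would be the addresses and sizes of
     * the generated machine code for each Pike function. */
    write_perf_map_entry(map, 0x400000UL, 0x120UL, "SomeProgram()->some_function");

    fclose(map);
    return 0;
}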

Hopefully this can eventually be used to weed out where we should start looking for optimisation candidates.

/Marty

Re: Faster calls (again)

Arne Goedeke
Hi Marty,

Thanks!

Yes, low_mega_apply still needs to be refactored. It is slightly more
"complicated" because of APPLY_STACK, where the return value will
overwrite the function on the stack. I want to fix the last crash in the
testsuite before refactoring that. If you are interested in working on
it, just let me know so we don't both do it ;)
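
To illustrate what I mean with APPLY_STACK, here is a self-contained toy
(not the actual interpreter code; the stack of ints and the callback are
made up): the callee sits on the value stack below its arguments, and
after the call the return value has to end up in the slot the callee
occupied, with the arguments popped.

#include <stdio.h>

#define STACK_MAX 16

static int stack[STACK_MAX];
static int sp;                           /* next free slot */

static void push(int v) { stack[sp++] = v; }

/* "apply what is on the stack": layout is [..., fn, arg0, ..., argN-1] */
static void apply_stack(int num_args, int (*call)(const int *, int))
{
    int fn_slot = sp - num_args - 1;
    int ret = call(&stack[fn_slot + 1], num_args); /* run the callee    */
    stack[fn_slot] = ret;                /* return value overwrites fn  */
    sp = fn_slot + 1;                    /* ... and the args are gone   */
}

static int sum(const int *args, int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++)
        s += args[i];
    return s;
}

int main(void)
{
    push(42);                            /* pretend 42 names the callee */
    push(1); push(2); push(3);           /* three arguments             */
    apply_stack(3, sum);
    printf("top of stack: %d (sp = %d)\n", stack[sp - 1], sp); /* 6, 1  */
    return 0;
}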

Adding more perf support would be great. Do you have your code in a
branch somewhere? I would be interested in having a look at it.

Arne


Re: Faster calls (again)

Martin Karlgren
Hi Arne,

Alright. Any idea what the crash might be related to?

I’ve pushed the marty/call_frames branch now. As mentioned, something breaks when precompiled bytecode is decoded, so many testsuite tests will segfault (since they are precompiled).

Compiling --with-mc-stack-frames and running the very nice Debug.generate_perf_map() (previously implemented by TobiJ) should enable perf to extract what’s needed. I’ve used https://github.com/jrfonseca/gprof2dot and http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html for visualisation.

/Marty


Re: Faster calls (again)

Arne Goedeke
I am not quite sure, since I have not had the time to look into it yet.
My feeling is that callsite_reset() is currently broken, probably when
trampolines are used. It's probably easy to fix. I was also planning to
write a couple of tests which try to cover all possible paths of the
function call API; having to run the full testsuite can be quite annoying.

I also started adding some benchmarks for function calls to the
pike-benchmark repo. That might make it easier to tweak specific
optimizations.

Arne


Re: Faster calls (again)

Arne Goedeke
I think I managed to fix the last issue. I had somehow confused things
and removed the locals from the stack before unlinking the stack frame,
which of course broke trampolines. I also ended up rebasing the branch
to get rid of the reverts I did at some point.
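
As a self-contained illustration of why the order matters (not the real
code; the frame struct, refcount and value stack here are invented): if
a frame can outlive the call, for example because a trampoline still
holds a reference to it, its locals have to be copied off the shared
value stack before that stack space is reclaimed. Popping the locals
first leaves the surviving frame pointing at slots that are about to be
reused.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct frame {
    int refs;           /* >1 means e.g. a trampoline still refers to it */
    int num_locals;
    int *locals;        /* points into the value stack while running     */
    int *saved;         /* heap copy, filled in when the frame survives  */
};

static int value_stack[32];
static int sp;

static void unlink_frame(struct frame *f)
{
    if (--f->refs > 0) {
        /* frame survives: detach its locals from the value stack now */
        f->saved = malloc(f->num_locals * sizeof *f->saved);
        memcpy(f->saved, f->locals, f->num_locals * sizeof *f->saved);
        f->locals = f->saved;
    }
}

int main(void)
{
    struct frame f;

    value_stack[sp++] = 7;              /* one local on the value stack  */
    f.refs = 2;                         /* a trampoline holds a ref      */
    f.num_locals = 1;
    f.locals = &value_stack[0];
    f.saved = NULL;

    unlink_frame(&f);                   /* 1) detach the locals first    */
    sp = 0;                             /* 2) then pop the value stack   */

    value_stack[0] = -1;                /* the stack slot gets reused    */
    printf("trampoline still sees local = %d\n", f.locals[0]);  /* 7     */

    free(f.saved);
    return 0;
}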

The current state passes the testsuite (the same tests as 8.1, at least).
Performance-wise it is roughly where 8.1 is, except that map/automap is
significantly faster. There are currently some slowdowns, which are due
to my removing some fast paths from the F_CALL_OTHER opcode. I will look
into that.

I re-added most of the tracing code; however, some of it is unfinished
and DTrace is probably broken. I have also not looked at PROFILING yet,
so that is probably not right either.

Side note: profiling unfortunately does not work properly when fork()ing,
because the timers change. It might even crash when running in debug mode
because of that. But that is probably just a bug we need to fix.

What's currently left on my list before proposing a merge into 8.1/8.3:

* Make sure the map/automap optimizations do not break in pathological
  cases (e.g. objects being destructed or similar).
* Maybe think about the API again (e.g. callsite_execute and
  callsite_return could be merged, and likewise
  callsite_init/callsite_set_args).

Otherwise I played around with adding frame caching to apply_array,
which looks promising performance-wise. However, it takes some attention
to make sure the stack traces are always correct. This would be a good
test case for caching frames in general.
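
Roughly the idea, as a self-contained toy sketch (not the actual
apply_array code; the structs and the example class/function names are
made up): when the same function is applied to elements that are all
instances of the same class, one frame can be set up once and reused,
with only the current object swapped per element, while the frame keeps
pointing at the right function so a backtrace taken mid-call stays
correct.

#include <stdio.h>

struct object { int id; };

struct frame {
    const char    *fun_name;    /* what a backtrace would print         */
    struct object *current_obj; /* swapped per element; everything else */
};                              /* in the frame stays as it is          */

static int call_in_object(struct frame *f, struct object *o, int arg)
{
    f->current_obj = o;         /* reuse the cached frame, just retarget */
    return o->id + arg;         /* stand-in for running the function    */
}

int main(void)
{
    struct object objs[3] = { { 10 }, { 20 }, { 30 } };
    struct frame cached = { "SomeClass()->add", NULL };   /* set up once */
    int i;

    for (i = 0; i < 3; i++)
        printf("%s on object %d -> %d\n",
               cached.fun_name, objs[i].id,
               call_in_object(&cached, &objs[i], 5));
    return 0;
}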

Anyway, feedback welcome, as usual,

Arne

Re: Faster calls (again)

Martin Karlgren
Hi Arne,

Great news! I can verify that Roxen seems to run just fine on it, too.

One thing I noticed is that it doesn’t seem to compile with --with-debug: the PIKE_NEEDS_TRACE macro (defined in interpret.c) can’t be resolved in interpret_functions.h. I’m not sure where it should reside instead.

I’d definitely vote for a merge to 8.1 – there’s no feature freeze in place yet, is there?
After the merge, I’d like to rebase/merge the call_frames branch on top of this, unless someone disagrees. That should enable further profiling/optimization iterations.

Best regards,
/Marty


Re: Faster calls (again)

Peter Bortas
On Sun, Mar 19, 2017 at 4:53 PM, Martin Karlgren <[hidden email]> wrote:
> I’d definitely vote for a merge to 8.1 – there’s no feature freeze in place yet, is there?

8.2 isn't forked out yet, so there is no freeze. The fork will
probably happen after Easter, but a freeze will not be put in place
immediately after that.

Regards,
--
Peter Bortas, NSC