8.0 test suite failures: fork() + time on macOS 10.12+

H William Welliver
I’m running builds of 8.0 to make sure we don’t have any major test failures, and I’ve run into a few problems so far. I’ll put them in separate emails so they are more manageable. If anyone can offer any assistance, that would be most appreciated. I can supply any info needed, up to getting you a logon to the systems in question.

First up: on macOS 10.12+, socktest.pike hangs. 10.11 and earlier do not have this problem, and I haven’t tried running an older binary on a newer OS. The call to gc() in finish() never returns, and according to LLDB:

(lldb) thread list
Process 4746 stopped
* thread #1: tid = 0xa7cc2d, 0x00007fff781f922a libsystem_kernel.dylib`mach_msg_trap + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff781f922a libsystem_kernel.dylib`mach_msg_trap + 10
    frame #1: 0x00007fff781f976c libsystem_kernel.dylib`mach_msg + 60
    frame #2: 0x00007fff781fd05e libsystem_kernel.dylib`clock_get_time + 85
    frame #3: 0x0000000105974667 pike`mach_clock_get_time at rusage.c:633:7
    frame #4: 0x00000001058e713b pike`do_gc(ignored_UNUSED=<unavailable>, explicit_call=<unavailable>) at gc.c:3507:24
    frame #5: 0x00000001059ee723 pike`f_gc(args=<unavailable>) at builtin_functions.c:5126:11
    frame #6: 0x00000001063607b7
    frame #7: 0x00000001058a365a pike`mega_apply [inlined] eval_instruction(pc=<unavailable>) at interpret.c:1711:5
    frame #8: 0x00000001058a3658 pike`mega_apply(type=<unavailable>, args=<unavailable>, arg1=<unavailable>, arg2=<unavailable>) at interpret.c:2695
    frame #9: 0x000000010589c72d pike`apply_svalue(s=<unavailable>, args=<unavailable>) at interpret.c:3158:5
    frame #10: 0x0000000105a31769 pike`got_fd_event(box=0x0000000105de5008, event=937461904) at file.c:368:5
    frame #11: 0x00000001058c901f pike`backend_call_active_callbacks(fd_list=0x00007ffeea373ae8, me_UNUSED=<unavailable>) at backend.cmod:2349:6
    frame #12: 0x00000001058c4839 pike`pdb_low_backend_once(pdb=0x00007fef37e086d0, timeout=0x00007ffeea373fa8) at backend.cmod:4137:11
    frame #13: 0x00000001058c4aec pike`f_PollDeviceBackend_cq__backtick_28_29(args=1) at backend.cmod:4315:5
    frame #14: 0x00000001058a1cbc pike`low_mega_apply(type=APPLY_SVALUE, args=1, arg1=<unavailable>, arg2=<unavailable>) at apply_low.h:221:2
    frame #15: 0x00000001058a2753 pike`jump_opcode_F_CALL_FUNCTION_AND_POP at interpret_functions.h:2452:1
    frame #16: 0x00000001061d4348

Some other information I discovered looking into this:

I tried adding a sleep() to the child process so that I could examine it with dtrace, but the sleep() never returned. sleep() works fine when run as pike -e 'sleep(5);'. If I disable the fork(), the test runs successfully.

The following test case demonstrates the problem. The sleep() can be exchanged for gc() and it also hangs.

int main() {
  object pid;
  if (mixed err = catch { pid = fork(); }) {
    werror("fork() failed\n");
  } else if (pid) {
    int res = pid->wait();
    werror("child exited.\n");
    return 0;
  }

  werror("child\n");
  sleep(2);
  werror("slept\n");
  return 0;
}

bin/pike test2.pike 
child

I don’t quite understand why clock_get_time() would hang like that unless there was some sort of problem with the mach clock service across processes, though it wouldn’t surprise me if that were a problem. What is also interesting is that clock_gettime() is available in 10.12 and newer. According to the manpage, this is POSIX compliant and provides CLOCK_MONOTONIC, which is what is used on some other systems. There is a problem in that _POSIX_MONOTONIC_CLOCK is set to -1, which seems to contradict the man page. Not sure if it makes sense to try that instead, or re-initialize the clock service after the fork?

Bill

Re: 8.0 test suite failures: fork() + time on macOS 10.12+

H William Welliver

On Nov 19, 2020, at 2:43 PM, H William Welliver <[hidden email]> wrote:

>I’m running builds of 8.0 to make sure we don’t have any major test failures, and I’ve run into a few problems so far. I’ll put them in separate emails so they are more manageable. If anyone can offer any assistance, that would be most appreciated. I can supply any info needed, up to getting you a logon to the systems in question.
>
>First up, macOS 10.12+ hang on socktest.pike. The 10.11 and earlier do not have this problem, and I haven’t tried running an older binary on a newer OS. The call to gc() in finish() never returns, and according to LLDB:

So, a little more experimentation and it appears that my hunch was correct: mach ports are invalid in the child process, and a call to init_mach_clock() following the fork() seems to restore order. What’s the best approach to make that happen? I see that atfork_child_callback gets called in the child process after fork… is that the approved approach?
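
The re-initialize-after-fork idea can be sketched in plain C. This is an illustration under stated assumptions, not Pike's implementation: it uses POSIX pthread_atfork() as a stand-in for Pike's own atfork callback mechanism, init_clock() is a hypothetical stand-in for init_mach_clock(), and the Mach calls are compiled only on macOS.

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifdef __APPLE__
#include <mach/mach.h>
#include <mach/clock.h>
static clock_serv_t clock_port;  /* Mach port for the clock service */
#endif

static pid_t clock_owner = 0;    /* pid that last initialized the clock */

/* Hypothetical stand-in for init_mach_clock(): (re)acquire the clock
 * service port and record which process owns it. */
void init_clock(void)
{
#ifdef __APPLE__
  host_get_clock_service(mach_host_self(), SYSTEM_CLOCK, &clock_port);
#endif
  clock_owner = getpid();
}

/* Initialize once and arrange for the child side of every fork() to
 * initialize again. Returns 0 on success. */
int install_clock_atfork(void)
{
  init_clock();
  return pthread_atfork(NULL, NULL, init_clock);
}

/* Forks and reports whether the child saw a re-initialized clock:
 * 1 = yes, 0 = no, -1 = fork or wait error. */
int child_sees_reinit(void)
{
  int status;
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0)
    _exit(clock_owner == getpid() ? 0 : 1);
  if (waitpid(pid, &status, 0) < 0)
    return -1;
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling install_clock_atfork() at startup and then child_sees_reinit() should return 1 if the child handler ran. Older C libraries may need -pthread when linking.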

Bill

Re: 8.0 test suite failures: fork() + time on macOS 10.12+

Martin Nilsson (Coppermist) @ Pike (-) developers forum
>So, a little more experimentation and it appears that my hunch was
>correct: mach ports are invalid in the child process, and a call to
>init_mach_clock()following the fork() seems to restore order. What's
>the best approach to make that happen? I see that
>atfork_child_callback get called in the child process after fork… is
>that the approved approach?

This has now been implemented in 8.1, and socktest.pike no longer
hangs on macOS 11.1.  I can backport this, and other macOS fixes, to
8.0 once they get some more testing.