I see some failures, at least in gdb.multi/multi-re-run.exp and
gdb.threads/interrupted-hand-call.exp. Running `stress -C $(nproc)` at
the same time as the test makes those failures relatively frequent.
Let's take gdb.multi/multi-re-run.exp as an example. The failure looks
like this, an unexpected "no resumed":
continue
Continuing.
No unwaited-for children left.
(gdb) FAIL: gdb.multi/multi-re-run.exp: re_run_inf=2: iter=1: continue until exit
The situation is:
- Inferior 1 is stopped somewhere, it won't really play a role here.
- Inferior 2 has 2 threads, both stopped.
- We resume inferior 2, the leader thread is expected to exit, making
the process exit.
From GDB's perspective, a failing run looks like this:
[infrun] fetch_inferior_event: enter
[infrun] scoped_disable_commit_resumed: reason=handling event
[infrun] do_target_wait: Found 2 inferiors, starting at #1
[infrun] random_pending_event_thread: None found.
[remote] wait: enter
[remote] Packet received: T0506:20dcffffff7f0000;07:20dcffffff7f0000;10:9551555555550000;thread:pae4cd.ae4cd;core:e;
[remote] wait: exit
[infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
[infrun] print_target_wait_results: 713933.713933.0 [Thread 713933.713933],
[infrun] print_target_wait_results: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
[infrun] handle_inferior_event: status->kind = STOPPED, sig = GDB_SIGNAL_TRAP
[infrun] clear_step_over_info: clearing step over info
[infrun] context_switch: Switching context from 0.0.0 to 713933.713933.0
[infrun] handle_signal_stop: stop_pc=0x555555555195
[infrun] start_step_over: enter
[infrun] start_step_over: stealing global queue of threads to step, length = 0
[infrun] operator(): step-over queue now empty
[infrun] start_step_over: exit
[infrun] process_event_stop_test: no stepping, continue
[remote] Sending packet: $Z0,555555555194,1#8e
[remote] Packet received: OK
[infrun] resume_1: step=0, signal=GDB_SIGNAL_0, trap_expected=0, current thread [713933.713933.0] at 0x555555555195
[remote] Sending packet: $QPassSignals:e;10;14;17;1a;1b;1c;21;24;25;2c;4c;97;#0a
[remote] Packet received: OK
[remote] Sending packet: $vCont;c:pae4cd.-1#9f
[infrun] prepare_to_wait: prepare_to_wait
[infrun] reset: reason=handling event
[infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] fetch_inferior_event: exit
[infrun] fetch_inferior_event: enter
[infrun] scoped_disable_commit_resumed: reason=handling event
[infrun] do_target_wait: Found 2 inferiors, starting at #0
[infrun] random_pending_event_thread: None found.
[remote] wait: enter
[remote] Packet received: N
[remote] wait: exit
[infrun] print_target_wait_results: target_wait (-1.0.0 [process -1], status) =
[infrun] print_target_wait_results: -1.0.0 [process -1],
[infrun] print_target_wait_results: status->kind = NO_RESUMED
[infrun] handle_inferior_event: status->kind = NO_RESUMED
[remote] Sending packet: $Hgp0.0#ad
[remote] Packet received: OK
[remote] Sending packet: $qXfer:threads:read::0,1000#92
[remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="40b6c6f7ff7f0000"/>\n</threads>\n
[infrun] stop_waiting: stop_waiting
[remote] Sending packet: $qXfer:threads:read::0,1000#92
[remote] Packet received: l<threads>\n<thread id="pae4cb.ae4cb" core="3" name="multi-re-run-1" handle="40c7c6f7ff7f0000"/>\n<thread id="pae4cb.ae4cc" core="2" name="multi-re-run-1" handle="40b6c6f7ff7f0000"/>\n<thread id="pae4cd.ae4ce" core="1" name="multi-re-run-2" handle="40b6c6f7ff7f0000"/>\n</threads>\n
[infrun] infrun_async: enable=0
[infrun] reset: reason=handling event
[infrun] maybe_set_commit_resumed_all_targets: enabling commit-resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] maybe_call_commit_resumed_all_targets: calling commit_resumed for target extended-remote
[infrun] fetch_inferior_event: exit
We can see that we resume the inferior with vCont;c, but get NO_RESUMED.
When the test passes, we get an EXITED status to indicate the process
has exited.
From GDBserver's point of view, it looks like this. The logs contain
some logging that I added and that is part of this patch.
[remote] getpkt: getpkt ("vCont;c:pae4cf.-1"); [no ack sent]
[threads] resume: enter
[threads] thread_needs_step_over: Need step over [LWP 713931]? Ignoring, should remain stopped
[threads] thread_needs_step_over: Need step over [LWP 713932]? Ignoring, should remain stopped
[threads] get_pc: pc is 0x555555555195
[threads] thread_needs_step_over: Need step over [LWP 713935]? No, no breakpoint found at 0x555555555195
[threads] get_pc: pc is 0x7ffff7d35a95
[threads] thread_needs_step_over: Need step over [LWP 713936]? No, no breakpoint found at 0x7ffff7d35a95
[threads] resume: Resuming, no pending status or step over needed
[threads] resume_one_thread: resuming LWP 713935
[threads] proceed_one_lwp: lwp 713935
[threads] resume_one_lwp_throw: continue from pc 0x555555555195
[threads] resume_one_lwp_throw: Resuming lwp 713935 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: NOW ptid=713935.713935.0 stopped=0 resumed=0
[threads] resume_one_thread: resuming LWP 713936
[threads] proceed_one_lwp: lwp 713936
[threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
[threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
[threads] resume: exit
[threads] wait_1: enter
[threads] wait_1: [<all threads>]
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
[threads] resume_stopped_resumed_lwps: resuming stopped-resumed LWP LWP 713935.713936 at 7ffff7d35a95: step=0
[threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
[threads] resume_one_lwp_throw: Resuming lwp 713936 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
[threads] operator(): check_zombie_leaders: leader_pid=713931, leader_lp!=NULL=1, num_lwps=2, zombie=0
[threads] operator(): check_zombie_leaders: leader_pid=713935, leader_lp!=NULL=1, num_lwps=2, zombie=1
[threads] operator(): Thread group leader 713935 zombie (it exited, or another thread execd).
[threads] delete_lwp: deleting 713935
[threads] wait_for_event_filtered: exit (no unwaited-for LWP)
sigchld_handler
[threads] wait_1: ret = null_ptid, TARGET_WAITKIND_NO_RESUMED
[threads] wait_1: exit
What happens is:
- We resume the leader (713935) successfully.
- The leader exits.
- We try to resume the secondary thread (713936) and get ESRCH. This is
expected, since the leader has exited.
- resume_one_lwp_throw throws, it's caught by resume_one_lwp.
- resume_one_lwp checks with check_ptrace_stopped_lwp_gone that the
failure can be explained by the LWP becoming zombie, and swallows the
error.
- Note that this means that the secondary lwp still has stopped==1.
- wait_1 is called, probably because linux_process_target::resume marks
the async pipe at the end.
- The exit event isn't ready yet, probably because the machine is under
load, so waitpid returns nothing.
- check_zombie_leaders detects that the leader is zombie and deletes it.
- We try to find a resumed (non-stopped) LWP to get an event from,
there's none since the leader (that was resumed) is now deleted, and
the secondary thread is still marked stopped.
wait_for_event_filtered returns -1, causing wait_1 to return
NO_RESUMED.
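The "waitpid returns nothing" step can be reproduced outside of GDBserver. Below is a minimal sketch using plain fork/waitpid on a POSIX system, with no ptrace involved; `poll_then_reap` is a hypothetical name used only for this illustration. The child is held back with a pipe so that the first, non-blocking waitpid is guaranteed to run before any exit status exists.

```python
import os


def poll_then_reap():
    """Poll a child before it has exited, then reap it once it has."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: block until the parent closes its write end, then exit.
        os.close(w)
        os.read(r, 1)
        os._exit(0)
    os.close(r)
    # The child is still blocked, so no status is available: (0, 0).
    early = os.waitpid(pid, os.WNOHANG)
    os.close(w)  # the child's read now sees EOF and the child exits
    reaped_pid, status = os.waitpid(pid, 0)  # blocking wait gets the exit
    return early, reaped_pid == pid, os.WEXITSTATUS(status)
```

A non-blocking poll that comes back empty only says that no status is ready yet. It does not say that no status will ever arrive, which is the distinction the failing run gets wrong.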
What I notice here is that there is some kind of race between the
availability of the process' exit notification and the call to wait_1
that results from marking the async pipe at the end of resume.
I think what we want from this wait_1 invocation is to keep waiting, as
we will eventually get thread exit notifications for both of our
threads.
The fix I came up with is to mark the secondary thread as !stopped (or
resumed) when we fail to resume it. This makes wait_1 see that there is
at least one resumed LWP, so it won't return NO_RESUMED. I think it
makes sense to consider it resumed, because we are going to receive an
exit event for it. Here are the GDBserver logs with the fix applied:
[threads] resume: enter
[threads] thread_needs_step_over: Need step over [LWP 724595]? Ignoring, should remain stopped
[threads] thread_needs_step_over: Need step over [LWP 724596]? Ignoring, should remain stopped
[threads] get_pc: pc is 0x555555555195
[threads] thread_needs_step_over: Need step over [LWP 724597]? No, no breakpoint found at 0x555555555195
[threads] get_pc: pc is 0x7ffff7d35a95
[threads] thread_needs_step_over: Need step over [LWP 724598]? No, no breakpoint found at 0x7ffff7d35a95
[threads] resume: Resuming, no pending status or step over needed
[threads] resume_one_thread: resuming LWP 724597
[threads] proceed_one_lwp: lwp 724597
[threads] resume_one_lwp_throw: continue from pc 0x555555555195
[threads] resume_one_lwp_throw: Resuming lwp 724597 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: NOW ptid=724597.724597.0 stopped=0 resumed=0
[threads] resume_one_thread: resuming LWP 724598
[threads] proceed_one_lwp: lwp 724598
[threads] resume_one_lwp_throw: continue from pc 0x7ffff7d35a95
[threads] resume_one_lwp_throw: Resuming lwp 724598 (continue, signal 0, stop not expected)
[threads] resume_one_lwp_throw: ptrace errno = 3 (No such process)
[threads] resume: exit
[threads] wait_1: enter
[threads] wait_1: [<all threads>]
sigchld_handler
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
[threads] operator(): check_zombie_leaders: leader_pid=724595, leader_lp!=NULL=1, num_lwps=2, zombie=0
[threads] operator(): check_zombie_leaders: leader_pid=724597, leader_lp!=NULL=1, num_lwps=2, zombie=1
[threads] operator(): Thread group leader 724597 zombie (it exited, or another thread execd).
[threads] delete_lwp: deleting 724597
[threads] wait_for_event_filtered: sigsuspend'ing
sigchld_handler
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 724598, ERRNO-OK
[threads] wait_for_event_filtered: waitpid 724598 received 0 (exited)
[threads] filter_event: 724598 exited
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 724597, ERRNO-OK
[threads] wait_for_event_filtered: waitpid 724597 received 0 (exited)
[threads] wait_for_event_filtered: waitpid(-1, ...) returned 0, ERRNO-OK
sigchld_handler
[threads] wait_1: ret = LWP 724597.724598, exited with retcode 0
[threads] wait_1: exit
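The difference between the failing and the fixed runs can be summarized with a toy model. This is only a sketch of the reasoning above, not GDBserver code: the `Lwp` class, `resume_one`, `wait_once`, `scenario` and the `mark_resumed_on_esrch` switch are all made up for the illustration, and the PIDs are the ones from the failing log.

```python
from dataclasses import dataclass


@dataclass
class Lwp:
    pid: int
    stopped: bool = True  # True while we consider the LWP ptrace-stopped
    alive: bool = True    # False once the kernel task has exited


def resume_one(lwp, mark_resumed_on_esrch):
    """Model one resume attempt; a dead LWP stands in for ptrace ESRCH."""
    if lwp.alive:
        lwp.stopped = False
    elif mark_resumed_on_esrch:
        # Modeled fix: an exit event is still coming for this LWP.
        lwp.stopped = False
    # Without the fix the error is swallowed and stopped stays True.


def wait_once(lwps, leader_pid):
    """One wait pass during which no exit status is available yet."""
    # Like check_zombie_leaders: forget a leader that has already exited.
    remaining = [l for l in lwps if l.alive or l.pid != leader_pid]
    if any(not l.stopped for l in remaining):
        return "KEEP_WAITING"
    return "NO_RESUMED"


def scenario(mark_resumed_on_esrch):
    inferior1 = [Lwp(713931), Lwp(713932)]  # stays stopped throughout
    leader, secondary = Lwp(713935), Lwp(713936)
    resume_one(leader, mark_resumed_on_esrch)
    leader.alive = secondary.alive = False  # leader exits, process goes away
    resume_one(secondary, mark_resumed_on_esrch)
    return wait_once(inferior1 + [leader, secondary], leader.pid)
```

In this model `scenario(False)` yields "NO_RESUMED", matching the failing log, while `scenario(True)` yields "KEEP_WAITING", matching the log with the fix applied.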
Change-Id: Idf0bdb4cb0313f1b49e4864071650cc83fb3c100
README for GDBserver & GDBreplay
by Stu Grossman and Fred Fish
Introduction:
This is GDBserver, a remote server for Un*x-like systems. It can be used to
control the execution of a program on a target system from a GDB on a different
host. GDB and GDBserver communicate using the standard remote serial protocol.
They communicate via either a serial line or a TCP connection.
For more information about GDBserver, see the GDB manual:
https://sourceware.org/gdb/current/onlinedocs/gdb/Remote-Protocol.html
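As a quick illustration of the remote serial protocol framing, a packet is sent as `$payload#xx`, where `xx` is the sum of the payload bytes modulo 256 written as two hex digits. The sketch below is not taken from GDB or GDBserver; `frame_packet` and `checksum_ok` are hypothetical helper names, and the sketch assumes an ASCII payload and ignores escaping and run-length encoding.

```python
def frame_packet(payload: str) -> str:
    """Wrap an ASCII payload as '$payload#xx' with its modulo-256 checksum."""
    checksum = sum(payload.encode("ascii")) % 256
    return f"${payload}#{checksum:02x}"


def checksum_ok(packet: str) -> bool:
    """Check a '$payload#xx' packet against its two-digit hex checksum."""
    if len(packet) < 4 or packet[0] != "$" or packet[-3] != "#":
        return False
    payload, digits = packet[1:-3], packet[-2:]
    return frame_packet(payload)[-2:] == digits.lower()


print(frame_packet("OK"))      # $OK#9a
print(frame_packet("Hgp0.0"))  # $Hgp0.0#ad
```

This can be handy when reading packet dumps by hand, for example to spot a line that was truncated in a log.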
Usage (server (target) side):
First, you need to have a copy of the program you want to debug put onto
the target system. The program can be stripped to save space if needed, as
GDBserver doesn't care about symbols. All symbol handling is taken care of by
the GDB running on the host system.
To use the server, you log on to the target system, and run the `gdbserver'
program. You must tell it (a) how to communicate with GDB, (b) the name of
your program, and (c) its arguments. The general syntax is:
target> gdbserver COMM PROGRAM [ARGS ...]
For example, using a serial port, you might say:
target> gdbserver /dev/com1 emacs foo.txt
This tells GDBserver to debug emacs with an argument of foo.txt, and to
communicate with GDB via /dev/com1. GDBserver now waits patiently for the
host GDB to communicate with it.
To use a TCP connection, you could say:
target> gdbserver host:2345 emacs foo.txt
This says pretty much the same thing as the last example, except that we are
going to communicate with the host GDB via TCP. The `host:2345' argument means
that we are expecting to see a TCP connection to local TCP port 2345.
(Currently, the `host' part is ignored.) You can choose any number you want for
the port number as long as it does not conflict with any existing TCP ports on
the target system. This same port number must be used in the host GDB's
`target remote' command, which will be described shortly. Note that if you choose
a port number that conflicts with another service, GDBserver will print an error
message and exit.
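If you want to check for such a conflict before starting GDBserver, something like the following can be used. This is a minimal sketch, assuming Python is available on the target; `port_is_free` is a hypothetical helper and not part of GDBserver. The answer can also go stale if another program takes the port between the check and the launch.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "Address already in use" or a permission error.
            return False
    return True
```

For example, `port_is_free(2345)` tells you whether the port used in the example above could be bound at the time of the call.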
On some targets, GDBserver can also attach to running programs. This is
accomplished via the --attach argument. The syntax is:
target> gdbserver --attach COMM PID
PID is the process ID of a currently running process. It isn't necessary
to point GDBserver at a binary for the running process.
Usage (host side):
You need an unstripped copy of the target program on your host system, since
GDB needs to examine its symbol tables and such. Start up GDB as you normally
would, with the target program as the first argument. (You may need to use the
--baud option if the serial line is running at anything except 9600 baud.)
I.e., `gdb TARGET-PROG', or `gdb --baud BAUD TARGET-PROG'. After that, the only
new command you need to know about is `target remote'. Its argument is either
a device name (usually a serial device, like `/dev/ttyb'), or a HOST:PORT
descriptor. For example:
(gdb) target remote /dev/ttyb
communicates with the server via serial line /dev/ttyb, and:
(gdb) target remote the-target:2345
communicates via a TCP connection to port 2345 on host `the-target', where
you previously started up GDBserver with the same port number. Note that for
TCP connections, you must start up GDBserver prior to using the `target remote'
command, otherwise you may get an error that looks something like
`Connection refused'.
Building GDBserver:
See the `configure.srv` file for the list of host triplets you can build
GDBserver for.
Building GDBserver for your host is very straightforward. If you build
GDB natively on a host which GDBserver supports, it will be built
automatically when you build GDB. You can also build just GDBserver:
% mkdir obj
% cd obj
% path-to-toplevel-sources/configure --disable-gdb
% make all-gdbserver
(If you have a combined binutils+gdb tree, you may want to also
disable other directories when configuring, e.g., binutils, gas, gold,
gprof, and ld.)
If you prefer to cross-compile to your target, then you can also build
GDBserver that way. For example:
% export CC=your-cross-compiler
% path-to-toplevel-sources/configure --disable-gdb
% make all-gdbserver
Using GDBreplay:
A special hacked down version of GDBserver can be used to replay remote
debug log files created by GDB. Before using the GDB "target" command to
initiate a remote debug session, use "set remotelogfile <filename>" to tell
GDB that you want to make a recording of the serial or TCP session. Note
that when replaying the session, GDB communicates with GDBreplay via TCP,
regardless of whether the original session was via a serial link or TCP.
Once you are done with the remote debug session, start GDBreplay and
tell it the name of the log file and the host and port number that GDB
should connect to (typically the same as the host running GDB):
$ gdbreplay logfile host:port
Then start GDB (preferably in a different screen or window) and use the
"target" command to connect to GDBreplay:
(gdb) target remote host:port
Repeat the same sequence of user commands to GDB that you gave in the
original debug session. GDB should not be able to tell that it is talking
to GDBreplay rather than a real target, all other things being equal. Note
that GDBreplay echoes the command lines to stderr, as well as the contents of
the packets it sends and receives. The last command echoed by GDBreplay is
the next command that needs to be typed to GDB to continue the session in
sync with the original session.