I'm hoping some of you TCP experts out there can help me figure out
what seems to me to be strange behavior with TCP connections between
two processes running on the same computer. The basic problem is
that I'm seeing ETIMEDOUT errors on calls to write() between two
sockets that were communicating just fine moments earlier. Based on my
limited understanding of TCP, I never expected to see a
connection timeout between two processes running on the same box.
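
For concreteness, here is a minimal sketch (in Python for brevity) of how the error surfaces at the write call. The helper name write_error_name is made up for this post and is not my actual code; it just reports the symbolic errno of a failed write so that ETIMEDOUT can be told apart from things like EPIPE or ECONNRESET.

```python
import errno
import socket


def write_error_name(sock, payload):
    """Send payload; return None on success or the errno name on failure.

    Hypothetical helper for illustration only.
    """
    try:
        sock.sendall(payload)
        return None
    except OSError as exc:
        # A timed-out connection is reported with exc.errno == errno.ETIMEDOUT.
        return errno.errorcode.get(exc.errno, "unknown")


# Usage on a healthy local connection: the write succeeds.
a, b = socket.socketpair()
print(write_error_name(a, b"ping"))  # None
a.close()
b.close()
```

In the failing case the same call would return "ETIMEDOUT" instead of None.
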
The basic setup is a dual CPU system running a fairly large number of
processes (~60) that do most of their communication using TCP
connections. Over a twelve hour period, I might see about a dozen of
these ETIMEDOUT errors. The timeouts tend to occur in processes that
have the highest data rates, but even those are relatively low
(~1 mb/sec). I've attached below a snippet from tcpdump that exhibits the
problem. The pattern (the last 6 lines of the attached tcpdump
output) is basically:
  server sends packet N
  client ACKs packet N
  server sends packet N+1
  250 msec later server resends packet N+1
  client ACKs packet N again
  server resets the connection

Every instance of this problem I've managed to capture with tcpdump
exhibits this exact same behavior.

I see this behavior with both the 2.6.16 and 2.6.18 kernels. I'll be
trying 2.6.22 next.

So my questions are:
- Under what conditions would you expect to see ETIMEDOUT on a local
  TCP connection?
- Are there any kernel parameters I can tweak with sysctl that might
  alleviate the problem?
- Can you think of anything I could be doing wrong at the application
  level that would cause these timeouts?
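
In case it helps anyone compare settings, this is roughly how the current values could be dumped. It is only a sketch: I'm assuming net.ipv4.tcp_retries1 and net.ipv4.tcp_retries2 are the retransmission-related knobs worth looking at, which may well be wrong, and the function name read_tcp_sysctls is made up for this post.

```python
from pathlib import Path

# Assumption: these are the sysctls most related to retransmission timeouts.
CANDIDATES = ("tcp_retries1", "tcp_retries2")


def read_tcp_sysctls(names=CANDIDATES, base="/proc/sys/net/ipv4"):
    """Return {name: value string, or None if the file cannot be read}."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Missing file (non-Linux box) or no permission.
            values[name] = None
    return values


for name, value in read_tcp_sysctls().items():
    print(f"net.ipv4.{name} = {value}")
```

The same values can be read with "sysctl net.ipv4.tcp_retries2" from a shell; the script is just convenient for logging them alongside the application's own output.
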
Below is a slightly edited fragment from tcpdump. I shortened a few
fields (removed a common prefix from the timestamps and sequence
counters) because I have no idea how the Google Groups web interface
will wrap long lines.
:48.156263 49597 > 11007: . ack 21464 win 5 <nop,nop,timestamp 4819 4819>
:48.156655 11007 > 49597: P 21464:21624(160) ack 1 win 64 <nop,nop,timestamp 4820 4819>
:48.197066 49597 > 11007: . ack 21624 win 4 <nop,nop,timestamp 4860 4820>
:48.197083 11007 > 49597: P 21624:23248(1624) ack 1 win 64 <nop,nop,timestamp 4860 4860>
:48.237049 49597 > 11007: . ack 23248 win 1 <nop,nop,timestamp 4900 4860>
:48.674231 11007 > 49597: P 23248:23272(24) ack 1 win 64 <nop,nop,timestamp 5337 4900>
:48.674250 49597 > 11007: . ack 23272 win 1 <nop,nop,timestamp 5337 5337>
:48.674460 11007 > 49597: P 23272:23432(160) ack 1 win 64 <nop,nop,timestamp 5338 5337>
:48.899935 11007 > 49597: P 23272:23432(160) ack 1 win 64 <nop,nop,timestamp 5563 5337>
:50.210641 49597 > 11007: . ack 23272 win 41 <nop,nop,timestamp 6874 5563>
:50.210670 11007 > 49597: R 1517483895:1517483895(0) win 0
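
If anyone wants to check their own captures for the same pattern, here is a rough Python sketch that flags a data segment sent twice with the same sequence range and reports the gap between the two sends. The regular expression only understands the shortened line format used in this post (not raw tcpdump output), and the name find_retransmits is made up for illustration.

```python
import re

# Matches data-bearing lines in the shortened format used above, e.g.
# ":48.674460 11007 > 49597: P 23272:23432(160) ack 1 win 64"
SEGMENT = re.compile(
    r"^:(?P<ts>\d+\.\d+) (?P<src>\d+) > (?P<dst>\d+): "
    r"\S+ (?P<start>\d+):(?P<end>\d+)\((?P<length>\d+)\)"
)


def find_retransmits(lines):
    """Return [((src, dst, start, end), gap_seconds), ...] for repeated segments."""
    first_seen = {}
    retransmits = []
    for line in lines:
        m = SEGMENT.match(line.strip())
        # Skip pure ACKs (no sequence range) and zero-length segments such as RST.
        if m is None or int(m["length"]) == 0:
            continue
        key = (m["src"], m["dst"], m["start"], m["end"])
        ts = float(m["ts"])
        if key in first_seen:
            retransmits.append((key, round(ts - first_seen[key], 6)))
        else:
            first_seen[key] = ts
    return retransmits


# The last six lines of the capture, with the TCP options trimmed.
capture = [
    ":48.674231 11007 > 49597: P 23248:23272(24) ack 1 win 64",
    ":48.674250 49597 > 11007: . ack 23272 win 1",
    ":48.674460 11007 > 49597: P 23272:23432(160) ack 1 win 64",
    ":48.899935 11007 > 49597: P 23272:23432(160) ack 1 win 64",
    ":50.210641 49597 > 11007: . ack 23272 win 41",
    ":50.210670 11007 > 49597: R 1517483895:1517483895(0) win 0",
]

for key, gap in find_retransmits(capture):
    print(key, f"resent after {gap:.3f} s")
```

On the fragment above this flags the 23272:23432 segment, resent about 225 msec after the first send (48.899935 - 48.674460).
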
Any suggestions are greatly appreciated,
Thanks,
John Filo
filo.RemoveThis@arlut.utexas.edu