Thursday, December 17, 2015

OGG: Why Are the TCP Retries Longer Than the Time Specified in TCPERRS?


In this Document
Symptoms
Cause
Solution
References


APPLIES TO:

Oracle GoldenGate - Version 5.0.0 and later
Information in this document applies to any platform.

SYMPTOMS

  • Datapump is getting TCP errors. It is not retrying at the interval specified in TCPERRS.
    Note the approximately 30 second delays in the error messages below:
2011-05-23 16:56:53 WARNING OGG-01223 TCP/IP error 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.).

2011-05-23 16:57:25 WARNING OGG-01223 TCP/IP error 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.).

2011-05-23 16:57:56 WARNING OGG-01223 TCP/IP error 10060 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.).
  • The TCPERRS file specifies 10 seconds as the wait interval for this error (and 3 retries).
  • The listing below is a sample. Similar behaviour is expected with other TCP error return codes.
#
# TCP/IP error handling parameters
#
# Default error response is abend
#

# Error        Response  Delay (csecs)  Max Retries

ECONNABORTED   RETRY     1000           10
#ECONNREFUSED  ABEND     0              0
ECONNREFUSED   RETRY     1000           12
ECONNRESET     RETRY     500            10
ENETDOWN       RETRY     1000           3
ENETRESET      RETRY     1000           10
ENOBUFS        RETRY     100            60
ENOTCONN       RETRY     100            10
EPIPE          RETRY     500            10
ESHUTDOWN      RETRY     1000           10
ETIMEDOUT      RETRY     1000           10
NODYNPORTS     RETRY     100            10
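
The delay column is in centiseconds, so a value of 1000 means 10 seconds. The following is a minimal Python sketch, not part of GoldenGate, that reads entries in the format shown above and converts the delay to seconds. The names parse_tcperrs and TcpErrRule are illustrative assumptions, and the sample text is a subset of the listing.

```python
from typing import Dict, NamedTuple


class TcpErrRule(NamedTuple):
    error: str
    response: str
    delay_seconds: float
    max_retries: int


def parse_tcperrs(text: str) -> Dict[str, TcpErrRule]:
    """Parse TCPERRS-style lines: ERROR RESPONSE DELAY(csecs) MAX_RETRIES."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (lines starting with '#').
        if not line or line.startswith("#"):
            continue
        error, response, delay_csecs, max_retries = line.split()
        # The delay is given in centiseconds: 100 csecs = 1 second.
        rules[error] = TcpErrRule(
            error, response, int(delay_csecs) / 100, int(max_retries)
        )
    return rules


SAMPLE = """\
# Error Response Delay (csecs) Max Retries
#ECONNREFUSED ABEND 0 0
ECONNREFUSED RETRY 1000 12
ETIMEDOUT RETRY 1000 10
ENOBUFS RETRY 100 60
"""

rules = parse_tcperrs(SAMPLE)
print(rules["ETIMEDOUT"].delay_seconds)  # delay for ETIMEDOUT, in seconds
```

The commented-out ECONNREFUSED line is skipped, so only the active RETRY entry for that error is kept.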

CAUSE

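The interval between messages depends on how long the TCP/IP stack itself waits (its own timeout and internal retries) before it reports the error to OGG. The delay in TCPERRS only starts after OGG has received that error.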

SOLUTION

The time between Extract error messages is the time TCP/IP waits before returning ETIMEDOUT to Extract plus the delay that Extract then applies from TCPERRS.

The sequence of events would be something like:

1) issue the socket operation
2) after some period of time TCP/IP returns ETIMEDOUT
3) Extract logs the message
4) if retries are exceeded
       abend
   else
       Extract delays the specified amount of time
5) back to step 1
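
The steps above can be sketched as a small simulation. This is an illustration only, not Extract's actual code: the function name simulate_retries, the way retries are counted, and the 21 second TCP timeout are assumptions chosen to roughly match the log in the Symptoms section. No real sockets are used; a simple counter stands in for the clock.

```python
def simulate_retries(tcp_timeout_s, delay_s, max_retries, start_s=0.0):
    """Return (log_times, outcome) for a connection that always times out.

    tcp_timeout_s: how long the TCP/IP stack waits before returning ETIMEDOUT
                   (assumed value; set by the operating system, not by OGG).
    delay_s:       the TCPERRS delay, converted to seconds.
    """
    clock = start_s
    log_times = []
    retries = 0
    while True:
        clock += tcp_timeout_s       # steps 1-2: socket operation, then ETIMEDOUT
        log_times.append(clock)      # step 3: the error message is logged
        if retries >= max_retries:   # step 4: retries exceeded, so abend
            return log_times, "abend"
        retries += 1
        clock += delay_s             # step 4 (else): wait the TCPERRS delay
        # step 5: loop back to step 1


times, outcome = simulate_retries(tcp_timeout_s=21.0, delay_s=10.0, max_retries=3)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print(gaps)     # every gap equals tcp_timeout_s + delay_s
print(outcome)
```

Whatever the TCP timeout turns out to be on a given system, the gap between logged messages is that timeout plus the TCPERRS delay, never the TCPERRS delay alone.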

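This suggests that the datapump sends a message, TCP tries internally (and possibly retries), and eventually times out.
The pump receives the timeout (in the current example TCP appears to try at about 0, 10, and 20 seconds, that is, 3 attempts), waits the 10 seconds specified in TCPERRS, and then retries.
The cumulative time between messages is therefore about 30 seconds: roughly 20 seconds spent inside TCP plus the 10 second TCPERRS delay.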
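
As a check on the arithmetic, the intervals can be computed directly from the three warning timestamps in the Symptoms section. This is a minimal Python sketch; subtracting the 10 second TCPERRS delay to estimate the time spent inside TCP/IP assumes that nothing else contributes to the gap.

```python
from datetime import datetime

# Timestamps of the three OGG-01223 warnings shown in the Symptoms section.
timestamps = [
    "2011-05-23 16:56:53",
    "2011-05-23 16:57:25",
    "2011-05-23 16:57:56",
]
configured_delay_s = 1000 / 100  # 1000 csecs from TCPERRS = 10 seconds

parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]

# Seconds between consecutive warnings.
intervals = [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]

# Estimated time spent waiting inside TCP/IP before ETIMEDOUT was returned.
tcp_wait = [i - configured_delay_s for i in intervals]

print(intervals)  # [32.0, 31.0]
print(tcp_wait)   # [22.0, 21.0]
```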
