Oracle GoldenGate: Automated Recovery From TCP/IP Network Errors
Introduction
There is a little known Oracle GoldenGate (OGG) file, tcperrs, that comes with every release. This file contains settings that specify TCP error handling for OGG processes. In this article we shall present the default settings, present an example of usage, and demonstrate how to perform automated recovery from network related issues.
Main Article
Here is the tcperrs file that comes standard with every release of OGG:
[lpenton@TPA OGG_East]$ cat tcperrs # # TCP/IP error handling parameters # # Default error response is abend # # Error Response Delay (csecs) Max Retries ECONNABORTED RETRY 1000 10 #ECONNREFUSED ABORT 0 0 ECONNREFUSED RETRY 1000 12 ECONNRESET RETRY 500 10 ENETDOWN RETRY 3000 50 ENETRESET RETRY 1000 10 ENOBUFS RETRY 100 60 ENOTCONN RETRY 100 10 EPIPE RETRY 500 10 ESHUTDOWN RETRY 1000 10 ETIMEDOUT RETRY 1000 10 NODYNPORTS RETRY 100 10
TCP/IP errors are unique to each operating system. The tcperrs file lists errors that are common across all operating systems supported by OGG.
In the tcperrs file, the Error column lists the TCP/IP error name, Response details what action to take when the specific error is encountered (ABORT or RETRY), Max Retries denotes how many times the process should attempt to recover from the error, and Delay specifies how long the process should wait between recovery attempts. The delay specification is in centiseconds, which is 1/100th of a second.
Lines beginning with the # character are comments and are ignored when the file is read by OGG processes.
For a better understanding of what all of this means, lets look at one example:
# Error Response Delay (csecs) Max Retries ECONNREFUSED RETRY 1000 12
ECONNREFUSED is a fairly common error for new OGG installations. This error means the connection has been refused, and is typically encountered by the OGG Extract Data Pump when it attempts to communicate with a remote OGG Manager. As shown above, the default error handling settings for Extract Data Pump is to attempt twelve times to establish a connection to the target with a ten second delay between attempts. If the connection cannot be established within the 120 second timeframe, the Extract Data Pump will enter the ABEND state and human intervention will be required.
To demonstrate, lets change the ECONNREFUSED settings and attempt to connect an Extract Data Pump to a target OGG instance where no ports are available to receive incoming requests. The tcperrs file setting is to try to establish the connection two times, with a five second delay between attempts.
# Error Response Delay (csecs) Max Retries ECONNREFUSED RETRY 500 2
In GGSCI, start the Extract Data Pump.
GGSCI (TPA) 8> start p_abqa Sending START request to MANAGER ... EXTRACT P_ABQA starting
The process report file will contain the following showing the connection attempts:
2014-05-22 08:50:56 WARNING OGG-01223 TCP/IP error 111 (Connection refused), endpoint: abq:15051. 2014-05-22 08:51:31 ERROR OGG-01224 TCP/IP error 111 (Connection refused), endpoint: abq:15051; retries exceeded. 2014-05-22 08:51:31 ERROR OGG-01668 PROCESS ABENDING.
The system error log will also show the connection attempts:
May 22 08:50:56 localhost Oracle GoldenGate Capture for Oracle[3870]: 2014-05-22 08:50:56 WARNING OGG-01223 Oracle GoldenGate Capture for Oracle, p_abqa.prm: TCP/IP error 111 (Connection refused), endpoint: abq:15051. May 22 08:51:31 localhost Oracle GoldenGate Capture for Oracle[3870]: 2014-05-22 08:51:31 ERROR OGG-01224 Oracle GoldenGate Capture for Oracle, p_abqa.prm: TCP/IP error 111 (Connection refused), endpoint: abq:15051; retries exceeded. May 22 08:51:31 localhost Oracle GoldenGate Capture for Oracle[3870]: 2014-05-22 08:51:31 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, p_abqa.prm: PROCESS ABENDING.
From this example, we can see that the tcperrs file may be modified to allow OGG processes to perform automatic recovery from any TCP/IP communication error.
It is important to remember that the tcperrs file is overwritten during the Oracle GoldenGate upgrade process. If you modify the file, make a backup!
Now that we know how to handle TCP/IP errors, lets consider how we can use this along with the OGG ManagerAUTORESTART feature to perform automatic recovery from unanticipated network outages.
The OGG Manager has a parameter option AUTORESTART. This options is used to start one, or more Extract and Replicat processes after they fail. The default setting is to not auto restart and the parameter syntax is:
AUTORESTART[, RETRIES ] [, WAITMINUTES ] [, RESETMINUTES ]
To demonstrate this use, I have set my Manager process as follows:
autorestart extract p_abqa, retries 5, waitminutes 1, resetminutes 5
This setting will perform a restart of the P_ABQA Extract Data Pump should it go into the ABEND state. The OGG Manager will attempt to start the process five times, delaying one minute between start attempts. After the fifth start attempt, if the process enters the ABEND state, it will be left in that state. After five minutes, the restart counter will be reset to zero and Manager will attempt to start the process again.
Since my tcperrs file is set to perform automated recovery of network issues, my Extract Data Pump will not enter the ABENDstate until all recovery attempts are exhausted.
With these setting, the OGG Manager report file will show the following:
2014-05-22 09:31:44 INFO OGG-00963 Command received from GGSCI on host 127.0.0.1:37138 (START EXTRACT P_ABQA ). 2014-05-22 09:31:44 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:34:45 INFO OGG-01971 The previous message, 'INFO OGG-00975', repeated 1 times. 2014-05-22 09:34:45 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 09:37:45 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:37:45 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 09:40:45 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:40:45 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 09:43:36 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:43:36 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 09:46:36 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:46:36 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 09:52:36 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:52:36 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 09:55:36 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:55:36 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 09:58:37 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 09:58:37 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 10:01:37 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 10:01:37 INFO OGG-00965 EXTRACT P_ABQA restarted automatically. 2014-05-22 10:04:37 INFO OGG-00975 EXTRACT P_ABQA starting. 2014-05-22 10:04:37 INFO OGG-00965 EXTRACT P_ABQA restarted automatically.
The system error log will show the following:
May 22 09:31:44 localhost Oracle GoldenGate Command Interpreter for Oracle[3565]: 2014-05-22 09:31:44 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (lpenton): start p_abqa. May 22 09:31:44 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:31:44 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from GGSCI on host 127.0.0.1:37138 (START EXTRACT P_ABQA ). May 22 09:31:44 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:31:44 INFO OGG-00975 Oracle GoldenGate Manager for Oracle, mgr.prm: EXTRACT P_ABQA starting. May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45 INFO OGG-00992 Oracle GoldenGate Capture for Oracle, p_abqa.prm: EXTRACT P_ABQA starting. May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45 INFO OGG-03059 Oracle GoldenGate Capture for Oracle, p_abqa.prm: Operating system character set identified as UTF-8. May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45 INFO OGG-02695 Oracle GoldenGate Capture for Oracle, p_abqa.prm: ANSI SQL parameter syntax is used for parameter parsing. May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45 INFO OGG-01851 Oracle GoldenGate Capture for Oracle, p_abqa.prm: filecaching started: thread ID: 140368060078400. May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45 INFO OGG-01815 Oracle GoldenGate Capture for Oracle, p_abqa.prm: Virtual Memory Facilities for: COM anon alloc: mmap(MAP_ANON) anon free: munmap file alloc: mmap(MAP_SHARED) file free: munmap target directories: /home/lpenton/OGG/OGG_East/dirtmp. May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, p_abqa.prm: EXTRACT P_ABQA started. May 22 09:32:15 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:32:15 WARNING OGG-01223 Oracle GoldenGate Capture for Oracle, p_abqa.prm: TCP/IP error 111 (Connection refused), endpoint: abq:15051. May 22 09:32:50 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:32:50 ERROR OGG-01224 Oracle GoldenGate Capture for Oracle, p_abqa.prm: TCP/IP error 111 (Connection refused), endpoint: abq:15051; retries exceeded. May 22 09:32:50 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:32:50 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, p_abqa.prm: PROCESS ABENDING. May 22 09:34:45 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:34:45 INFO OGG-01971 Oracle GoldenGate Manager for Oracle, mgr.prm: The previous message, 'INFO OGG-00975', repeated 1 times. May 22 09:34:45 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:34:45 INFO OGG-00965 Oracle GoldenGate Manager for Oracle, mgr.prm: EXTRACT P_ABQA restarted automatically. May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45 INFO OGG-00992 Oracle GoldenGate Capture for Oracle, p_abqa.prm: EXTRACT P_ABQA starting. May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45 INFO OGG-03059 Oracle GoldenGate Capture for Oracle, p_abqa.prm: Operating system character set identified as UTF-8. May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45 INFO OGG-02695 Oracle GoldenGate Capture for Oracle, p_abqa.prm: ANSI SQL parameter syntax is used for parameter parsing. May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45 INFO OGG-01851 Oracle GoldenGate Capture for Oracle, p_abqa.prm: filecaching started: thread ID: 139742852102464. May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45 INFO OGG-01815 Oracle GoldenGate Capture for Oracle, p_abqa.prm: Virtual Memory Facilities for: COM anon alloc: mmap(MAP_ANON) anon free: munmap file alloc: mmap(MAP_SHARED) file free: munmap target directories: /home/lpenton/OGG/OGG_East/dirtmp. May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, p_abqa.prm: EXTRACT P_ABQA started. May 22 09:35:15 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:35:15 WARNING OGG-01223 Oracle GoldenGate Capture for Oracle, p_abqa.prm: TCP/IP error 111 (Connection refused), endpoint: abq:15051. May 22 09:35:50 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:35:50 ERROR OGG-01224 Oracle GoldenGate Capture for Oracle, p_abqa.prm: TCP/IP error 111 (Connection refused), endpoint: abq:15051; retries exceeded. May 22 09:35:50 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:35:50 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, p_abqa.prm: PROCESS ABENDING.
By using a combination of tcperrs file and Manager AUTORESTART settings, OGG Extract Data Pumps can recover from any network issue without human intervention.
Summary
In this article we presented the Oracle GoldenGate tcperrs file, demonstrated its use, and showed how Oracle GoldenGate may be customized by end users to provide automated recovery from TCP/IP communications errors.
No comments:
Post a Comment