Wednesday, January 6, 2016

03. Automated Recovery From TCP/IP Network Errors



Oracle GoldenGate: Automated Recovery From TCP/IP Network Errors

Introduction
There is a little known Oracle GoldenGate (OGG) file, tcperrs, that comes with every release. This file contains settings that specify TCP error handling for OGG processes. In this article we shall present the default settings, present an example of usage, and demonstrate how to perform automated recovery from network related issues.
Main Article
 Here is the tcperrs file that comes standard with every release of OGG:
[lpenton@TPA OGG_East]$ cat tcperrs
# 
# TCP/IP error handling parameters
#
# Default error response is abend
#
# Error           Response    Delay (csecs)    Max Retries
ECONNABORTED      RETRY        1000               10
#ECONNREFUSED     ABORT           0                0
ECONNREFUSED      RETRY        1000               12
ECONNRESET        RETRY         500               10
ENETDOWN          RETRY        3000               50
ENETRESET         RETRY        1000               10
ENOBUFS           RETRY         100               60
ENOTCONN          RETRY         100               10
EPIPE             RETRY         500               10
ESHUTDOWN         RETRY        1000               10
ETIMEDOUT         RETRY        1000               10
NODYNPORTS        RETRY         100               10
TCP/IP errors are unique to each operating system. The tcperrs file lists errors that are common across all operating systems supported by OGG.
In the tcperrs file, the Error column lists the TCP/IP error name, Response details what action to take when the specific error is encountered (ABORT or RETRY), Max Retries denotes how many times the process should attempt to recover from the error, and Delay specifies how long the process should wait between recovery attempts. The delay specification is in centiseconds, which is 1/100th of a second.
Lines beginning with the # character are comments and are ignored when the file is read by OGG processes.
For a better understanding of what all of this means, lets look at one example:
# Error           Response    Delay (csecs)      Max Retries
ECONNREFUSED      RETRY        1000               12
ECONNREFUSED is a fairly common error for new OGG installations. This error means the connection has been refused, and is typically encountered by the OGG Extract Data Pump when it attempts to communicate with a remote OGG Manager. As shown above, the default error handling settings for Extract Data Pump is to attempt twelve times to establish a connection to the target with a ten second delay between attempts. If the connection cannot be established within the 120 second timeframe, the Extract Data Pump will enter the ABEND state and human intervention will be required.
To demonstrate, lets change the ECONNREFUSED settings and attempt to connect an Extract Data Pump to a target OGG instance where no ports are available to receive incoming requests. The tcperrs file setting is to try to establish the connection two times, with a five second delay between attempts.
# Error           Response    Delay (csecs)      Max Retries
ECONNREFUSED      RETRY        500                2
In GGSCI, start the Extract Data Pump.
GGSCI (TPA) 8> start p_abqa

Sending START request to MANAGER ...
EXTRACT P_ABQA starting
The process report file will contain the following showing the connection attempts:
2014-05-22 08:50:56  WARNING OGG-01223  TCP/IP error 111 (Connection refused), endpoint: abq:15051.
2014-05-22 08:51:31  ERROR   OGG-01224  TCP/IP error 111 (Connection refused), 
      endpoint: abq:15051; retries exceeded.
2014-05-22 08:51:31  ERROR   OGG-01668  PROCESS ABENDING.
The system error log will also show the connection attempts:
May 22 08:50:56 localhost Oracle GoldenGate Capture for Oracle[3870]: 2014-05-22 08:50:56  
      WARNING OGG-01223  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
      TCP/IP error 111 (Connection refused), endpoint: abq:15051.
May 22 08:51:31 localhost Oracle GoldenGate Capture for Oracle[3870]: 2014-05-22 08:51:31  
      ERROR   OGG-01224  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
      TCP/IP error 111 (Connection refused), endpoint: abq:15051; retries exceeded.
May 22 08:51:31 localhost Oracle GoldenGate Capture for Oracle[3870]: 2014-05-22 08:51:31  
      ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  PROCESS ABENDING.
From this example, we can see that the tcperrs file may be modified to allow OGG processes to perform automatic recovery from any TCP/IP communication error.
It is important to remember that the tcperrs file is overwritten during the Oracle GoldenGate upgrade process. If you modify the file, make a backup!
Now that we know how to handle TCP/IP errors, lets consider how we can use this along with the OGG ManagerAUTORESTART feature to perform automatic recovery from unanticipated network outages.
The OGG Manager has a parameter option AUTORESTART. This options is used to start one, or more Extract and Replicat processes after they fail. The default setting is to not auto restart and the parameter syntax is:
AUTORESTART  
[, RETRIES ]
[, WAITMINUTES ]
[, RESETMINUTES ]
To demonstrate this use, I have set my Manager process as follows:
autorestart extract p_abqa, retries 5, waitminutes 1, resetminutes 5
This setting will perform a restart of the P_ABQA Extract Data Pump should it go into the ABEND state. The OGG Manager will attempt to start the process five times, delaying one minute between start attempts. After the fifth start attempt, if the process enters the ABEND state, it will be left in that state. After five minutes, the restart counter will be reset to zero and Manager will attempt to start the process again.
Since my tcperrs file is set to perform automated recovery of network issues, my Extract Data Pump will not enter the ABENDstate until all recovery attempts are exhausted.
With these setting, the OGG Manager report file will show the following:
2014-05-22 09:31:44  INFO    OGG-00963  Command received from GGSCI on host 127.0.0.1:37138 
     (START EXTRACT P_ABQA ).
2014-05-22 09:31:44  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:34:45  INFO    OGG-01971  The previous message, 'INFO OGG-00975', repeated 1 times.
2014-05-22 09:34:45  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 09:37:45  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:37:45  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 09:40:45  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:40:45  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 09:43:36  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:43:36  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 09:46:36  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:46:36  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 09:52:36  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:52:36  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 09:55:36  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:55:36  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 09:58:37  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 09:58:37  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 10:01:37  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 10:01:37  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
2014-05-22 10:04:37  INFO    OGG-00975  EXTRACT P_ABQA starting.
2014-05-22 10:04:37  INFO    OGG-00965  EXTRACT P_ABQA restarted automatically.
The system error log will show the following:
May 22 09:31:44 localhost Oracle GoldenGate Command Interpreter for Oracle[3565]: 
     2014-05-22 09:31:44  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  
     GGSCI command (lpenton): start p_abqa.
May 22 09:31:44 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:31:44  
     INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  
     Command received from GGSCI on host 127.0.0.1:37138 (START EXTRACT P_ABQA ).
May 22 09:31:44 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:31:44  
     INFO    OGG-00975  Oracle GoldenGate Manager for Oracle, mgr.prm:  EXTRACT P_ABQA starting.
May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45  
     INFO    OGG-00992  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  EXTRACT P_ABQA starting.
May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45  
     INFO    OGG-03059  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     Operating system character set identified as UTF-8.
May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45  
     INFO    OGG-02695  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     ANSI SQL parameter syntax is used for parameter parsing.
May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45  
     INFO    OGG-01851  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     filecaching started: thread ID: 140368060078400.
May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45  
     INFO    OGG-01815  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     Virtual Memory Facilities for: COM     anon alloc: mmap(MAP_ANON)  anon free: munmap     
     file alloc: mmap(MAP_SHARED)  file free: munmap     
     target directories:     /home/lpenton/OGG/OGG_East/dirtmp.
May 22 09:31:45 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:31:45  
     INFO    OGG-00993  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  EXTRACT P_ABQA started.
May 22 09:32:15 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:32:15  
     WARNING OGG-01223  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     TCP/IP error 111 (Connection refused), endpoint: abq:15051.
May 22 09:32:50 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:32:50  
     ERROR   OGG-01224  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     TCP/IP error 111 (Connection refused), endpoint: abq:15051; retries exceeded.
May 22 09:32:50 localhost Oracle GoldenGate Capture for Oracle[4356]: 2014-05-22 09:32:50  
     ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  PROCESS ABENDING.
May 22 09:34:45 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:34:45  
     INFO    OGG-01971  Oracle GoldenGate Manager for Oracle, mgr.prm:  
     The previous message, 'INFO OGG-00975', repeated 1 times.
May 22 09:34:45 localhost Oracle GoldenGate Manager for Oracle[4349]: 2014-05-22 09:34:45  
     INFO    OGG-00965  Oracle GoldenGate Manager for Oracle, mgr.prm:  
     EXTRACT P_ABQA restarted automatically.
May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45  
     INFO    OGG-00992  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  EXTRACT P_ABQA starting.
May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45  
     INFO    OGG-03059  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     Operating system character set identified as UTF-8.
May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45  
     INFO    OGG-02695  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     ANSI SQL parameter syntax is used for parameter parsing.
May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45  
     INFO    OGG-01851  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     filecaching started: thread ID: 139742852102464.
May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45  
     INFO    OGG-01815  Oracle GoldenGate Capture for Oracle, p_abqa.prm: 
     Virtual Memory Facilities for: COM     anon alloc: mmap(MAP_ANON)  anon free: munmap     
     file alloc: mmap(MAP_SHARED)  file free: munmap     
     target directories:     /home/lpenton/OGG/OGG_East/dirtmp.
May 22 09:34:45 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:34:45  
     INFO    OGG-00993  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  EXTRACT P_ABQA started.
May 22 09:35:15 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:35:15  
     WARNING OGG-01223  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     TCP/IP error 111 (Connection refused), endpoint: abq:15051.
May 22 09:35:50 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:35:50  
     ERROR   OGG-01224  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  
     TCP/IP error 111 (Connection refused), endpoint: abq:15051; retries exceeded.
May 22 09:35:50 localhost Oracle GoldenGate Capture for Oracle[4393]: 2014-05-22 09:35:50  
     ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, p_abqa.prm:  PROCESS ABENDING.
By using a combination of tcperrs file and Manager AUTORESTART settings, OGG Extract Data Pumps can recover from any network issue without human intervention.

Summary

In this article we presented the Oracle GoldenGate tcperrs file, demonstrated its use, and showed how Oracle GoldenGate may be customized by end users to provide automated recovery from TCP/IP communications errors.

No comments: