Thursday, December 17, 2015

OGG Troubleshooting TCP/IP Errors In Open Systems



APPLIES TO:

Oracle GoldenGate - Version 4.0.0 and later
Information in this document applies to any platform.

***Checked for relevance on 21-Mar-2013***

GOAL

Troubleshooting TCP/IP Errors In Open Systems
This note supersedes 966097.1.
Oracle GoldenGate Extract opens TCP/IP connections for two purposes:
to communicate with the Manager, or to send trail data to a Server/Collector process. Both can return TCP errors.
Case 1.
Starting and stopping Manager from GGSCI gives the following error:

GGSCI > stop mgr
Manager process is required by other GGS processes.
Are you sure you want to stop it (y/n)? y

Sending STOP request to MANAGER...
ERROR: opening port for MGR MGR (Could not establish host TCP/IP address).

Case 2.
Error starting Extract:

2006-10-06 11:28:15 GGS WARNING 150 TCP/IP error 146 (Connection refused).
2006-10-06 11:28:25 GGS INFO 406 Socket buffer size set to 27985 (flush size 27985).

2006-10-06 11:28:25 GGS ERROR 150 TCP/IP error 146 (Connection refused); retries exceeded.
2006-10-06 11:28:25 GGS ERROR 190 PROCESS ABENDING.

SOLUTION

Extract may connect to a remote MGR to request that it start a "dynamic" Server/Collector process, in which case the MGR responds with the port number it assigned to the process it started, or Extract may connect directly to a "static" Server/Collector process.
GGS WARNING 150 TCP/IP error 111 (Connection refused)
"Connection refused" indicates one of the following:
No application running on the remote system is listening to the specified port:
    The MGR process is not running.
    The "static" Server/Collector process is not running.
    The "dynamic" Server/Collector process was slow to start (Extract should recover by retrying the connection).
    The "dynamic" Server/Collector process failed to start.
    The "dynamic" Server/Collector process terminated immediately after starting.
The application running on the remote system is listening to the specified port, but the connection request queue is full:
    Too many processes are opening connections to the specified port at the same time.
    Extract is trying to connect to a Server/Collector port that already has an established connection.
A firewall is refusing unauthorized connection requests.
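The numbers in these messages (for example 111 and 146 for "Connection refused") appear to be operating-system errno values, which would explain why the same condition shows different numbers on different platforms. As a minimal sketch, assuming Python is available on the host that logged the error, the standard errno module can translate a number into its symbolic name for that platform. The function name describe_errno is illustrative only and is not part of GoldenGate.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    try:
        message = os.strerror(code)
    except (ValueError, OverflowError):
        message = "no message available"
    return name, message


if __name__ == "__main__":
    # The numeric values differ between operating systems, so look the
    # symbolic constants up locally instead of hard-coding 111 or 146.
    for code in (errno.ECONNREFUSED, errno.ECONNRESET, errno.ETIMEDOUT):
        print(code, *describe_errno(code))
```

Run this on the system that produced the message; a number taken from another platform's log may map to a different name locally.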
GGS WARNING 150 TCP/IP error 4120 (Connection reset by remote host), IPAddr 192.168.168.192:7890.
"Connection reset" means that an established connection has been broken, for one of the following reasons:
The application on the other end of the connection terminated (e.g., killed or crashed).
Network problems prevented the TCP/IP protocol stack from receiving a required acknowledgment.
"Connection reset" is rare, and more likely to occur on connections between Extract and a Server/Collector process than for connections between Extract and a MGR process.
Troubleshooting the MGR process on the remote system
Ensure that MGR is running on the remote system, either with netstat (substitute your own Manager port) or from GGSCI:
$ netstat -n | grep 7809
127.0.0.1.37113 127.0.0.1.7809 32768 0 32768 0 ESTABLISHED
127.0.0.1.7809 127.0.0.1.37113 32768 0 32768 0 ESTABLISHED
GGSCI (remote_system) 1> info mgr
Manager is running (IP port sysname.7890).
If not running, start mgr:
GGSCI (remote_system) 3> start mgr
Manager started.
If running, check that MGR is responding to connection requests and commands:
GGSCI (remote_system) 4> send mgr getportinfo detail 
Sending GETPORTINFO, request to MANAGER ...
Dynamic Port List
Starting Index 0
Reassign Delay 3 seconds
Entry Port  Error Process    Assigned            Program
----- ----- ----- ---------- ------------------- -------
    0  7891     0
    1  7892     0
    2  7893     0
    3  7894     0
    4  7895     0
    5  7896     0
    6  7897     0
    7  7898     0
    8  7899     0
If the command times out, kill and restart MGR:
$ ps -f | grep ./mgr
gguser  782474 1171604   0 12:46:30  pts/2  0:04 ./mgr mgr PARAMFILE /home/gguser/v10.4.0.19/dirprm/mgr.prm REPORTFILE /home/gguser/v10.4.0.19dirrpt/MGR.rpt PROCESSID MGR PORT 7809 > /home/gguser/v10.4.0.19/dirout/MGR.out 
$ kill -9 782474
$ ggsci
GoldenGate Command Interpreter for DB2
Version 8.0.4.0 Build 024
Copyright GoldenGate Software, Inc.  1995-2006
GGSCI (axe01) 1> start mgr
Manager started.
Troubleshooting MGR connection problems from the local system
After ensuring that MGR is running and responsive on the remote system, check whether a connection can be established.
If the port number is wrong, or a firewall refuses the connection attempt, telnet reports a refused connection:
$ telnet remote_system_name 7809
Trying...
telnet: connect: A remote host refused an attempted connect operation.
With the correct port number, the connection succeeds:
$ telnet remote_system_name 7890
Trying...
Connected to axe01.
Escape character is '^]'.
^]
telnet> close
Connection closed.
A successful telnet test to the manager port:
ty001279::tcysnc151::TYCONC2:/opt/app/goldengate/dirrpt
>telnet 127.0.0.1 7809
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
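Where telnet is not installed, the same connection test can be sketched with a few lines of Python. This is an illustration under assumptions, not a GoldenGate utility: the function name probe, the 5-second default timeout, and the host and port in the demo are all placeholders. It makes one TCP connection attempt and reports whether it connected, was refused, or timed out.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome.

    Returns "connected", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return "error: {}".format(exc)


if __name__ == "__main__":
    # Placeholder host and port; use the remote system and the Manager
    # port from your own configuration.
    print(probe("127.0.0.1", 7809, timeout=3.0))
```

A "refused" result corresponds to the telnet refusal shown earlier, and "timeout" to a host that never answers. Like telnet, the probe only opens and closes a connection; it sends no GoldenGate commands.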
Verifying you have the correct manager port:
Be certain that if you have more than one OGG install, you are using the correct manager port and process.
1. Use fuser to check the dirpcs directory of the GoldenGate installation directory:
$ cd ~ggs_home/dirpcs
$ fuser *
This prints the process IDs of the processes using the files in dirpcs, which should include Manager.
Compare them with the process ID shown by the following command:
$ ps -ef|grep mgr
2. If you cannot stop Manager from GGSCI, kill the Manager process ID found above:
$ kill -9 <mgr_pid>
3. If that does not work, check the permissions of the ggserr.log file. The GoldenGate user must be able to write to this file.
$ ls -l ggserr.log
-rw-rw-rw-
4. Check that the /etc/hosts file has the right permissions. Owner, group, and others should all be able to read this file.
$ ls -l /etc/hosts
-r--r--r--
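The two permission checks above can also be scripted. The following is a minimal sketch, assuming a POSIX system and that it is run as the GoldenGate user from the installation directory; permission_report is a hypothetical helper name, and the file names in the demo are the ones mentioned in the steps above.

```python
import os
import stat


def permission_report(path):
    """Return the mode string plus read/write checks for the current user."""
    mode = os.stat(path).st_mode
    return {
        "mode": stat.filemode(mode),                  # same form as "ls -l"
        "world_readable": bool(mode & stat.S_IROTH),  # readable by others
        "writable_by_me": os.access(path, os.W_OK),   # current user only
    }


if __name__ == "__main__":
    # ggserr.log must be writable by the GoldenGate user;
    # /etc/hosts must be readable by everyone.
    for path in ("ggserr.log", "/etc/hosts"):
        if os.path.exists(path):
            print(path, permission_report(path))
        else:
            print(path, "not found from this directory")
```

The "writable_by_me" value reflects whichever account runs the script, so a result obtained as root says little about the GoldenGate user.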
Troubleshooting Server/Collector connection problems
Refer to Note 965356.1 for more information about troubleshooting DYNAMICPORTLIST problems.
Look for messages related to starting dynamic Server/Collector processes in ggserr.log on the remote system.
2009-05-18 13:56:43  GGS INFO        301  GoldenGate Manager for DB2, mgr.prm:  Command received from EXTRACT on host 192.168.118.59 (START SERVER CPU -1 PRI -1 PARAMS  -c ON).
2009-05-18 13:56:43  GGS INFO        302  GoldenGate Manager for DB2, mgr.prm:  Manager started collector process (Port 7891).
2009-05-18 13:56:43  GGS INFO     373  GoldenGate Collector, port 7891:  Waiting for connection (started dynamically).
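When ggserr.log is large, the messages above can be picked out with a short script. This sketch assumes the message wording matches the sample lines shown here exactly; other GoldenGate versions may word them differently, and the function name collector_ports is illustrative. It collects the ports for which Manager reported starting a collector and the ports on which a collector reported waiting for a connection.

```python
import re

# Patterns derived only from the sample log lines in this note.
STARTED = re.compile(r"Manager started collector process \(Port (\d+)\)")
WAITING = re.compile(r"Collector, port (\d+):\s+Waiting for connection")


def collector_ports(lines):
    """Return (started, waiting): sets of ports seen in each message type."""
    started, waiting = set(), set()
    for line in lines:
        match = STARTED.search(line)
        if match:
            started.add(int(match.group(1)))
        match = WAITING.search(line)
        if match:
            waiting.add(int(match.group(1)))
    return started, waiting


if __name__ == "__main__":
    sample = [
        "2009-05-18 13:56:43  GGS INFO        302  GoldenGate Manager for DB2, "
        "mgr.prm:  Manager started collector process (Port 7891).",
        "2009-05-18 13:56:43  GGS INFO     373  GoldenGate Collector, port 7891:  "
        "Waiting for connection (started dynamically).",
    ]
    started, waiting = collector_ports(sample)
    print("started:", sorted(started))
    print("started but never waiting:", sorted(started - waiting))
```

A port that appears in the first set but not the second is a candidate for a collector that failed or terminated right after Manager started it.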
If the Server/Collector process (either static or dynamic) is running, but Extract cannot establish a connection, you may attempt to connect to the Server/Collector process using the "telnet" client utility program:
$ telnet remote_server_name 7891
Trying...
telnet: connect: A remote host refused an attempted connect operation.
The above indicates that either the Server/Collector process is not listening to the specified port, or a firewall is refusing the connection request. The "Connection refused" message may be slightly different on different systems, for example:
Trying 192.168.168.192...
telnet: connect to address 192.168.168.192: Connection refused
telnet: Unable to connect to remote host
Successful connections using telnet indicate that the routing is correct and any network firewalls are allowing connections through:
$ telnet remote_system_name 7890
Trying...
Connected to axe01.
Escape character is '^]'.
^]
telnet> close
Connection closed.
Note that the Server/Collector process will terminate when the connection is closed.
Even though telnet may be able to establish a connection, there may still be problems with system software that monitors application activity and blocks connections from unauthorized applications. For example, if "telnet" is "authorized" and "extract" is not, "telnet" can establish connections but "extract" cannot. Whether extract gets "Connection refused", "Connection timeout", or a different error depends on the software that blocks connections from "unauthorized" programs.
Routing problems and firewalls that drop unauthorized packets may cause a connection timeout:
$ telnet 192.168.168.192 12345
Trying...
telnet: connect: A remote host did not respond within the timeout period.
To check for routing problems, use traceroute, a program that traces the path packets take through the network by manipulating the "Time-To-Live" (TTL) value. Each router that forwards a packet decrements the TTL; when the TTL reaches zero, that router discards the packet and reports it to the sender as expired. By sending probes with increasing TTL values, traceroute causes each router along the path to respond in turn. For each hop it displays the "hop count" (TTL value), the address(es) of the router(s) that responded, and the response times for three probe packets. Timeouts are indicated by an "*" instead of the response time.
An example of a successful traceroute:
$ traceroute remotesys.example.com             
trying to get source for remotesys.example.com
source should be 192.168.119.104
traceroute to remotesys.example.com (10.1.51.102) from 192.168.119.104 (192.168.119.104), 30 hops max
outgoing MTU = 1500
 1  rtr-1-v248.example.com (10.155.248.2)  2 ms  1 ms  1 ms
 2  10.10.1.254 (10.10.1.254)  1 ms  1 ms  1 ms
 3  core-254.example.com (10.10.254.253)  1 ms  1 ms  1 ms
 4  remotesys.example.com (10.1.51.102)  1 ms  1 ms  1 ms
In the following example, a filter in the 3rd hop discards the UDP packets used by traceroute, causing the packets to be silently lost from the perspective of the sender.
$ traceroute 192.168.168.192
trying to get source for 192.168.168.192
source should be 192.168.119.104
traceroute to 192.168.168.192 (192.168.168.192) from 192.168.119.104 (192.168.119.104), 30 hops max
outgoing MTU = 1500
 1  r2.example.com (10.155.248.2)  1 ms  1 ms  1 ms
 2  10.10.1.254 (10.10.1.254)  1 ms  1 ms  1 ms
 3  fw.example.com (10.10.9.254)  1 ms  1 ms  1 ms
 4  * * *
 5  * fw.example.com (10.10.9.254)  16 ms *
 6  * * *
 7  * * *
... (similar lines omitted) ...
28  * * *
29  * * *
30  * * *
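Long traceroute listings like the one above can be summarized by finding the last hop that answered all of its probes, which is usually where to start looking for a filter. The following is a rough sketch that assumes output in the layout shown in this note (hop number first, "*" for each timed-out probe); last_responding_hop is a hypothetical helper name.

```python
import re

# A hop line starts with the hop number followed by whitespace.
HOP = re.compile(r"^\s*(\d+)\s+(.*)$")


def last_responding_hop(output):
    """Return (hop_number, hop_text) of the last hop that answered every probe.

    A hop counts as fully responding when its line contains no "*".
    Returns None if no hop responded.
    """
    last = None
    for line in output.splitlines():
        match = HOP.match(line)
        if not match:
            continue  # header lines such as "outgoing MTU = 1500"
        hop, rest = int(match.group(1)), match.group(2)
        if "*" not in rest:
            last = (hop, rest.strip())
    return last


if __name__ == "__main__":
    sample = """\
 1  r2.example.com (10.155.248.2)  1 ms  1 ms  1 ms
 2  10.10.1.254 (10.10.1.254)  1 ms  1 ms  1 ms
 3  fw.example.com (10.10.9.254)  1 ms  1 ms  1 ms
 4  * * *
 5  * fw.example.com (10.10.9.254)  16 ms *
"""
    print(last_responding_hop(sample))
```

Applied to the failing example, this points at hop 3, the device after which the probes are discarded.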
If none of the above helps identify the problem, contact your network administrator to check firewall settings.
An often overlooked issue is that any error that kills the Server/Collector process appears as a TCP error to the sending Extract. If the Server/Collector does not have write privileges to the trail, or if a disk is full, the process dies, and to the sending side a dying process looks like a lost TCP connection. Always verify the ability to write trails as part of the troubleshooting process. This is particularly applicable to the case listed earlier:
The "dynamic" Server/Collector process terminated immediately after starting.
NEW V11 ERROR CODES:
With the switch to v11x code, the error scheme and numbering have changed. No cross-reference between old and new error codes exists. The following error codes were identified through discovery:
WARNING OGG-01223 TCP/IP error 146 (Connection refused)
The cause and solution have not changed; this error now has a unique error message number. See the reasons listed at the beginning of this note for the cause and solution.
When parameter files reside within a DBFS file system, the file system may be corrupted and may need to be rebuilt.
