Edwin & Xinyu's Blog  

查无此人,查有此地

mycat troubleshooting


难得再次执笔

记录一下一次mycat的troubleshooting的过程

何为mycat

DAL  see at https://github.com/MyCATApache/Mycat-Server

处于可控性和社区的活动性,当初我引入mycat时选择了1.4 final版本,并且固化在公司私有git库

问题现象

项目开发人员调试java代码时发现mycat无法响应,现象为执行sql无反馈,卡死

查看确认:

  1. mysql客户端无法登陆
  2. telnet可连接,但无反馈
  3. 系统日志停止刷新
  4. 结合debug日志和oem,并和开发人员确认断点调试某select for update不久就发生mycat无响应
    问题追查
  5. 抓取heapdump和threaddump

  6. heapdump排除内存相关问题
  7. threaddump发现如下日志:
    INFO | jvm 1 | 2016/03/24 10:35:33 | “Timer0” daemon prio=10 tid=0x00007faea0001800 nid=0xa4f waiting for monitor entry [0x00007faed500d000]
    INFO | jvm 1 | 2016/03/24 10:35:33 | java.lang.Thread.State: BLOCKED (on object monitor)
    INFO | jvm 1 | 2016/03/24 10:35:33 | at oracle.jdbc.driver.PhysicalConnection.isClosed(PhysicalConnection.java:2959)
    INFO | jvm 1 | 2016/03/24 10:35:33 | - waiting to lock <0x00000007d64c8e60> (a oracle.jdbc.driver.T4CConnection)
    INFO | jvm 1 | 2016/03/24 10:35:33 | at org.opencloudb.jdbc.JDBCConnection.isClosed(JDBCConnection.java:119)
    INFO | jvm 1 | 2016/03/24 10:35:33 | at org.opencloudb.net.NIOProcessor.backendCheck(NIOProcessor.java:196)

INFO | jvm 1 | 2016/03/24 10:32:15 |
INFO | jvm 1 | 2016/03/24 10:32:15 | “BusinessExecutor1” daemon prio=10 tid=0x00007fae9ca5e000 nid=0xa4e runnable [0x00007faed510d000]
INFO | jvm 1 | 2016/03/24 10:32:15 | java.lang.Thread.State: RUNNABLE
INFO | jvm 1 | 2016/03/24 10:32:15 | at java.net.SocketInputStream.socketRead0(Native Method)
INFO | jvm 1 | 2016/03/24 10:32:15 | at java.net.SocketInputStream.read(SocketInputStream.java:152)
INFO | jvm 1 | 2016/03/24 10:32:15 | at java.net.SocketInputStream.read(SocketInputStream.java:122)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.net.ns.Packet.receive(Packet.java:311)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.net.ns.DataPacket.receive(DataPacket.java:105)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:305)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.net.ns.NetInputStream.read(NetInputStream.java:249)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.net.ns.NetInputStream.read(NetInputStream.java:171)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.net.ns.NetInputStream.read(NetInputStream.java:89)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.T4CMAREngineStream.unmarshalUB1(T4CMAREngineStream.java:429)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:397)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:587)

 

INFO | jvm 1 | 2016/03/24 10:32:15 | “$_NIOREACTOR-1-RW” prio=10 tid=0x00007fae9ca07800 nid=0xa48 waiting for monitor entry [0x00007faed595e000]
INFO | jvm 1 | 2016/03/24 10:32:15 | java.lang.Thread.State: BLOCKED (on object monitor)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.PhysicalConnection.isClosed(PhysicalConnection.java:2959)
INFO | jvm 1 | 2016/03/24 10:32:15 | - waiting to lock <0x00000007d64c8e60> (a oracle.jdbc.driver.T4CConnection)
INFO | jvm 1 | 2016/03/24 10:32:15 | at org.opencloudb.jdbc.JDBCConnection.isClosed(JDBCConnection.java:119)
INFO | jvm 1 | 2016/03/24 10:32:15 | at org.opencloudb.backend.ConMap.getActiveCountForDs(ConMap.java:106)
INFO | jvm 1 | 2016/03/24 10:32:15 | at org.opencloudb.backend.PhysicalDatasource.getActiveCount(PhysicalDatasource.java:290)
INFO | jvm 1 | 2016/03/24 10:32:15 | at org.opencloudb.backend.PhysicalDatasource.getConnection(PhysicalDatasource.java:368)

 

INFO | jvm 1 | 2016/03/24 10:32:15 | “$_NIOREACTOR-0-RW” prio=10 tid=0x00007fae9c9d0000 nid=0xa47 waiting for monitor entry [0x00007faed5a5f000]
INFO | jvm 1 | 2016/03/24 10:32:15 | java.lang.Thread.State: BLOCKED (on object monitor)
INFO | jvm 1 | 2016/03/24 10:32:15 | at oracle.jdbc.driver.PhysicalConnection.isClosed(PhysicalConnection.java:2959)
INFO | jvm 1 | 2016/03/24 10:32:15 | - waiting to lock <0x00000007d64c8e60> (a oracle.jdbc.driver.T4CConnection)
INFO | jvm 1 | 2016/03/24 10:32:15 | at org.opencloudb.jdbc.JDBCConnection.isClosed(JDBCConnection.java:119)
INFO | jvm 1 | 2016/03/24 10:32:15 | at org.opencloudb.backend.ConMap.getActiveCountForDs(ConMap.java:106)
INFO | jvm 1 | 2016/03/24 10:32:15 | at org.opencloudb.backend.PhysicalDatasource.getActiveCount(PhysicalDatasource.java:290)

 

 

可以注意到活动线程被对象0x00000007d64c8e60 block,并确认该对象是数据库查询时execute产生

INFO | jvm 1 | 2016/03/24 10:35:29 | at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1111)
INFO | jvm 1 | 2016/03/24 10:35:29 | at oracle.jdbc.driver.OracleStatement.executeQuery(OracleStatement.java:1309)
INFO | jvm 1 | 2016/03/24 10:35:29 | - locked <0x00000007d64c8e60> (a oracle.jdbc.driver.T4CConnection)
INFO | jvm 1 | 2016/03/24 10:35:29 | at oracle.jdbc.driver.OracleStatementWrapper.executeQuery(OracleStatementWrapper.java:422)
INFO | jvm 1 | 2016/03/24 10:35:29 | at org.opencloudb.jdbc.JDBCConnection.ouputResultSet(JDBCConnection.java:388)
INFO | jvm 1 | 2016/03/24 10:35:29 | at org.opencloudb.jdbc.JDBCConnection.executeSQL(JDBCConnection.java:317)

mycat最后的应用日志如下:

03/24 10:08:02.739 INFO $_NIOREACTOR-1-RW -not ilde connection in pool,create new connection for hostM1 of schema DBCMOLA1
03/24 10:08:02.897 DEBUG $_NIOREACTOR-0-RW -ServerConnection [id=5498, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=true, schema=CMOLA1]SET autocommit=0
03/24 10:08:02.946 DEBUG $_NIOREACTOR-0-RW -ServerConnection [id=5500, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=false, schema=CMOLA1]/!mycat:schema=CMOLA1/SELECT SQ_TT_SALES_PART.nextval AS salePartId FROM dual
03/24 10:08:02.946 DEBUG $_NIOREACTOR-0-RW -SQLRouteCache miss cache ,key:CMOLA1/!mycat:schema=CMOLA1/SELECT SQ_TT_SALES_PART.nextval AS salePartId FROM dual
03/24 10:08:02.946 DEBUG $_NIOREACTOR-0-RW -ServerConnection [id=5500, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=false, schema=CMOLA1]SELECT SQ_TT_SALES_PART.nextval AS salePartId FROM dual, route={
1 -> DNCMOLA1{SELECT SQ_TT_SALES_PART.nextval AS salePartId FROM dual}
} rrs
03/24 10:08:02.946 DEBUG $_NIOREACTOR-0-RW -found connections in session to use JDBCConnection [id=60,autocommit=false,pool=org.opencloudb.jdbc.JDBCDatasource@1a54190, schema=DBCMOLA1, dbType=ORACLE, oldSchema=DBCMOLA1, packetId=18, txIsolation=0, running=false, borrowed=true, host=null, port=-1, con=oracle.jdbc.driver.T4CConnection@7103ac26, respHandler=SingleNodeHandler [node=DNCMOLA1{SELECT SalesPart.SALE_PART_ID AS SalesPart_SALE_PART_ID, SalesPart.DEALER_CODE AS SalesPart_DEALER_CODE, SalesPart.SALES_PART_NO AS SalesPart_SALES_PART_NO, SalesPart.CUSTOMER_NAME AS SalesPart_CUSTOMER_NAME, SalesPart.CONTACT AS SalesPart_CONTACT, SalesPart.SALES_PART_AMOUNT AS SalesPart_SALES_PART_AMOUNT, SalesPart.IS_FINISHED AS SalesPart_IS_FINISHED, SalesPart.FINISHED_DATE AS SalesPart_FINISHED_DATE, SalesPart.FINISHED_BY AS SalesPart_FINISHED_BY, SalesPart.REMARK AS SalesPart_REMARK, SalesPart.UPDATE_DATE AS SalesPart_UPDATE_DATE, SalesPart.UPDATE_BY AS SalesPart_UPDATE_BY, SalesPart.CREATE_DATE AS SalesPart_CREATE_DATE, SalesPart.CREATE_BY AS SalesPart_CREATE_BY FROM TT_SALES_PART SalesPart WHERE SalesPart.SALES_PART_NO = ‘CSO2016030055’ AND SalesPart.DEALER_CODE = ‘2100000’}, packetId=17], attachement=DNCMOLA1{SELECT SalesPart.SALE_PART_ID AS SalesPart_SALE_PART_ID, SalesPart.DEALER_CODE AS SalesPart_DEALER_CODE, SalesPart.SALES_PART_NO AS SalesPart_SALES_PART_NO, SalesPart.CUSTOMER_NAME AS SalesPart_CUSTOMER_NAME, SalesPart.CONTACT AS SalesPart_CONTACT, SalesPart.SALES_PART_AMOUNT AS SalesPart_SALES_PART_AMOUNT, SalesPart.IS_FINISHED AS SalesPart_IS_FINISHED, SalesPart.FINISHED_DATE AS SalesPart_FINISHED_DATE, SalesPart.FINISHED_BY AS SalesPart_FINISHED_BY, SalesPart.REMARK AS SalesPart_REMARK, SalesPart.UPDATE_DATE AS SalesPart_UPDATE_DATE, SalesPart.UPDATE_BY AS SalesPart_UPDATE_BY, SalesPart.CREATE_DATE AS SalesPart_CREATE_DATE, SalesPart.CREATE_BY AS SalesPart_CREATE_BY FROM TT_SALES_PART SalesPart WHERE SalesPart.SALES_PART_NO = ‘CSO2016030055’ AND SalesPart.DEALER_CODE = ‘2100000’}, headerOutputed=false, modifiedSQLExecuted=true, startTime=1458783636089, lastTime=1458785282351, isSpark=false, processor=org.opencloudb.net.NIOProcessor@7b7a12c4] for DNCMOLA1{SELECT SQ_TT_SALES_PART.nextval AS salePartId FROM dual}
03/24 10:08:03.056 DEBUG $_NIOREACTOR-1-RW -ServerConnection [id=5497, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=true, schema=CMOLA1]SET autocommit=0
03/24 10:08:03.078 DEBUG $_NIOREACTOR-0-RW -ServerConnection [id=5498, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=false, schema=CMOLA1]/!mycat:schema=CMOLA1/SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE
03/24 10:08:03.078 DEBUG $_NIOREACTOR-0-RW -SQLRouteCache miss cache ,key:CMOLA1/!mycat:schema=CMOLA1/SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE
03/24 10:08:03.078 DEBUG $_NIOREACTOR-0-RW -ServerConnection [id=5498, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=false, schema=CMOLA1]SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE, route={
1 -> DNCMOLA1{SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE}
} rrs
03/24 10:08:03.106 DEBUG $_NIOREACTOR-1-RW -ServerConnection [id=5497, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=false, schema=CMOLA1]/!mycat:schema=CMOLA1/SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE
03/24 10:08:03.107 DEBUG $_NIOREACTOR-1-RW -SQLRouteCache miss cache ,key:CMOLA1/!mycat:schema=CMOLA1/SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE
03/24 10:08:03.107 DEBUG $_NIOREACTOR-1-RW -ServerConnection [id=5497, schema=CMOLA1, host=10.11.98.179, user=test,txIsolation=3, autocommit=false, schema=CMOLA1]SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE, route={
1 -> DNCMOLA1{SELECT SYSTEM_ORDER_NO_ID, DEALER_CODE, ORDER_PREFIX, YYYY_MM, ORDER_SEQUENCE, CREATE_BY, UPDATE_BY, CREATE_DATE, UPDATE_DATE FROM TC_SYSTEM_ORDER_NO WHERE ORDER_PREFIX = ‘CSO’ AND YYYY_MM = ‘201603’ AND TC_SYSTEM_ORDER_NO.DEALER_CODE = ‘2100000’ FOR UPDATE}
} rrs

 

 

结合dba的oem监控,基本确认即由于debug select for update 语句导致。

why?at org.opencloudb.jdbc.JDBCConnection.isClosed?

具体lock点在 oracle.jdbc.driver.PhysicalConnection.isClosed(PhysicalConnection.java:2959)
waiting to lock <0x00000007d64c8e60> (a oracle.jdbc.driver.T4CConnection)
org.opencloudb.jdbc.JDBCConnection.isClosed(JDBCConnection.java:119)

查下mycat源码,确认此处为包装类判断java.sql.Connection是否closed

@Override
public boolean isClosed() {
try {
return con == null || con.isClosed();
} catch (SQLException e) {
return true;
}
}

由于后端为oracle数据库,查看ojdbc驱动代码,发现了玄机

public synchronized boolean isClosed()
throws SQLException
{
return this.lifecycle != 1;
}

判断是否关闭方法是synchronized的!!!

原因很明确了,因为connection被挂起,导致该方法堵塞,进而后端大量worker的等待

解决方案

坑,反ojdbc风险较大,只能拿native api开刀了,以下方法是可以考虑的

public boolean getAutoCommit()
throws SQLException
{
if (this.lifecycle != 1)
{
SQLException localSQLException = DatabaseError.createSqlException(getConnectionDuringExceptionHandling(), 8);
localSQLException.fillInStackTrace();
throw localSQLException;
}

return this.autocommit;
}

 

最后,哪天有空了再提交github吧,论开源哪些坑,论ojdbc那些坑

validate