Oracle Database Crashes Unexpectedly Due to a Control File Bug

Author: Zhao Xiangru (赵象如). Published 2018-09-18 under Experience Cases; last edited 2018-09-18.

Keywords: Oracle 11.2.0.2; controlfile might be corrupted; Oracle bug; database

I. Network Topology

Not applicable.

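II. Problem Description

Oracle databases (Enterprise Edition, 11.2.0.2 and later) running on AIX, Linux, VMware ESX, and other operating systems can crash unexpectedly during normal operation. In an Oracle RAC environment, the crash of one node also brings down the other node, which interrupts business services.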

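III. Analysis

1. Operating system logs: at the time of the unexpected crash, the server logged no disk or I/O errors and no FC link errors.

2. Storage logs: at the time of the unexpected crash, neither storage controller logged I/O timeouts, and the storage FC links showed no frame loss, timeouts, or disconnect/reconnect events.

3. Database alert log: every crash was caused by the Oracle control file bug (BUG:14281768 - CONTROL FILE GETS CORRUPTED). The relevant log excerpt is shown below: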

Tue Jul 10 02:30:23 2018

CJQ0 started with pid=1318, OS id=30278296

Tue Jul 10 02:47:39 2018

********************* ATTENTION: ********************

The controlfile header block returned by the OS

has a sequence number that is too old.

The controlfile might be corrupted.

PLEASE DO NOT ATTEMPT TO START UP THE INSTANCE

without following the steps below.

RE-STARTING THE INSTANCE CAN CAUSE SERIOUS DAMAGE

TO THE DATABASE, if the controlfile is truly corrupted.

In order to re-start the instance safely,

please do the following:

(1) Save all copies of the controlfile for later

analysis and contact your OS vendor and Oracle support.

(2) Mount the instance and issue:

ALTER DATABASE BACKUP CONTROLFILE TO TRACE;

(3) Unmount the instance.

(4) Use the script in the trace file to

RE-CREATE THE CONTROLFILE and open the database.

*****************************************************

USER (ospid: 33751858): terminating the instance

Tue Jul 10 02:47:39 2018

opiodr aborting process unknown ospid (20054898) as a result of ORA-1092

Tue Jul 10 02:47:40 2018

System state dump requested by (instance=1, osid=33751858), summary=[abnormal instance termination].

System State dumped to trace file /oracle/diag/rdbms/aml/aml/trace/aml_diag_4260140.trc

Instance terminated by USER, pid = 33751858

Tue Jul 10 03:49:32 2018

Starting ORACLE instance (normal)

Oracle's official description of the control file bug:

https://support.oracle.com/epmos/faces/BugDisplay?_afrLoop=337451774049792&id=14281768&_afrWindowMode=0&_adf.ctrl-state=akve54u0x_4
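The banner in the alert log excerpt above is distinctive enough to search for automatically. The following is a minimal sketch, not an Oracle tool: the function name `find_controlfile_warnings` is hypothetical, and it assumes the alert log uses the plain-text layout shown above, where a timestamp line such as `Tue Jul 10 02:47:39 2018` precedes each message.

```python
import re

# Banner text copied from the alert log excerpt above.
BANNER = "The controlfile might be corrupted."
# Timestamp lines in the excerpt look like "Tue Jul 10 02:47:39 2018".
TIMESTAMP = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$"
)


def find_controlfile_warnings(lines):
    """Return the timestamp that precedes each corruption banner.

    `lines` is any iterable of alert log lines. A banner that appears
    before any timestamp line is reported with None.
    """
    hits = []
    last_timestamp = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            last_timestamp = line
        elif BANNER in line:
            hits.append(last_timestamp)
    return hits
```

Passing an open file handle for the alert log to `find_controlfile_warnings` yields one timestamp per occurrence, which can then be compared with the operating system and storage logs for the same time window.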

IV. Solution

1. Oracle's official solution is as follows:

Solution

The error is typically raised when the controlfile is overwritten by an older copy of the controlfile. Most likely this happened due to a storage or I/O error.
All copies of the control file must have the same internal sequence number for Oracle to start up the database or shut it down in normal or immediate mode.

The solution is given in the accompanying message:

(1) Save all copies of the controlfile for later
analysis and contact your OS vendor and Oracle support.
(2) Mount the instance and issue:
ALTER DATABASE BACKUP CONTROLFILE TO TRACE;
(3) Unmount the instance.
(4) Use the script in the trace file to
RE-CREATE THE CONTROLFILE and open the database.

To add a sanity check for the future, set the following parameter:
SQL> alter system set "_controlfile_update_check"='HIGH' scope=spfile; -- then bounce the database.
Please check with your OS/storage administrator regarding the issue.
As a precaution, relocate the control file to a fast disk with direct I/O enabled; the main goal is to prevent the OS from writing an old (cached) copy of the controlfile over it.
To reverse the parameter setting:
SQL> alter system set "_controlfile_update_check"='OFF' scope=spfile; -- then bounce the database.

Oracle's official solution for the control file bug:

https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=337455112646248&id=1589355.1&_afrWindowMode=0&_adf.ctrl-state=akve54u0x_54
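The note quoted above says that all copies of the control file must carry the same internal sequence number, and the alert log complains that the header returned by the OS has a sequence number that is "too old". The toy model below only illustrates that comparison. It does not read real control files (Oracle performs this check internally and the file format is not documented here); the function names, file paths, and sequence values are all hypothetical.

```python
def stale_copies(expected_seq, header_seqs):
    """Return the copies whose header sequence number is older than expected.

    expected_seq: the sequence number the instance last wrote (assumed input).
    header_seqs:  mapping of control file path -> sequence number read back.
    """
    return sorted(
        path for path, seq in header_seqs.items() if seq < expected_seq
    )


def copies_consistent(header_seqs):
    """Return True when every copy carries the same sequence number."""
    return len(set(header_seqs.values())) <= 1
```

In this model, a cached or overwritten copy shows up as a path returned by `stale_copies`, which corresponds to the situation in which the instance terminates itself rather than continue with a possibly outdated control file.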

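2. The official solution suggests the following approaches:

1) If the production database runs on a file system, use an IT maintenance or service window to change the I/O handling mechanism so that it bypasses the file system cache, which works around the control file bug. (If Oracle later releases a fix patch or resolves the bug in a newer version, upgrading to that version is recommended to fix the root cause.)

2) If the production database uses ASM, ask a database expert to assess whether the hidden parameter needs to be changed to work around the control file bug. (If Oracle later releases a fix patch or resolves the bug in a newer version, upgrading to that version is recommended to fix the root cause.)

3) I/O is where the database and the storage are most tightly coupled, so production environments should avoid I/O performance pressure that could trigger this bug. Note the following:

a) Where possible, plan dedicated ports on the storage array or FC switch for each business system, so that the links can meet that system's performance requirements. If several business systems share ports and links, the combined peak load of all systems (IOPS and bandwidth) must not exceed the combined capacity of all links; otherwise performance problems are likely.

b) Monitor the performance of the storage, servers, and databases in real time, and set up alerting so that problems are detected and handled early.

c) All disks used by a given database must have the same specification: all SSDs, all 15K HDDs, or all 10K HDDs. The RAID type on the storage must also be the same, and every RAID group must contain the same number of disks. Because databases are extremely sensitive to storage I/O, a poorly planned layout easily causes performance problems for the business on top, and can even bring it down.

d) If a database uses both SSDs and HDDs, the LUNs presented to the operating system from the different disk specifications must be managed separately at the operating system layer, the database layer, or in third-party volume management software. Do not place LUNs from different disk specifications in the same volume group or DG. The data of any given table must reside entirely on SSD-backed volumes or entirely on volumes backed by HDDs of the same rotational speed. Otherwise, database I/O performance jitter occurs easily during business peaks, which directly affects the healthy and stable operation of the database. (SSDs and HDDs, HDDs of different rotational speeds, RAID groups of different levels, and same-level RAID groups built from different numbers of disks all differ in performance, sometimes greatly.)

e) When the storage presents LUNs from same-specification SSDs or HDDs to the operating system, a second level of striping can be applied at the operating system layer, the database layer, or in third-party volume management software, so that reads and writes of table data in the database files are spread as evenly as possible across all underlying disks (especially HDDs).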
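Items a), c), and d) above can be turned into simple planning checks. The sketch below is illustrative only: the function names, the tuple layouts, and the sample figures are assumptions made for this example and do not come from any storage or Oracle tool.

```python
from collections import defaultdict


def links_can_carry_peak(workloads, links):
    """Item a): summed peak demand must not exceed summed link capacity.

    workloads: list of (peak_iops, peak_mb_per_s), one per business system.
    links:     list of (max_iops, max_mb_per_s), one per shared port or link.
    """
    need_iops = sum(w[0] for w in workloads)
    need_bw = sum(w[1] for w in workloads)
    have_iops = sum(l[0] for l in links)
    have_bw = sum(l[1] for l in links)
    return need_iops <= have_iops and need_bw <= have_bw


def mixed_groups(luns):
    """Items c) and d): list the disk groups that mix LUN specifications.

    luns: list of (disk_group, media, raid_level, disks_per_raid_group).
    A group is reported when its LUNs differ in any of the last three fields.
    """
    specs = defaultdict(set)
    for group, media, raid, width in luns:
        specs[group].add((media, raid, width))
    return sorted(g for g, s in specs.items() if len(s) > 1)
```

For example, a disk group whose LUNs come from both SSD and 15K HDD RAID groups would be returned by `mixed_groups`, flagging it for separation before the database goes into production.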

V. Risk Notice

Back up important or core databases regularly and build disaster recovery capability to ensure business security and continuity.
