Overview of distributed control systems, failure examples, and measures to reduce DCS failures

Distributed control systems (DCS) offer strong versatility, flexible system configuration, comprehensive control functions, convenient data processing, centralized display and operation, a user-friendly human-machine interface, simple and standardized installation, easy debugging, and safe, reliable operation. They are widely applied in the power, petrochemical, chemical, metallurgical, and light industries in China and abroad, especially on large generator sets. Many well-known brands are currently in use in China:

(1) Foreign brands: Honeywell, ABB, Westinghouse, Siemens, Yokogawa, etc.
(2) Domestic brands: Guodian Zhishen, Hollysys, Xinhua, Zhejiang University Central Control, etc.

The safety and reliability of the DCS are essential to the stable and secure operation of the unit. Any problem can seriously damage equipment or even lead to personal injury accidents. It is therefore crucial to analyze the operational problems of DCS in thermal power plants and to take measures that improve its safety and reliability.

2. DCS Failure in the Production Process

Each manufacturer's DCS has its own characteristics, so fault analysis and handling differ. However, DCS-related problems that cause unit faults of second-level severity or higher fall into three categories:

(1) System problems, including design and installation defects and hardware and software failures.
(2) Failures caused by human factors, such as misoperation by personnel, imperfect management systems, and non-compliance with procedures.
(3) Failures caused by external environmental factors, such as high temperature, humidity, dust, vibration, or small animals.
2.1 DCS System-Related Failure Examples

These failures are common in production. They include design and installation defects, controller (DPU or CPU) crashes, network disconnection, operator station black screens, network congestion, software defects, under-specified system configuration, and other system and device interface problems.

2.1.1 Power and Grounding Issues

(1) One power plant's DCS uses ABB's Symphony Type III power supply. During construction, however, it was grounded according to the Type II power supply scheme, which differs from the technical requirements of the Type III supply. Since the unit entered production, DCS module failures, signal jumps, and hardware burnouts have been frequent and are suspected to be related to the grounding system. Similarly, at another power plant, problems with the DCS grounding grid during construction caused periodic fluctuations in all RTD and thermocouple temperature points after the system was put into operation.

(2) At one plant, a steam turbine control system failed because of loose power supply connections.

Lessons learned: A poor grounding system and inadequate cable shielding can cause significant interference, leading to false signals and module damage. Problems with the UPS power supply and control system grounding pose great risks to the safe and stable operation of the DCS once the plant is in production. Therefore, the DCS power supply design must include reliable backup and a reasonable load configuration with some margin. DCS grounding must strictly follow the manufacturer's technical requirements (or DL/T 774 if there are no special instructions). All cables carrying control signals into the DCS must be high-quality shielded cables, routed separately from power cables and grounded at one end only.
2.1.2 System Configuration Problems

(1) Frequent failures and crashes of the DCS (T-ME/XP system) at a power plant in Zhejiang caused unit outages. From February to May 1997, two units experienced 22 DCS failures and crashes, resulting in 8 abnormal trips. Several screen failures followed (Unit 8 had "black screens" on two occasions), seriously threatening unit safety. Analysis revealed the following problems: engineering design problems in the performance calculation software and the switch redundancy configuration; hardware configuration mismatches (including communication problems between the T-ME and T-XP systems); individual hardware design flaws; and a bottleneck in the load rate of the CS275 communication bus. European users of the T-ME/XP system generally operate with reasonable configurations.

(2) On a 200 MW unit, the DCS load-rate figures were inaccurate and its technical indicators were close to the allowable limits. The system also had a large number of virtual I/O points, so after the retrofit the load rate of individual controllers exceeded 90% and soft manual operations took nearly 1 minute, making the system unusable. After adjustment, the system was restored.

(3) On a 600 MW unit in northeast China, the bidding specifications did not adequately specify I/O channel isolation, which led to a low DCS configuration. During commissioning many I/O boards burned out, and the isolation method had to be changed and the hardware upgraded. This cost the plant significantly and offset the original bid price advantage. Cable quality and shielding must also be taken seriously. Important signals and controls should use computer-grade shielded cables. Many renovation projects have had to re-lay cables because of cable problems, which delayed construction.

(4) The engineer station of the Xinhua XDPS-400 system on a 300 MW unit crashed frequently.
Inspection found that it was running too many programs: multiple virtual DPUs, historical data recording, performance calculations, and reports. Assigning historical data recording to other HMI stations resolved the problem.

2.1.3 Controller (DPU or CPU) Failure

(1) On 300 MW Unit #2, a CPU in FSSS1 of the HIACS-5000CM control system was faulty, so control could not be transferred or switched to the master, leaving part of the system inoperable. When an online change sequence was executed, the main CPU failed, control switched to the other CPU, and the system was brought under control. After the original main control CPU was replaced, the system returned to normal.

(2) In ABB's Symphony system, data exchanged between different controllers in the same PCU cabinet was inconsistent. Firmware upgrades resolved the problem.

(3) In the early days of the Xinhua XDPS system, a batch of DPUs repeatedly went offline and crashed. After the capacitors on the DPU cards were checked, the problem was resolved through card upgrades and replacements.

Because current DCS controllers are redundantly configured, unit trips caused by an abnormal main controller have become far less frequent. However, if both redundant controllers crash simultaneously, safe production is directly threatened, and measures must be taken to avoid this.

2.1.4 DCS Network Failure

(1) At one power plant, repeated retrofits added a large number of measuring points and automatic control loops to the Westinghouse WDPF control system. The system load rate rose above 70%, causing network congestion, and long black screens occurred when operators repeatedly operated and switched screens. Upgrading to the OVATION system resolved the problem.

(2) A 600 MW unit was running stably at 508 MW when all steam turbine valves suddenly began to swing widely. Investigation showed that the speed signal of controller M5 had dropped from 3000 r/min to 0 r/min for a short time and then recovered.
The valve swing was caused by a data anomaly in communication between M3 and M5, which changed the Trip Bias signal from 0 to 1 and made all valves swing. Measures taken: applying multiple processing to PCU control bus communication signals, adding a delay to communication signals, and using redundant communication for important signals.

2.1.5 DCS Software Issues

(1) During DCS commissioning of a 300 MW heating unit, the quality parameters of the measuring points were not modified, so an analog point was judged bad only when it was disconnected and no full quality check was performed. After the quality parameters of all measuring points were set, equipment reliability improved.

(2) When screens were being configured for the HIACS-5000CM control system, double-clicking the grab configuration tool produced a C++ error window and the tool could not be used. Investigation showed that the grab.ini file had been changed: an error message had been retained in grab.ini because grab had not exited normally. After the file was overwritten with a copy from another machine, the tool returned to normal.

(3) The logic of the deaerator water level control loop was modified on the basis of the high water level control logic. The modification was incomplete and the PID parameters were not set for the deaerator's conditions, so regulation of the water separator on the deaerator diverged during operation and control quality deteriorated. Action taken: check the logic and reset the PID parameters.

2.1.6 System Interface Issues

On a 200 MW heating unit, the electrical grid-connection signal reached the DEH over a single path only. During normal operation, a fault in the grid-connection auxiliary contact caused the turbine to trip. Measures taken: use shielded communication cables, add redundant contact signals, and apply 2-out-of-3 voting logic.
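
The redundant-contact judgment described in 2.1.6 is commonly called two-out-of-three (2oo3) voting: the signal is accepted only when at least two of three independent contacts agree. The following is a minimal sketch of that idea, not the logic of any particular DEH or DCS. The function name `vote_2oo3` and the sample contact states are hypothetical, and a real system would implement the vote in its own configuration logic rather than in Python.

```python
from typing import Sequence


def vote_2oo3(contacts: Sequence[bool]) -> bool:
    """Return True when at least two of three redundant contacts are True."""
    if len(contacts) != 3:
        raise ValueError("2-out-of-3 voting needs exactly three inputs")
    return sum(bool(c) for c in contacts) >= 2


# Hypothetical grid-connection status read from three independent
# auxiliary contacts; the second contact has failed open.
grid_connected = vote_2oo3([True, False, True])
print(grid_connected)
```

With a single-path signal, the one failed contact would have reported "not connected"; with the vote, a single faulty contact cannot change the result in either direction.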
2.2 Human Factors Causing DCS Failures

Failures caused by human factors are also common in production. They include misoperation by personnel, imperfect management systems, and non-compliance with work procedures.

2.2.1 Not Performing Work Steps as Specified

(1) On a Xinhua XDPS system, the faulty DPU #12 of the DEH was replaced with a spare DPU from the MEH system. After the replacement, the logic of master DPU #32 was copied to standby DPU #12 but was not written to its electronic disk. The memory of the standby DPU therefore matched the master, but the electronic disk of DPU #12 still held the MEH (small turbine) logic. After the system was powered off and restarted, DPU #12 started as the master. Because it was running MEH logic instead of DEH logic, system communication was abnormal, data flickered, screens displayed incorrectly, and the HMI stations could not operate. After DPU #12 was powered up again and the logic was copied from DPU #32, the system returned to normal.

(2) In a power plant's HIACS-5000CM control system, a remote I/O card in the circulating water pump house was replaced without following the online replacement procedure, so the card was never activated into the working state and the field device status did not match the DCS screen. After the online replacement steps were performed, the system returned to normal.

2.2.2 Personnel Misoperation

(1) While a unit was in operation, an employee accidentally actuated a relay in the DCS relay cabinet, tripping the induced draft fan and causing a boiler MFT.

(2) While replacing a DCS card, an employee did not carefully check the equipment and the card. A jumper was set incorrectly and the newly installed card burned out.

2.2.3 Inadequate Management System

(1) One power plant's DCS management system was inadequate, and software upgrades and backups were not regulated.
An operator did not make a backup after upgrading and patching, and the operator station hard disk later failed. After the system was restored, the software version was out of date, causing abnormal network communication and data that did not refresh.

(2) At one power plant, the operator stations were poorly managed. The USB ports and optical drives of the hosts in the centralized control room were not effectively disabled. Some operators used the operator stations to play games and watch movies during night shifts, causing the stations to crash.

2.3 DCS Failure Examples Caused by External Environmental Factors

DCS failures caused by external environmental factors are less common than the first two types, but they do occur in production.

(1) At one power plant, an air duct in the electronics room ran above the DPU cabinets. Because of this design, fire water flowed through the air duct into the DCS cabinets during unit operation, burning out DPUs, servers, and other equipment and shutting down the unit.

(2) The remote I/O cabinet in the circulating water pump house of a power plant was poorly sealed at the bottom. Rats entered in winter and nested in the upper part of the cabinet, eventually causing the remote I/O to lose both networks.

(3) Poor sealing of the electronics room at a power plant led to serious dust accumulation on cards and DPUs and to many failures. After the sealing of the electronics room was improved and air conditioners were installed, card and DPU failures were largely eliminated.

These examples show that reducing the probability of DCS failure requires comprehensive work across the whole life of the distributed control system, from selection and design through operation and maintenance.

3. DCS Fault Prevention and Maintenance Measures

3.1 DCS Selection, Design, and Debugging

3.1.1 For both new units and DCS upgrades, the system and controllers should be configured with a focus on reliability and load rate (including redundancy). The design load rate of the communication bus must be kept within a reasonable range. Controller loads should be as balanced as possible, and "high load" problems that result from insufficient investment and affect safe system operation should be avoided.

3.1.2 System control logic should not be concentrated excessively on any one controller. Main controllers should be redundant.

3.1.3 The power supply design must be reasonable and reliable. First, attend to the load rate in the power supply design. Second, configure the power supply redundantly and ensure that the two supplies are independent.

3.1.4 Attend to the reliability of DCS interfaces. Important interfaces should be redundant, and interface methods should be chosen mainly for reliability and real-time performance.

3.1.5 DCS grounding must be implemented according to the manufacturer's requirements to avoid large-scale system failures caused by grounding problems. Consider the system's anti-interference measures and its self-diagnosis and self-recovery capabilities, and ensure that I/O channels are isolated. Cable quality and shielding must also be taken seriously. Important signals and controls should use computer-grade shielded cables.

3.1.6 Fully consider the controllability of the main and auxiliary equipment. Operator stations and backup manual operating devices should be configured according to the operating characteristics of the equipment and the requirements for handling emergency faults of the unit under all working conditions.
The emergency shutdown buttons should use an independent operating circuit separate from the DCS. The human-machine interface must not be blindly "simplified", and the system configuration should put safe production first. Special emergency interventions related to safety cannot rely entirely on the DCS.

3.1.7 For peripheral devices involved in unit safety, such as actuators and valves, design and configuration must ensure that this critical equipment moves in the safe direction or stays in place on loss of power, loss of air, loss of signal, or DCS failure.

3.1.8 Protection systems should acquire signals through multiple redundant channels and make reasonable use of blocking conditions, so that the signal loop can make logical judgments.

3.1.9 During commissioning, all logic, loops, and operating conditions should be tested according to the commissioning outline and specific methods.

3.2 DCS Operation and Start-Stop Maintenance

3.2.1 Maintenance Preparation

Preparation for DCS maintenance includes:

(1) Maintenance personnel should understand the overall design of the system. They should be familiar with the structure and functional composition of the DCS, understand the system hardware, know the normal and abnormal states of components such as controllers, I/O cards, and power supplies, and be proficient in the DCS configuration software.

(2) System backup: back up the operating system, drivers, boot disk, control system software, license disk, and control configuration database, and make sure the control configuration data is up to date and complete. Because optical discs wear easily in practice, keep multiple backups on portable hard drives, USB flash drives, hard disks, and other media so that every piece of software is preserved.
(3) Hardware spares: For vulnerable parts with short service lives and for key components, such as keyboards and mice, I/O modules, power supplies, and communication cards, keep spares according to actual needs, with no fewer than one spare for each type of card and module. Store them as the manufacturer requires and, where conditions permit, test the spares so that their actual condition is known.

(4) Compile the scope and schedule of after-sales service for each product, keep contact records for the technical support personnel of hardware manufacturers and system design units, and make full use of the technical support of DCS suppliers and system design units.

3.2.2 Daily Maintenance

Daily maintenance is the basis for stable and efficient DCS operation. The main tasks are:

(1) Improve the DCS management system in accordance with the 25 counter-measures, the DL/T 774 maintenance and repair procedures, and other regulatory documents.

(2) Keep the electronics room well sealed to keep out small animals and to reduce the adverse effects of dust on component operation and heat dissipation. Ensure that temperature and humidity comply with the manufacturer's requirements, and avoid condensation on system equipment caused by sudden changes in temperature and humidity. Consider bringing the ambient temperature signal of the DCS electronics room to the CRT with an alarm.

(3) Check daily that the fans in the system cabinets are working properly and that the air ducts are not blocked, so that the equipment can operate reliably over the long term.

(4) Ensure the quality of the system power supply and reliable supply from both power sources. Loss of either power source should raise an alarm.
(5) Do not use wireless communication devices in the electronics room, to avoid electromagnetic interference with the system. Avoid moving stations, monitors, and similar equipment, and avoid pulling or bumping connection cables and communication cables.

(6) Standardize the management of DCS system software and application software. Software modifications, updates, and upgrades must go through approval and have a designated responsible person. Non-genuine software and software unrelated to the system are strictly forbidden, and the hosts' USB ports and optical drives must be kept disabled.

(7) Keep good records of system data, such as the PID parameters of each control loop and the direct or reverse action of each regulator.

(8) Check that hardware such as control hosts, monitors, mice, and keyboards is intact and that real-time monitoring is working normally. Check the diagnostic screens for faults.

(9) DCS equipment, including DPUs and HMI stations, should be powered on one at a time in a defined order. Power on the next device only after the previous one has been powered on and observed to be normal, so that any abnormality is easy to analyze. After power-on, communication connectors must not touch conductors such as the cabinet, and redundant communication lines and connectors must not touch each other, to avoid burning out communication network cards.

(10) Regularly test online the communication load rate of the main DCS and of all related systems connected to it. Check the status of redundant master and slave devices. Where conditions permit, switch between master and slave devices periodically, and check and analyze the cause of any device switchover.
(11) Improve configuration readability: add Chinese descriptions to important configuration pages, write detailed logic descriptions and configuration notes for important protection systems, and prepare test operation cards and keep them up to date. Standardize DCS configuration operations, and avoid major configuration changes while the unit is running. Configuration work must be done with care, with adequate technical and safety measures to ensure the safe and stable operation of the DCS and the unit.

(12) Restart all HMI stations one at a time at regular intervals (every 2 to 3 months is recommended) to clear the errors that accumulate during long-term computer operation.

3.2.3 Outage Maintenance

The DCS should be thoroughly maintained during unit maintenance outages, including:

(1) Use the unit maintenance period to reset the DPUs, CPUs, operator stations, and data stations of the DCS one at a time. Delete invalid I/O points from the configuration and optimize the configuration.

(2) System redundancy test: test the redundancy of power supplies, servers, controllers, and communication networks. During the system outage, power off devices and observe whether master-slave switchover, the network, and the HMI stations behave normally. After the system has been overhauled and powered on again, test device switchover again.

(3) System dust removal: while the system is out of service, blow dust out of the entire system, including components inside the computers, control station card cages, power boxes, fans, and cabinet filters.

(4) Overhaul the system power supply circuits, and perform the UPS supply capability test and discharge test. Also check the CMOS battery on the DPU host cards and replace it regularly to prevent loss of CMOS data from a depleted battery.

(5) Grounding system maintenance.
This includes terminal inspection and ground resistance testing.

(6) On-site equipment maintenance: follow the maintenance and repair procedures and refer to the relevant equipment manuals.

(7) Check the interfaces between the DCS and other systems. Handle important signals redundantly. For communication with other systems, adopt measures such as one-way transmission and firewalls according to the specific conditions.

(8) System power-on: after the overhaul, the person in charge of maintenance must confirm that the conditions are met before power-on, which should be carried out in strict accordance with the power-on procedure.

3.2.4 Troubleshooting Maintenance

After a failure, the system requires corrective maintenance, which mainly includes the following tasks:

(1) In daily work, carefully follow the 25 counter-measures and prepare accident anticipation plans for scenarios including DPU (CPU) crashes and network communication collapse. Compile the emergency handling measures, safety measures, technical measures, and maintenance steps into a manual to ensure the safe operation of the unit.

(2) Handling DCS failures: following the manufacturer's application manual, confirm before replacement that the card or module model, address (which must not conflict with other equipment addresses), jumpers, and so on match the card being replaced, and strictly follow the online replacement procedure.

(3) Corrective maintenance must also strictly follow the work ticket system to avoid rushed repairs, and each fault should be analyzed in detail according to its specific symptoms. Use the DCS self-diagnosis alarms and the fault symptoms to locate the fault point, and verify the repair by confirming that the alarm clears. For example, a communication connector with poor contact may cause a communication failure.
Once poor contact in the communication connector is confirmed, remake the connector with the proper tool. If the communication line is damaged, replace it promptly. If a card's fault light flashes or all data on the card reads zero, possible causes are incorrect configuration information, the card being in the standby state, a redundant terminal cable not being connected, a fault in the card itself, or a slot with no configuration information. When a production state is abnormal or in alarm, first find the instrument that reflects that state, then follow the signal upstream, checking the signal with a test instrument at each stage until the fault is found.

(4) Troubleshooting field equipment requires a work ticket, together with DCS signal forcing and isolation measures. A bypass valve should be used when a valve is serviced. After the overhaul is completed, notify the centralized control room personnel to carry out the inspection, and the operator should switch the automatic control loop to manual.

(5) In the event of a large-scale hardware failure, an unexplained fault, or a fault beyond the technical level of the plant's maintenance personnel, carry out the emergency spare parts replacement and also contact the manufacturer promptly so that its technical support engineers can further confirm and troubleshoot the fault.

4. Conclusion

The DCS should be managed comprehensively through design, construction, commissioning, and operation. System maintenance personnel should formulate scientific, reasonable, and feasible maintenance strategies and methods based on the system configuration and the control of production equipment, and carry out preventive maintenance. Daily maintenance should be closely coordinated with systematic, planned, and regular maintenance.
Each fault that occurs during operation should be analyzed in detail. The key to reducing DCS failures is to put prevention first and to keep the system operating well in its required environment over the long term.
