SAP HANA System Replication Monitoring — the Splunk Way (March 2023)

Downtime is the consequence of outages, which may be intentional (for example, for system upgrades) or caused by unplanned faults. A fault can be due to equipment malfunction, software or network failures, or even a major disaster such as a fire, a regional power loss, or a construction accident that takes the entire data center out of service.

As an in-memory database, SAP HANA is concerned not only with maintaining the reliability of its data in the event of failures, but also with resuming operations, with most of that data loaded back into memory, as quickly as possible.

As an SAP Technical Architect who has worked with Fortune 500 companies on the design and implementation of SAP migration and replication projects, I have seen that SAP migration projects require large teams and significant manual effort to plan and execute the cutover activities needed for a successful migration.

At RHONDOS, SAP PowerConnect and consultants with years of SAP experience provide visibility and insights into SAP systems, which simplifies the planning and preparation for a successful SAP S/4HANA system replication. SAP-certified monitoring solutions like SAP PowerConnect have helped Fortune 500 companies with complex business processes and vast amounts of data to ensure business continuity, data availability, and disaster recovery, while improving performance, enhancing flexibility, enabling testing, and reducing costs.

So, how do we avoid undesired downtime situations? What type of support is available in SAP HANA for high availability?

SAP HANA supports the following recovery measures from failures: 

Disaster recovery support: 

  • Backups: Periodic saving of database copies in safe places; 

  • Storage replication: Continuous replication (mirroring) between primary storage and backup storage over a network (may be synchronous);

  • System replication: Continuous update of secondary systems by primary system, including in-memory table loading. 

Fault recovery support: 

  • Service auto-restart: Automatic restart of stopped services on host (watchdog);

  • Host auto-failover: Automatic failover from crashed host to standby host in the same system; 

  • System replication: Continuous update of secondary systems by primary system, including in-memory table loading and read-only access on the secondary. 

How does system replication work?

Once SAP HANA system replication is enabled, each server process on the secondary system establishes a connection with its primary system counterpart and requests a snapshot of the data. From then on, all logged changes in the primary system are replicated continuously. Whenever logs are persisted (meaning they are written to the log volumes of each service) in the primary system, they are also sent to the secondary system. 
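To make the flow above concrete, here is a deliberately simplified toy model in Python. It is not how SAP HANA implements replication internally; the class and method names (`Primary`, `Secondary`, `enable_replication`, `write`) are hypothetical and exist only to illustrate the two phases: an initial snapshot, followed by continuous shipping of every persisted log entry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Secondary:
    """Toy secondary service: a data snapshot plus replayed log entries."""
    data: dict = field(default_factory=dict)
    received_log: list = field(default_factory=list)

    def load_snapshot(self, snapshot: dict) -> None:
        # Phase 1: start from a copy of the primary's data.
        self.data = dict(snapshot)

    def receive(self, entry: tuple) -> None:
        # Phase 2: replay each shipped log entry.
        key, value = entry
        self.received_log.append(entry)
        self.data[key] = value


@dataclass
class Primary:
    """Toy primary service with a log volume and an optional secondary."""
    data: dict = field(default_factory=dict)
    log_volume: list = field(default_factory=list)
    secondary: Optional[Secondary] = None

    def enable_replication(self, secondary: Secondary) -> None:
        # The secondary requests a snapshot when replication is enabled.
        secondary.load_snapshot(self.data)
        self.secondary = secondary

    def write(self, key: str, value: int) -> None:
        entry = (key, value)
        self.log_volume.append(entry)  # persist the log entry locally
        self.data[key] = value
        if self.secondary is not None:
            self.secondary.receive(entry)  # ship it to the secondary


primary = Primary()
primary.write("a", 1)                  # written before replication starts
secondary = Secondary()
primary.enable_replication(secondary)  # snapshot transfers "a"
primary.write("b", 2)                  # shipped as a log entry
```

After these calls, the secondary holds both keys: the first arrived through the snapshot and the second through log shipping.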

Common issues in system replication 

What if the network connection between the primary and secondary site is lost? 

The connection between the primary and the secondary system must be available for replication. If this is not the case for a certain time, the redo log cannot be shipped to the secondary system, the log segments start piling up on the primary, and the secondary system is not ready for takeover.  

What if there are intermittent connectivity problems? 

A common intermittent error is that the log buffer is not shipped in a timely fashion from the primary to the secondary site, so the system replication log replay backlog increases. A delayed log replay on the secondary system causes a longer takeover time.
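Both kinds of backlog described above can be estimated from log positions. The sketch below assumes a row shaped like the SAP HANA monitoring view M_SERVICE_REPLICATION (columns LAST_LOG_POSITION, SHIPPED_LOG_POSITION, REPLAYED_LOG_POSITION) and assumes that one log position corresponds to 64 bytes. Treat both as assumptions to verify against your HANA revision; the function name `backlog_mb` is hypothetical.

```python
# Assumption: one log position unit equals 64 bytes.
LOG_POSITION_UNIT_BYTES = 64


def backlog_mb(row: dict) -> dict:
    """Estimate shipping and replay backlog in MB from log positions."""
    # Log written on the primary but not yet shipped to the secondary.
    shipping = row["LAST_LOG_POSITION"] - row["SHIPPED_LOG_POSITION"]
    # Log shipped to the secondary but not yet replayed there.
    replay = row["SHIPPED_LOG_POSITION"] - row["REPLAYED_LOG_POSITION"]
    to_mb = LOG_POSITION_UNIT_BYTES / (1024 * 1024)
    return {
        "shipping_backlog_mb": max(shipping, 0) * to_mb,
        "replay_backlog_mb": max(replay, 0) * to_mb,
    }


# Illustrative values only, not taken from a real system.
sample_row = {
    "LAST_LOG_POSITION": 3 * 16384,
    "SHIPPED_LOG_POSITION": 2 * 16384,
    "REPLAYED_LOG_POSITION": 1 * 16384,
}
result = backlog_mb(sample_row)
```

A growing shipping backlog points at the network or the primary, while a growing replay backlog points at the secondary, which is why it helps to track the two separately.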

What is the impact on business if system replication issues occur? 

If system replication in SAP HANA doesn’t go as planned, it can have a significant impact on a business, which can lead to: 

  • Data loss: if replication fails, the secondary system may not have the latest data, which can result in data loss;

  • Downtime: if the primary system fails, the secondary system should take over to ensure continuity of operations. If replication is not working, the secondary system may not be able to take over, leading to unplanned downtime; 

  • Business disruption: if the primary system goes down and the secondary system is not available, it can disrupt the normal business operations and negatively impact the company's operations, revenue, and reputation; 

  • Increased costs: if the primary system fails, manual intervention may be required to resolve the issue, which can be time-consuming and costly. 

In summary, it's critically important for system replication in SAP HANA to be working properly to avoid potential business disruptions and ensure the continuity of operations.

How does monitoring SAP HANA replication help? 

Monitoring the replication process in real time helps confirm that the data on the secondary system stays consistent and up to date with the primary system, reducing the risk of data loss or corruption. Detecting and resolving performance bottlenecks and other issues early improves the overall performance of the SAP HANA system and helps organizations comply with regulatory requirements for data protection and disaster recovery.

SAP HANA Replication Monitoring Dashboard in Splunk 

Prerequisite in SAP:  

  • Install and enable the PowerConnect Agent (ABAP/JAVA/CLOUD) on the SAP system;

  • PowerConnect Agent Group Definition;

  • Enable HANA Alert Extractor;

  • Enable HDBSCRIPTS Extractor.

Prerequisite in Splunk:

  • Steps to install PowerConnect Agent;

  • Download software package: SAP PowerConnect for Splunk | Splunkbase.

Dashboards — SAP HANA Replication Monitoring:

KPIs:  

Panel 1: Trend of HANA replication issues over time

The first panel visualizes when each type of issue occurs over the selected time interval: connection issues, configuration parameter mismatches, log replay backlog, increased log shipping backlog, ASYNC replication in-memory buffer overflow, inconsistent fallback snapshots, and system replication support issues in ESS.

Panel 2: Count of replication issues

The pie chart shows the count of each replication issue, which reveals the most common HANA replication issues and reduces the time spent on analysis and resolution.
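The aggregation behind the first two panels can be sketched in a few lines of Python. This is not the Splunk search that drives the dashboard; the event fields (`time`, `issue`) and the sample events are hypothetical and only illustrate counting issues overall and per hour.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert events; field names are illustrative only.
events = [
    {"time": "2023-03-01T10:05:00", "issue": "log shipping timeout"},
    {"time": "2023-03-01T10:40:00", "issue": "log shipping timeout"},
    {"time": "2023-03-01T11:15:00", "issue": "connection lost"},
]


def count_by_issue(events: list) -> Counter:
    """Panel 2 style: total count per issue type."""
    return Counter(e["issue"] for e in events)


def trend_by_hour(events: list) -> Counter:
    """Panel 1 style: count per (hour bucket, issue type)."""
    trend = Counter()
    for e in events:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(e["time"]).replace(
            minute=0, second=0, microsecond=0
        )
        trend[(hour.isoformat(), e["issue"])] += 1
    return trend


most_common_issue = count_by_issue(events).most_common(1)[0][0]
hourly = trend_by_hour(events)
```

In this sample, the most common issue is the log shipping timeout, which is the kind of signal the pie chart is meant to surface at a glance.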

Panel 3: User action and next steps

The third panel provides more details about the Alert, for example: 

  • Alert Details: visibility into what the issue is, for example, a log shipping timeout occurred;

  • Alert Description: the cause of the replication issue;

  • Alert User Action: action to be taken to resolve the issue. 

HANA — Diagnostic Files

KPIs:

Panel 1: Top 10 Files Based on Size. The highlighted box indicates that a dump was written for a log shipping timeout.

If the primary system does not receive the acknowledgment for a sent log buffer within the time defined by logshipping_timeout, it closes the connection to the secondary system to continue data processing. This is done to prevent the primary system from blocking transaction processing if there is a hang situation on the connection to the secondary system. 
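The decision the primary makes can be sketched as a small Python function. This only models the behavior described above and is not SAP HANA code; the function name `primary_action`, its return values, and the timeout of 30 seconds used in the examples are illustrative assumptions.

```python
from typing import Optional


def primary_action(
    sent_at: float,
    ack_at: Optional[float],
    now: float,
    logshipping_timeout: float,
) -> str:
    """Decide what the primary does for one sent log buffer.

    All times are in seconds. ack_at is None while no acknowledgment
    has arrived from the secondary.
    """
    if ack_at is not None and ack_at - sent_at <= logshipping_timeout:
        return "acknowledged"
    if now - sent_at > logshipping_timeout:
        # Stop waiting so transaction processing is not blocked.
        return "close_connection"
    return "waiting"


# Illustrative timeout value; check the value configured on your system.
TIMEOUT_S = 30.0
in_time = primary_action(sent_at=0.0, ack_at=2.0, now=2.0, logshipping_timeout=TIMEOUT_S)
still_waiting = primary_action(sent_at=0.0, ack_at=None, now=10.0, logshipping_timeout=TIMEOUT_S)
timed_out = primary_action(sent_at=0.0, ack_at=None, now=31.0, logshipping_timeout=TIMEOUT_S)
```

Once the connection is closed, the secondary is no longer in sync, which is why a log shipping timeout is worth alerting on even though the primary keeps running.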

HANA System Replication Mini Checks 

Mini Checks provide insight into several replication KPIs and metrics. The thresholds can be configured as required to trigger real-time alert notifications.
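Configurable thresholds of this kind can be illustrated with a short Python sketch. The metric names, default limits, and the `evaluate` function are hypothetical and are not the actual Mini Check definitions; they only show how per-metric thresholds can be overridden and checked.

```python
from typing import Optional

# Hypothetical default limits; real thresholds depend on the system.
DEFAULT_THRESHOLDS = {
    "shipping_backlog_mb": 1024.0,
    "replay_backlog_mb": 1024.0,
    "ship_delay_s": 60.0,
}


def evaluate(metrics: dict, thresholds: Optional[dict] = None) -> list:
    """Return (metric, value, limit) for every metric above its limit."""
    # Custom thresholds override the defaults metric by metric.
    limits = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    return [
        (name, value, limits[name])
        for name, value in metrics.items()
        if name in limits and value > limits[name]
    ]


metrics = {"shipping_backlog_mb": 2048.0, "replay_backlog_mb": 10.0}
default_alerts = evaluate(metrics)
strict_alerts = evaluate(metrics, {"replay_backlog_mb": 5.0})
```

With the defaults, only the shipping backlog raises an alert; tightening the replay threshold makes the same metrics raise two.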

With proactive alerting and monitoring dashboards to aid in system replication, RHONDOS has helped a variety of customers from the entertainment and beverage industries implement a reliable, high-performance database system that can be used for disaster recovery, data consolidation, data integration, and real-time analytics. A reliable and fast failover mechanism underpins all of this and helps ensure business continuity.

References: 

Blog Posts – by the SAP HANA Academy | SAP Blogs 

SAP HANA System Replication | SAP Help Portal 

System Replication - SAP HANA - Support Wiki 
