WAN backups failing

SUMMARY

This article covers backups that run over a WAN, or across an ISP link through firewalls.

ISSUE

Backups that pass through a layer 3 gateway firewall, and especially those that run over an ISP-hosted link, can fail with a network connection stopped error or with verify failures (both checksums are not present, which indicates the backup did not complete). Backup timeouts may also be reported.

For more detailed information on backup failures and performance issues, see Common Backup Failures and Performance Issues.

RESOLUTION

Recommendations / Best practices
1. Windows systems must run agent 7.5 or newer. That release tunes the TCP socket timeouts so that sockets stay open during periods of inactivity (for example, when a VSS snapshot takes a long time).

2. The size and daily change rate of each protected server or VM must not exceed the following limits for the given network speed:
Network speed               Max individual server or combined backup thread size    Max daily change rate
Below T3                    Not supported
T3 (45 Mbps)                300 GB                                                  5%
Fast Ethernet (100 Mbps)    800 GB                                                  7%
OC-24 (1.24 Gbps)           See appliance sizing guide                              See appliance sizing guide
* The limits above assume that 100% of the bandwidth is allocated, with no filters, throttles, or packet inspection.

3. To minimize the amount of data sent over the WAN / MAN, incremental forever must be the default strategy, leveraging the change journal for incremental backups. This avoids scanning the file system for changes and minimizes the idle time on connections, which can cause routers to drop them.

4. The file count on protected servers must not exceed 1 million files per TB, to minimize the impact of network latency on the NTFS traversal overhead.

5. The Technical Audit must capture the WAN / MAN bandwidth available (uplink and downlink) after QoS and any throttling the customer may have.

6. The round trip time (RTT), as reported by ping, must be 20 ms or less. This value must be captured in the Technical Audit.

7. Packet loss must be less than 1.5%.

8. Any QoS settings that report packet loss, retransmits, network latency, or jitter must be captured in the Technical Audit (TA), if possible.

9. It is recommended that full system restores or BMR restores be done locally at the site of the appliance instead of doing so over the WAN.

10. If the customer has a WAN with high latency, it may be advisable to reduce the MUX concurrency, either through appropriate backup scheduling or by changing the concurrency count for the device.

11. Only Windows (including applications), Linux variants, VMware, and Hyper-V are supported at this time.
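
The numeric limits in the recommendations above can be checked mechanically while filling in a Technical Audit. The following is a minimal sketch in Python, not part of the product: the names TIERS, LinkAudit, check_link, and incremental_transfer_minutes are hypothetical, and the time estimate assumes decimal units (1 GB = 8000 Mbit) and 100% of the link, as the table footnote does.

```python
from dataclasses import dataclass

# Limits transcribed from the table above. The tier keys are hypothetical labels.
# Each value: (link speed in Mbit/s, max protected size in GB, max daily change rate)
TIERS = {
    "T3": (45, 300, 0.05),
    "FastEthernet": (100, 800, 0.07),
}
MAX_RTT_MS = 20.0           # recommendation 6
MAX_PACKET_LOSS_PCT = 1.5   # recommendation 7


@dataclass
class LinkAudit:
    tier: str
    server_size_gb: float
    daily_change_rate: float  # fraction, e.g. 0.05 for 5%
    rtt_ms: float
    packet_loss_pct: float


def check_link(audit):
    """Return a list of human-readable violations (empty if none)."""
    _, max_size_gb, max_change = TIERS[audit.tier]
    problems = []
    if audit.server_size_gb > max_size_gb:
        problems.append(f"size {audit.server_size_gb} GB exceeds {max_size_gb} GB")
    if audit.daily_change_rate > max_change:
        problems.append(f"change rate {audit.daily_change_rate:.0%} exceeds {max_change:.0%}")
    if audit.rtt_ms > MAX_RTT_MS:
        problems.append(f"RTT {audit.rtt_ms} ms exceeds {MAX_RTT_MS} ms")
    if audit.packet_loss_pct >= MAX_PACKET_LOSS_PCT:
        problems.append(f"packet loss {audit.packet_loss_pct}% is not below {MAX_PACKET_LOSS_PCT}%")
    return problems


def incremental_transfer_minutes(audit):
    """Best-case minutes to send one day's changed data over the link."""
    speed_mbps, _, _ = TIERS[audit.tier]
    changed_gb = audit.server_size_gb * audit.daily_change_rate
    return changed_gb * 8000 / speed_mbps / 60


at_limit = LinkAudit("T3", 300, 0.05, rtt_ms=18, packet_loss_pct=0.2)
too_slow = LinkAudit("T3", 300, 0.05, rtt_ms=35, packet_loss_pct=2.0)
print(check_link(at_limit))
print(check_link(too_slow))
print(round(incremental_transfer_minutes(at_limit), 1))
```

Under these assumptions, a 300 GB server with a 5% daily change rate on a T3 needs about 44 minutes of uninterrupted connection for each incremental, which is the window during which the link must not drop.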

CAUSE

Factors affecting backups / restores
There are a number of factors that affect the performance of a backup over a WAN / MAN:

1. Network resilience:
To perform backups of any sizeable nature, the network connection between the protected asset and the backup appliance must not drop while the backup is running. The current backup method does not support fail / resume capabilities, which means that the connection must stay alive for the duration of the backup.

2. Network latency:
If the network latency between the protected asset and the backup appliance is high, there is a chance that the connection may be reset due to timeouts.

3. Change rate:
The amount of data being transferred over the WAN / MAN must be minimized to decrease the probability of failure due to network connection drops.

4. OS being backed up:
Some protected operating systems are more resilient to inactivity on the TCP channels than others. For example, Linux variants have proven more resilient than Windows when protected over a WAN / MAN.
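
To illustrate what resilience to inactivity on a TCP channel can look like, the sketch below enables generic TCP keepalive on a socket, so that an idle connection still exchanges small probe packets and is less likely to be aged out by an intermediate router or firewall. This is an assumption-laden illustration only: this article does not document how the agent tunes its sockets, and the function name enable_keepalive and the timing values are hypothetical.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive so an idle connection still sends probe packets.

    The timing values are illustrative defaults, not values used by the agent.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform specific, so set each one
    # only where this Python build exposes the constant.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection drops
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_on)
```

Keepalive probes only help when the idle interval is shorter than the idle timeout of every device on the path, so the firewall and router timeouts are worth recording alongside the other network measurements.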

NOTES

What is not applicable
This document does NOT apply to endpoint backups, which use non-persistent connections. Endpoints are protected differently from servers: the endpoint has to initiate or resume a backup whenever it is within network reach of the backup server. This is the opposite of how servers, hypervisors, and similar assets are protected, where the backup appliance controls job scheduling.
