Service Reliability and Availability
- Martin Mand
- Tonis Danilson
- Helen Allas (Deactivated)
Using the Heartbeat Service
We provide a heartbeat service that calls the heartbeat method of your Data Receiver every 10 seconds. Monitoring the heartbeat service allows us to alert you by email when there is a connection issue.
Your Data Receiver must also monitor heartbeats. If no heartbeat is received for more than 25-30 seconds (that is, 2-3 consecutive lost heartbeats), your trading system must suspend trading on any markets created or managed by our InPlay or PreMatch services, to prevent bettors from placing bets at out-of-date prices. Once the connection is restored, markets should remain suspended until new prices are delivered; reopening them before new prices are received could result in out-of-date prices being offered to customers.
You may wish to have different market suspension behaviours for pre-match and in-play markets to take into account the differing levels of risk. For example, you may be happy to leave your pre-match markets open for 2-3 minutes in the event of a connection failure but you should consider suspending in-play markets after no more than 25-30 seconds.
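The suspension rules above can be sketched as a small watchdog on the Data Receiver side. This is only an illustration under stated assumptions: the class name, the service keys "inplay" and "prematch", and the exact thresholds (30 seconds and 180 seconds) are hypothetical choices based on the ranges mentioned above, not part of our API.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds (seconds of heartbeat silence before suspending).
# They follow the guidance above; choose values that match your own risk policy.
SUSPEND_AFTER_SECONDS = {
    "inplay": 30.0,     # in-play: no more than 25-30 seconds
    "prematch": 180.0,  # pre-match: for example 2-3 minutes
}


@dataclass
class HeartbeatWatchdog:
    """Tracks the time of the last heartbeat received by the Data Receiver."""

    last_heartbeat: float  # timestamp in seconds, e.g. from time.monotonic()

    def on_heartbeat(self, now: float) -> None:
        # Call this from the heartbeat method of your Data Receiver.
        self.last_heartbeat = now

    def services_to_suspend(self, now: float) -> List[str]:
        # Return the services whose markets should currently be suspended.
        silence = now - self.last_heartbeat
        return sorted(
            service
            for service, limit in SUSPEND_AFTER_SECONDS.items()
            if silence > limit
        )
```

In practice a periodic task would call `services_to_suspend(time.monotonic())` every few seconds, suspend the returned services, and raise an alert to your technical team.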
Automatic heartbeat monitors should also be used to alert your technical team when a connection problem occurs so that response and resolution times are minimised. You will also have access to our Graylog monitoring application with several preconfigured feeds that will be set up to send email-based alerts in case any of the feeds triggers an alert condition.
Scheduled service downtime on your side
Please inform our Support team of any scheduled downtime or release on your side so that we can add this information to our monitoring suite and avoid sending needless disconnection warnings.
Verifying Message Delivery
Messages are delivered to the Data Receiver across the internet. The acknowledgement within the HTTP protocol tells us that each message has been received so that, if a message is lost, we are able to take appropriate steps:
- We will make one additional attempt to send any message that has not been acknowledged
- The PreMatch system has a regular round-up job that reviews all events and markets at regular intervals and sends updates for any items found to be out-of-date (for example because messages were lost)
- The InPlay system will queue updates so that when the system re-establishes connection after an outage, the latest changes to markets or events will be delivered through conflation. If the connection is restored after the event has concluded, we will send only the ResultSet messages
Once the connection is restored after an extended interruption, only the latest updates will be sent. In other words, updates that have been superseded by later changes in the same market will not be sent. ResultSet Updategrams are cached until they can be delivered.
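To illustrate what conflation means for the messages you receive after an outage, here is a minimal sketch. It is not our implementation: the class, the method names and the "Update" message type label are hypothetical, and only the behaviour (latest update per market wins, ResultSet messages are always kept) is taken from the description above.

```python
class ConflatingQueue:
    """Keeps only the latest pending update per market; results are never dropped."""

    def __init__(self):
        self._latest = {}   # market_id -> most recent update payload
        self._results = []  # every ResultSet message, in arrival order

    def add(self, market_id, message_type, payload):
        if message_type == "ResultSet":
            # Result messages are cached until they can be delivered.
            self._results.append((market_id, payload))
        else:
            # A newer update for the same market supersedes the older one.
            self._latest[market_id] = payload

    def drain(self):
        # Called once the connection is restored: return what is still
        # worth sending, then empty the queue.
        pending = [(m, "Update", p) for m, p in self._latest.items()]
        pending += [(m, "ResultSet", p) for m, p in self._results]
        self._latest.clear()
        self._results.clear()
        return pending
```

The practical consequence for your Data Receiver is that it should not expect to see every intermediate price after an outage, only the most recent state of each market plus any results.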
Communication Interruptions
We maintain a record of the state of all markets and events being managed for a customer. This means that we know which markets should be open, which should be closed and what their prices should be.
In the event of a communication problem, for example one that stops messages from being delivered for 10 minutes, the expected actions are:
- We will continue to send heartbeat and Updategram messages
- The failure of the heartbeat:
  - Will trigger an alert to our Support team, who will contact you to report the issue; you will also be alerted by the automated Graylog monitoring tool
  - Should be used by you to suspend all markets managed by our services
Once communication is restored, our services will resume updating your events and markets. You must leave markets suspended until you receive an updated price.
Price updates generated during the communication problem will be discarded because they are out of date. When data transfer resumes, the latest odds will be produced and sent. Market creation or result messages will be sent again, if required, once the problem is resolved.
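The recovery rule (keep markets suspended after reconnection until a fresh price arrives) can be sketched on the receiving side as follows. The class and method names are hypothetical, and the sketch assumes that every price update received after reconnection is for a market that should be open.

```python
class MarketGate:
    """Tracks which markets may accept bets around a connection outage."""

    def __init__(self, market_ids):
        self._state = {market_id: "open" for market_id in market_ids}

    def on_connection_lost(self):
        # Heartbeat failure: suspend every market managed by the feed.
        for market_id in self._state:
            self._state[market_id] = "suspended"

    def on_connection_restored(self):
        # Deliberately no state change: reconnection alone is not enough,
        # because the last known prices may be out of date.
        return

    def on_price_update(self, market_id):
        # A fresh price has arrived, so this market can be reopened.
        self._state[market_id] = "open"

    def is_open(self, market_id):
        return self._state.get(market_id) == "open"
```

Reopening is decided per market, so a market that receives no new price after the outage stays suspended.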
Highly available services
Our network and hardware have been configured to ensure that they are highly available so that in the event of an equipment failure the service itself is not interrupted. The system architecture maximises service availability by using clustered servers, load-balanced machines, redundant backups and high-availability software components as appropriate.
Monitoring the Service
Our Support team monitors the heartbeat service to ensure that we are always able to connect to your Data Receiver. If the connection fails, the heartbeat monitor service notifies our Support team so that they can investigate immediately. The Support team will notify your technical support team of the fault by email.
Monitoring Tools
We use both publicly available applications (Graphite, Graylog etc.) and tools developed in-house to ensure that the system is comprehensively monitored and to alert support staff via appropriate means.
Scheduled service upgrades
We release software updates to Production every two weeks on a Monday, usually between 7 AM UTC and 10 AM UTC, followed by a UAT release the next day.
On off-weeks, we perform Infrastructure Upgrade operations between 8 AM UTC and 11 AM UTC.
Unless stated otherwise, our service will be unavailable during both releases and infrastructure upgrades. We will always choose the best possible window for this work to minimise the impact on service availability.