Capacity not being released

Resolved
Aug 19 at 06:00pm CEST

Incident Report - Capacity not released #180822-01 / #111002

Introduction

On the 18th of August 2022, we became aware that capacity was not being released properly after uncompleted bookings, causing timeslots to fill up. As a result, customers were no longer able to book. The full incident report can be found below.

In future correspondence this incident will be referred to as #180822-01 / #111002 (Capacity not released).

We hope this information is useful to you. If you have any questions regarding this incident, please consult your contact person directly.

Executive Summary

A scheduled deployment introduced a bug that prevented capacities from being released. As a result, the affected job, which is responsible for releasing capacities on a schedule, no longer executed all necessary commands.
The code that caused the interruption was implemented as part of a new feature called "partner block capacity", which is meant to allow individual capacity allocation for each partner. The nature of the exception allowed the code block to run to completion while skipping a crucial set of commands intended to verify the successful completion of the task.
Because the job still returned an exit code indicating success, it took the team longer to identify the problem.
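
The report does not include the affected code, so the following is only an illustrative sketch of this general failure mode, not the actual implementation. All names (`ReleaseError`, `release_expired_holds`, `run_job_swallowing`, `run_job_strict`) are hypothetical. It shows how a caught exception can skip a verification step while the job still reports success, and how a stricter variant surfaces the failure through its exit code.

```python
class ReleaseError(Exception):
    """Hypothetical error raised when a hold cannot be released."""


def release_expired_holds(holds, fail_on=None):
    """Release each hold in order; raise if the hypothetical failing hold is hit."""
    released = []
    for hold in holds:
        if hold == fail_on:
            raise ReleaseError(f"cannot release {hold}")
        released.append(hold)
    return released


def run_job_swallowing(holds, fail_on=None):
    """Anti-pattern: the exception is swallowed, verification never runs,
    and the job still returns exit code 0."""
    verified = False
    try:
        released = release_expired_holds(holds, fail_on)
        # This line is skipped whenever the call above raises.
        verified = len(released) == len(holds)
    except ReleaseError:
        pass  # error is silently ignored
    return 0, verified


def run_job_strict(holds, fail_on=None):
    """Safer variant: a failure or a failed verification yields a non-zero exit code."""
    try:
        released = release_expired_holds(holds, fail_on)
    except ReleaseError:
        return 1, False
    verified = len(released) == len(holds)
    return (0 if verified else 1), verified
```

In the first variant a monitoring system that only looks at the exit code sees a healthy job, which matches the delayed detection described above.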

Prioticket has taken action to implement additional safeguards that reduce the risk of similar occurrences in the future.

Impacted services
- Capacity Manager
- Guest Manifest
- Calendars
- Bookings

Actions leading up to the incident

A deployment to a critical infrastructure component was performed without sufficient post-deployment testing. Although automated tests passed and the post-deployment results were verified manually, our team did not identify a problem with the release.
Our investigation found that the behaviour could not have been reproduced on our testing and staging environments because server configuration differs between environments. This prevented us from finding the issue at an earlier stage.

Existing safeguards

Capacity management is one of the core processes within our reservation system and therefore we have numerous checks in place to prevent, detect and mitigate availability issues.

Some common checks include:

- Auto-detect de-synchronization between nodes.
- Auto-detect capacity mismatches between modules.
- Auto-detect capacity mismatches between capacity count and booking count.
- Auto-detect overbookings (when explicitly disallowed).
- Auto-detect server degradation (connectivity/performance).

Our system performs these automated checks periodically to make sure that there are no differences between the total number of bookings made and the capacity count shown to customers.

Besides continuous monitoring, we also have checks that run in batches at periodic intervals, such as the booking count validator. These checks run every few hours to make sure potential gaps are located and reported to our team. Each batch contains a subset of capacities, but each capacity is checked at least once every 24 hours.
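
One way to guarantee that every capacity is covered within 24 hours is to split the capacities round-robin over the runs in a day. The sketch below is an assumption about how such batching could work, not a description of the booking count validator itself; `batch_for_run`, `run_index` and `runs_per_day` are hypothetical names, and the number of runs per day is not stated in this report.

```python
def batch_for_run(capacity_ids, run_index, runs_per_day):
    """Return the subset of capacities to check in a given run.

    Capacities are sorted and assigned round-robin to `runs_per_day` slots,
    so any `runs_per_day` consecutive runs together cover every capacity.
    """
    if runs_per_day <= 0:
        raise ValueError("runs_per_day must be positive")
    slot = run_index % runs_per_day
    ordered = sorted(capacity_ids)
    return [cid for i, cid in enumerate(ordered) if i % runs_per_day == slot]
```

With a run every four hours, for example, `runs_per_day` would be 6 and each run would check roughly one sixth of all capacities.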

Unfortunately, the above safeguards were implemented with overbooking and sales count mismatches in mind. At the time, blocked capacity, which is responsible for temporary holds in the shopping cart, was not taken into account.
For that reason, even with overbooking and sales mismatch checks in place, we were unable to detect that the block count was not being released.
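
A check that covers this gap could look for temporary holds that have outlived their expected lifetime. The following is a minimal sketch under stated assumptions: the `Hold` record, the `find_stale_holds` and `stale_quantity_by_timeslot` helpers, and the maximum hold age are all hypothetical and do not reflect the actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Hold:
    """Hypothetical record of capacity blocked by a shopping cart."""
    timeslot_id: str
    quantity: int
    created_at: datetime


def find_stale_holds(holds, now, max_age):
    """Return holds older than `max_age` that should already have been released."""
    return [h for h in holds if now - h.created_at > max_age]


def stale_quantity_by_timeslot(stale_holds):
    """Sum the blocked quantity per timeslot so an alert can show the impact."""
    totals = {}
    for hold in stale_holds:
        totals[hold.timeslot_id] = totals.get(hold.timeslot_id, 0) + hold.quantity
    return totals
```

A non-empty result would indicate that the release job is not doing its work, regardless of the exit code it reports.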

Actions taken

During the incident we manually triggered the release of all capacities while searching for the root cause of the problem. Once it became apparent that the problem was caused by a specific deployment, we immediately reverted to a previous release.
In the days following this incident we will implement automated checks that will verify that capacity is being released at all times. In combination with the existing overbooking and sales count comparisons we will ensure additional resilience in our capacity module.

We regret that this incident may have affected you. We take our obligation to safeguard system stability and reliability very seriously and are taking steps to help prevent this type of incident from reoccurring.

Updated
Aug 18 at 02:45pm CEST

Status: Resolved

All capacities have been released and we are monitoring the results.

Updated
Aug 18 at 01:30pm CEST

Status: Recovery

The fix has been implemented and capacities are being released.

Updated
Aug 18 at 12:05pm CEST

Status: Identified

The issue has been identified and a fix is being implemented.

Created
Aug 18 at 10:30am CEST

Status: Investigating

Prioticket is investigating an issue where capacity is not being released. We are working to analyse and mitigate this problem. More updates to follow shortly.