How to Fix Replication Slot Performance or Errors?
How can we fix any PostgreSQL performance problems (or errors) from Logical Replication slots?
This applies to any PostgreSQL source connector that uses the logical replication change data capture (CDC) method, meaning any connection for which a replication slot object has been created in the PostgreSQL database.
PostgreSQL, like any other database, has settings that control its behavior. In this scenario, several of those settings govern the logical replication process, including (but not limited to) the configurations associated with the Write Ahead Log (WAL). In addition, older and newer versions of PostgreSQL can require different solutions. Because there are many possible scenarios and several approaches for resolving these types of issues, the recommendations below are provided as general approaches to handling the topic.
A replication slot may have performance issues, setting issues, or configuration issues. These recommended troubleshooting steps are not in any particular order, and they should all be reviewed:
- Check that the slot exists
- Run the basic query from your favorite PostgreSQL IDE. We like PgAdmin here at DLH.io, so we'll reference that interface when we mention running queries, etc.
- SELECT COUNT(1) FROM pg_replication_slots WHERE slot_name = '<replication_slot>';
- Replace the <replication_slot> placeholder with the name of your replication slot. The default we use if one is not entered is datalakehouseio_replication_slot. If the result is 0, the replication slot you thought existed does not exist.
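To go beyond a simple existence check, the same catalog view can report whether the slot is active and how far it has advanced. This is a minimal sketch; the slot name shown is our default and may differ in your setup:

```sql
-- Inspect the slot's state, plug-in, and restart position.
-- Replace the slot name with your own if it differs.
SELECT slot_name,
       plugin,
       active,
       restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'datalakehouseio_replication_slot';
```

An inactive slot (`active = false`) that still exists is often the culprit when WAL keeps accumulating.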
- Check Version of PostgreSQL
- We have noticed that on PostgreSQL version 12 and lower, logical replication queries can stall or get stuck regardless of the output plug-in used (such as test_decoding or pgoutput).
- Our recommendation is to update to version 13+ in just about all cases, but the other options on this page could help overcome the issue with your database.
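To confirm which version your server is actually running before planning an upgrade, a quick query works from any client:

```sql
-- Returns the full version string, e.g. "PostgreSQL 13.x ...".
SELECT version();

-- Or just the version number on its own.
SHOW server_version;
```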
- Turn on Checkpoint Logging
- Ensure log_checkpoints, a boolean setting, is set to on in the database. This way you can track the restarts and log points of server activity, especially as it relates to replication and WAL activity.
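Checking and enabling the setting can be done without editing postgresql.conf by hand. This is a sketch, assuming you have sufficient privileges; pg_reload_conf() applies the change without a restart:

```sql
-- Confirm the current value.
SHOW log_checkpoints;

-- Enable checkpoint logging and reload the configuration.
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```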
- Check Sizing
- If the size of the replication slots is very large (e.g., tens or even hundreds of GB), or if it is hitting the max_wal_size limit, there could be general issues
- Use this query on your system to get basic sizing and stats to better inform you about the configurations on which logical replication relies:
- SELECT name, setting FROM pg_settings WHERE name LIKE '%wal_size%' OR name LIKE '%checkpoint%' ORDER BY name;
- If you are using version 13+ please compare the slot size to the max_slot_wal_keep_size limit and work with your DBA to determine what a configuration update and testing plan looks like for your database.
- If the max_wal_size value is fairly high and doesn't usually fill up during the logical replication changes captured, that is usually a good sign, but you will want to confirm this during your checkpoint_timeout window.
- Find the current LSN (A)
- SELECT pg_current_wal_insert_lsn(); (on PostgreSQL 9.x this function was named pg_current_xlog_insert_location())
- Wait for the length of the checkpoint_timeout value found (see above query), in seconds
- Find the final LSN (B) after waiting for that period of time
- SELECT pg_current_wal_insert_lsn();
- Get a rough estimate of what you need to change your WAL size to by taking the difference of the two values LSN(A) and LSN(B) with this command, then multiply by 3 or 3.5 depending on how much leeway you wish to give your system:
- SELECT pg_wal_lsn_diff('<LSN(B)_VALUE>', '<LSN(A)_VALUE>'); (named pg_xlog_location_diff() on PostgreSQL 9.x)
- Enter the real values retrieved above, without the <> brackets of course
- After multiplying the number by 3 or 3.5 you will have a sense of what the max_wal_size should look like
- Now repeat these basic steps above during a period of the day where you know traffic on the database is high and track the real-world values and update your max_wal_size accordingly.
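The measurement above can be sketched end to end in SQL (PostgreSQL 10+ function names; the placeholders stand in for the two LSNs you capture):

```sql
-- Step 1: capture LSN(A).
SELECT pg_current_wal_insert_lsn();

-- Step 2: wait for checkpoint_timeout seconds, then capture LSN(B).
SELECT pg_current_wal_insert_lsn();

-- Step 3: bytes of WAL generated in the window, shown human-readably.
-- Substitute the two LSNs you captured; the factor of 3 gives leeway
-- when sizing max_wal_size (use 3.5 for more headroom).
SELECT pg_size_pretty(pg_wal_lsn_diff('<LSN(B)_VALUE>', '<LSN(A)_VALUE>') * 3);
```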
- Check WAL Buffers
- We have seen cases where wal_buffers was set too low (or never adjusted) to align with the database's recent workloads. On machines with very high-activity data volumes, we've seen effectiveness when wal_buffers was set between 64MB and 128MB. The minimum recommendation is a wal_buffers value of 16MB.
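A sketch of checking and raising the setting; note that wal_buffers only takes effect after a server restart, and the 64MB value here is an example from the range above, not a universal recommendation:

```sql
-- Confirm the current value.
SHOW wal_buffers;

-- Raise it for a high-volume workload (requires a server restart to take effect).
ALTER SYSTEM SET wal_buffers = '64MB';
```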
- Set max_slot_wal_keep_size to Cover Roughly 2 Days
- On version 13+, max_slot_wal_keep_size caps how much WAL a replication slot can retain. A common target is enough WAL to cover roughly 2 days of activity, so change max_slot_wal_keep_size accordingly, working with your DBA.
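As a sketch, assuming your measurements from the sizing steps above suggest around 25GB of WAL generated per day, a 2-day retention cap would look like this (the 50GB figure is purely illustrative):

```sql
-- Cap slot WAL retention at roughly 2 days of generation (illustrative value).
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```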
- Lastly, DROP and Recreate the Logical Replication Slot
- Each PostgreSQL connection in DLH.io that uses CDC logical replication has a logical replication slot on the database; otherwise logical replication is not enabled. If there is an issue synchronizing data, our support team and your DBA should consider dropping the replication slot, recreating it, and then running a historical re-sync on the Sync Bridge to ensure all records are re-synchronized and no data is missing in the target. This operation may be needed whenever data is missing from any of the target tables, or after any of the other configuration changes above are made.
- Pause the Sync Bridge
- Drop the Replication Slot via your PostgreSQL IDE or command line
- Recreate the Replication Slot
- Confirm it exists: SELECT COUNT(1) FROM pg_replication_slots WHERE slot_name = '<replication_slot>';
- Un-Pause the Sync Bridge
- In Sync Bridges, click the Run Historical Re-Sync for the Sync Bridge
- Once complete, confirm data is processing and synchronizing
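The drop-and-recreate steps can be sketched in SQL. The slot name below is our default and the pgoutput plug-in is an assumption; use the slot name and plug-in your connector was configured with, and only run this while the Sync Bridge is paused:

```sql
-- Drop the existing slot (fails if it is still active; pause the consumer first).
SELECT pg_drop_replication_slot('datalakehouseio_replication_slot');

-- Recreate it as a logical slot with the same output plug-in as before.
SELECT pg_create_logical_replication_slot('datalakehouseio_replication_slot', 'pgoutput');

-- Confirm it exists again.
SELECT COUNT(1) FROM pg_replication_slots
WHERE slot_name = 'datalakehouseio_replication_slot';
```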
NB: these responses are in no way a substitute for a qualified PostgreSQL Database Administrator (DBA). If your team is not comfortable making changes, especially in a production environment, we advise contacting our support team for options to assist with configuring your database.
Your PostgreSQL database administrator (DBA) is best suited for the job.
Use your support credits with us by contacting your engagement manager or using the support portal.
It may be necessary to use the answers and instructions above in situations such as:
- Anytime you begin noticing that some data is not synchronizing with your target/destination on a timely basis.
- Anytime the synchronization of data from your PostgreSQL database tables noticeably takes longer than average
- When alerts regarding timeout are visible or you've been notified to address a possible issue with your source connection database
- When checking the replication slot on your PostgreSQL server, the slot seems stuck and does not return any response from the WAL
- Your replication slot size has reached a disproportionate level compared to the size of your database, for example it is in multiple GBs or TBs
- The size of the replication slot borders the max_wal_size configuration limit
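To spot the last two symptoms, you can measure how much WAL each slot is currently retaining and compare it against your max_wal_size and database size. This is a sketch using PostgreSQL 10+ function names:

```sql
-- How much WAL each replication slot is holding back, human-readable.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```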
As an example read-out of key attributes for a PostgreSQL server with low traffic that performs well with no logical replication issues, see the screenshot below. Your mileage will vary; use this only as a generic point of reference: