How to Fix Replication Slot Performance or Errors?
How can we fix any PostgreSQL performance problems (or errors) from Logical Replication slots?
This applies to any PostgreSQL source connector that uses the logical replication change data capture (CDC) method, meaning any connection for which a replication slot object has been created in the PostgreSQL database.
PostgreSQL, like any other database, has settings that control its behavior. In this scenario, several of those settings govern the logical replication process, including (but not limited to) the configurations associated with the Write Ahead Log (WAL). In addition, older and newer versions of PostgreSQL can require different solutions. Because there are many possible scenarios and several approaches for resolving these types of issues, the recommendations below are provided as general approaches to handling the topic.
A replication slot may have performance issues, setting issues, or configuration issues. These recommended troubleshooting steps are not in any particular order, and they should all be reviewed:
- Check that the slot exists
- Run the basic query from your favorite PostgreSQL IDE. We like PgAdmin here at DLH.io, so we'll reference that interface when we mention running queries, etc.
- SELECT COUNT(1) FROM pg_replication_slots WHERE slot_name = '<replication_slot>';
- Replace the <replication_slot> placeholder with the name of your replication slot. The default we use if one is not entered is datalakehouseio_replication_slot. If the result is 0, the replication slot you thought existed does not exist.
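To go beyond a simple existence check, the same catalog view can report whether the slot is active and how far it has advanced. This is a minimal sketch; the slot name shown is our default and may differ in your setup:

```sql
-- Inspect the slot's state, plug-in, and restart position.
-- Replace the slot name with your own if it differs.
SELECT slot_name,
       plugin,
       active,
       restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'datalakehouseio_replication_slot';
```

An inactive slot (`active = false`) that still exists is often the culprit when WAL keeps accumulating.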
- Check Version of PostgreSQL
- We have noticed that on PostgreSQL version 12 and lower, logical replication queries can stall or get stuck regardless of the output plug-in used (such as test_decoding or pgoutput).
- Our recommendation is to update to version 13+ in just about all cases, but the other options on this page could help overcome the issue with your database.
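To confirm which version your server is actually running before planning an upgrade, a quick query works from any client:

```sql
-- Returns the full version string, e.g. "PostgreSQL 13.x ...".
SELECT version();

-- Or just the version number on its own.
SHOW server_version;
```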
- Turn on Checkpoint Logging
- Ensure log_checkpoints, a boolean setting, is set to on in the database. This way you can track the restarts and log points of server activity, especially as it relates to replication and WAL activity.
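Checking and enabling the setting can be done without editing postgresql.conf by hand. This is a sketch, assuming you have sufficient privileges; pg_reload_conf() applies the change without a restart:

```sql
-- Confirm the current value.
SHOW log_checkpoints;

-- Enable checkpoint logging and reload the configuration.
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
```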
- Check Sizing
- If the size of the replication slots is very large (e.g., tens or even hundreds of GB), or if it is hitting the max_wal_size limit, there could be general issues
- Use this query on your system to get basic sizing and stats to better inform you about the configurations on which logical replication relies:
- SELECT name, setting FROM pg_settings WHERE name LIKE '%wal_size%' OR name LIKE '%checkpoint%' ORDER BY name;
- If you are using version 13+ please compare the slot size to the max_slot_wal_keep_size limit and work with your DBA to determine what a configuration update and testing plan looks like for your database.
- If the max_wal_size value is fairly high and doesn't usually fill up during the logical replication changes captured, that is usually a good sign, but you will want to confirm this during your checkpoint_timeout window.
- Find the current LSN (A)
- SELECT pg_current_wal_insert_lsn(); (on PostgreSQL 9.x this function was named pg_current_xlog_insert_location())
- Wait for the length of the checkpoint_timeout value found (see above query), in seconds
- Find the final LSN (B) after waiting for that period of time
- SELECT pg_current_wal_insert_lsn();
- Get a rough estimate of what you need to change your WAL size to by taking the difference of the two values LSN(A) and LSN(B) with this command, then multiply by 3 or 3.5 depending on how much leeway you wish to give your system:
- SELECT pg_wal_lsn_diff('<LSN(B)_VALUE>', '<LSN(A)_VALUE>'); (named pg_xlog_location_diff() on PostgreSQL 9.x)
- Enter the real values retrieved above, without the <> brackets of course
- After multiplying the number by 3 or 3.5 you will have a sense of what the max_wal_size should look like
- Now repeat these basic steps above during a period of the day where you know traffic on the database is high and track the real-world values and update your max_wal_size accordingly.
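The measurement above can be sketched end to end in SQL (PostgreSQL 10+ function names; the placeholders stand in for the two LSNs you capture):

```sql
-- Step 1: capture LSN(A).
SELECT pg_current_wal_insert_lsn();

-- Step 2: wait for checkpoint_timeout seconds, then capture LSN(B).
SELECT pg_current_wal_insert_lsn();

-- Step 3: bytes of WAL generated in the window, shown human-readably.
-- Substitute the two LSNs you captured; the factor of 3 gives leeway
-- when sizing max_wal_size (use 3.5 for more headroom).
SELECT pg_size_pretty(pg_wal_lsn_diff('<LSN(B)_VALUE>', '<LSN(A)_VALUE>') * 3);
```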
- Check WAL Buffers
- We have seen cases where wal_buffers was set too low (or never adjusted) to align with the database's recent workloads. On machines with very high-activity data volumes, we've seen effectiveness when wal_buffers was set between 64MB and 128MB. The minimum recommendation is a wal_buffers value of 16MB.
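A sketch of checking and raising the setting; note that wal_buffers only takes effect after a server restart, and the 64MB value here is an example from the range above, not a universal recommendation:

```sql
-- Confirm the current value.
SHOW wal_buffers;

-- Raise it for a high-volume workload (requires a server restart to take effect).
ALTER SYSTEM SET wal_buffers = '64MB';
```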
- Set max_slot_wal_keep_size to Cover Roughly 2 Days
- On version 13+, max_slot_wal_keep_size caps how much WAL a replication slot can retain. A common target is enough WAL to cover roughly 2 days of activity, so change max_slot_wal_keep_size accordingly, working with your DBA.
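As a sketch, assuming your measurements from the sizing steps above suggest around 25GB of WAL generated per day, a 2-day retention cap would look like this (the 50GB figure is purely illustrative):

```sql
-- Cap slot WAL retention at roughly 2 days of generation (illustrative value).
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```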
- Lastly, DROP and Recreate the Logical Replication Slot
- Each PostgreSQL connection in DLH.io that uses CDC logical replication has a logical replication slot on the database; otherwise logical replication is not enabled. If there is an issue synchronizing data, our support team and your DBA should consider dropping the replication slot, recreating it, and then running a historical re-sync on the Sync Bridge to ensure all records are re-synchronized and no data is missing in the target. This operation may be needed whenever data is missing from any of the target tables, or after any of the other configuration changes above are made.
- Pause the Sync Bridge
- Drop the Replication Slot via your PostgreSQL IDE or command line
- Recreate the Replication Slot
- Confirm it exists: SELECT COUNT(1) FROM pg_replication_slots WHERE slot_name = '<replication_slot>';
- Un-Pause the Sync Bridge
- In Sync Bridges, click the Run Historical Re-Sync for the Sync Bridge
- Once complete, confirm data is processing and synchronizing
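The drop-and-recreate steps can be sketched in SQL. The slot name below is our default and the pgoutput plug-in is an assumption; use the slot name and plug-in your connector was configured with, and only run this while the Sync Bridge is paused:

```sql
-- Drop the existing slot (fails if it is still active; pause the consumer first).
SELECT pg_drop_replication_slot('datalakehouseio_replication_slot');

-- Recreate it as a logical slot with the same output plug-in as before.
SELECT pg_create_logical_replication_slot('datalakehouseio_replication_slot', 'pgoutput');

-- Confirm it exists again.
SELECT COUNT(1) FROM pg_replication_slots
WHERE slot_name = 'datalakehouseio_replication_slot';
```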
NB: these responses are in no way a substitute for a qualified PostgreSQL Database Administrator (DBA). If your team is not comfortable making changes, especially in a production environment, we advise contacting our support team for options to assist with configuring your database.
Your PostgreSQL database administrator (DBA) is best suited for the job.
Use your support credits with us by contacting your engagement manager or using the support portal.
It may be necessary to use the answers and instructions above in situations such as:
- Anytime you begin noticing that some data is not synchronizing with your target/destination on a timely basis.
- Anytime the synchronization of data from your PostgreSQL database tables noticeably takes longer than average
- When alerts regarding timeout are visible or you've been notified to address a possible issue with your source connection database
- When checking the replication slot on your PostgreSQL server, the slot seems stuck and does not return any response from the WAL
- Your replication slot size has reached a disproportionate level compared to the size of your database, for example it is in multiple GBs or TBs
- The size of the replication slot borders the max_wal_size configuration limit
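To spot the last two symptoms, you can measure how much WAL each slot is currently retaining and compare it against your max_wal_size and database size. This is a sketch using PostgreSQL 10+ function names:

```sql
-- How much WAL each replication slot is holding back, human-readable.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```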
As an example read-out of key attributes for a PostgreSQL server with low traffic that performs well with no logical replication issues, see the screenshot below. Your mileage will vary; use this only as a generic point of reference: