Logical Replication Setup
Logical Replication is the main means for allowing PostgreSQL to have Change Data Capture (CDC) logic to track changes and retrieve only the delta/changes to your database tables synchronized. This increases speed of the synchronization so that any system synchronizing data need not retrieve all records from the database tables all the time on each synchronization frequency event. To set up logical replication on your Postgres database please follow the instructions on this page.
DLH.io supports both pgoutput and test_decoding logical replication plugin extensions types. pgoutput proves to provide the most granular selectivity of CDC replication capability. pgoutput allows control of individual tables of individual schemas to only be chosen for replication, as well as an option for all tables across all schemas in the database. The test_decoding plugin configuration only allows CDC replication control at the database level (thus, all tables of all schemas) which can be overkill for some data pipeline requirements. Both pg_output and test_decoding replication allow for insert, updates, and deletes to be tracked where XMIN does not track deletes.
Use the links below for instruction on how to set up your respective logical replication mode:
Because logical replication is based on the core of the underlying postgresql system and WAL, there are some common analysis approaches that can be taken across all types of logical replication as described below.
This table provides the basic information about the logic replication you have created. When DataLakeHouse.io is retrieving from the slot, the active colum will typically be set to "t" instead of "f" for true and false repsectively.
There are several parameters for working with the pg_logical_slot_peek_changes function as further described here, https://pgpedia.info/p/pg_logical_slot_peek_changes.html, for testing the flow of WAL changes as identifying the LSNs.
Is not leveraged by DataLakeHouse.io as this relates to a streaming replication standby replication instance, where DataLakeHouse.io is on-demand frequency sync scheduled query-based.
Aiven.io has a good article on standard operation increase in disk space due to WAL, https://developer.aiven.io/docs/products/postgresql/concepts/pg-disk-usage.html
Run any of the main postgreSQL commands to check disk space on the instance:
- \l or \l+
- SELECT pg_size_pretty(pg_database_size('yourdbname'));
In a standard process where the DDL of your postgreSQL tables do not change much there is very little cause for failure other than not have a sync frequency that is not aligned with the volume of data your database tables produce commensurate with the disk space of the database/server. Since Logical Replication does not track DDL changes an error could occur if a DDL change is made but downstream impacts are not considered. In the case of DataLakeHouse.io a manual change may be required on your target system in order to reflect a DDL change on your source system.
When errors, if any occur, please report them immediately by opening up a support ticket. Methods used on your source database side to clear issues may include things such as using:
- pg_replication_origin_advance, to skip the transaction that is failing
- (https://www.postgresql.org/docs/9.4/catalog-pg-replication-slots.html)PostgreSQL Replication Slots Fields/Columns Available