FAQ: How CDC refresh works

Q: Does CDC need to stop mirroring when performing a REFRESH operation?

A: Before refreshing a set of tables from source to target, the subscription must
be stopped if currently in mirroring status.
Once the subscription is stopped, a number of tables can be selected for a
REFRESH operation.

Q: What happens when mirroring is restarted for the subscription which
initiated the REFRESH operation?

A: When the REFRESH operation is completed successfully by CDC, the product
can be restarted for mirroring. CDC will then process the backlog of changes
made to the database since mirroring was stopped (which was just before the
REFRESH operation began).

Q: Does CDC perform REFRESH by transferring multiple tables in parallel?

A: CDC can REFRESH a set of selected tables as one operation, but will only
process a REFRESH for one table at a time within a single subscription. To
perform parallel refresh, multiple subscriptions can be used.
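
The scheduling rule above (one table at a time within a subscription, with parallelism only across subscriptions) can be illustrated with a minimal sketch. This is not CDC code: `refresh_table` is a hypothetical stand-in for whatever actually triggers a table refresh, and the subscription and table names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def refresh_table(subscription, table, log):
    # Hypothetical placeholder: a real refresh would be driven by CDC itself.
    log.append((subscription, table))


def refresh_subscription(subscription, tables, log):
    # Within one subscription, tables are refreshed strictly one at a time.
    for table in tables:
        refresh_table(subscription, table, log)


def refresh_in_parallel(plan):
    # plan maps a subscription name to its ordered list of tables.
    # Each subscription gets its own worker, so subscriptions run concurrently.
    logs = {name: [] for name in plan}
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = [
            pool.submit(refresh_subscription, name, tables, logs[name])
            for name, tables in plan.items()
        ]
        for future in futures:
            future.result()  # re-raise any error from a worker
    return logs
```

Splitting a large set of tables over several subscriptions in `plan` is what gives the parallelism; a single subscription holding all tables would still refresh them one after another.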

Q: Can a subset of tables be part of REFRESH, or do all tables and/or
subscriptions need to be part of a REFRESH operation?

A: CDC will REFRESH a set of tables selected within a single subscription as
one operation which will run to completion. Each table is refreshed individually
until all selected tables have finished REFRESH. Since this operation applies
within a single subscription, other subscriptions are not affected: they may
continue mirroring data for different tables, or refreshing different tables,
as required.

Q: What source database configuration or table specification would
prevent the use of REFRESH?

A: CDC can support REFRESH on any tables which are supported for mirroring.

Q: How does CDC retrieve data for the REFRESH operation?

A: For a REFRESH operation, CDC queries the source database table, which
causes a table scan to provide the rows.

Q: Is there such a thing as tables "too large" to REFRESH?

A: Practically, yes. The CDC source engine REFRESH query may cause the
source database to perform significant read I/O during the REFRESH. The
database may take many hours to perform a "table scan" operation to provide
rows. Scanning large tables can cause significant disk I/O contention for
databases which have changing data, or maintenance operations which are
active during the REFRESH operation.

Q: Are there alternatives for REFRESH?

A: Customers use database extract / import, backup / restore, or "insert into
select * from (remote database linked table)" operations instead of REFRESH
when these options are more suitable.
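
As an illustration of the "insert into select * from" alternative, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite's ATTACH DATABASE as a stand-in for a database link to the source; the function name `copy_table` is hypothetical, and a real source/target pair would use its own link mechanism and SQL dialect.

```python
import sqlite3


def copy_table(source_path, target_conn, table):
    # ATTACH plays the role of the "remote database linked table" here.
    target_conn.execute("ATTACH DATABASE ? AS src", (source_path,))
    # Empty the target copy, then reload it in a single set-based statement.
    target_conn.execute(f'DELETE FROM main."{table}"')
    target_conn.execute(
        f'INSERT INTO main."{table}" SELECT * FROM src."{table}"'
    )
    target_conn.commit()
    target_conn.execute("DETACH DATABASE src")
```

The table name is interpolated into the SQL text, so this sketch is only suitable for trusted, fixed table names.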

Q: Does CDC perform an ORDER BY sort when retrieving data for REFRESH?

A: During standard REFRESH, no ORDER BY is used, so the database is free to
return data in whatever order it chooses.

During "Differential" REFRESH operations, CDC must query using an ORDER BY on
the table keys (as mapped in CDC) to sort the source and target tables in order
to determine the differences between the two tables.

Q: Does CDC use bulk extract utilities to obtain rows for REFRESH?

A: CDC uses a SQL query to obtain rows during REFRESH. CDC does not use
bulk extract.

Q: Can CDC query a separate backup/copy/standby database rather than
the source database during a REFRESH to avoid read I/O on the source
database?

A: CDC queries the source database tables directly, and does not have an option
to redirect queries to a backup or standby database, or to tables stored in a
different database. The source table mapped for a subscription is the one
queried.

Q: Does CDC require source database tables to be quiescent during the
REFRESH?

A: CDC has "REFRESH while active" logic that allows REFRESH during periods
where the source database is processing changes (Insert, Update, Delete) to
tables involved in the REFRESH operation.

Q: What overhead does REFRESH place on CPU, disk I/O, etc.?

A: CDC uses very little CPU on the source system during standard REFRESH
operations. The majority of the processing is disk read I/O on the tables involved
in the REFRESH.

Q: How does CDC obtain a transactionally complete 'snapshot' for tables
that have active changes being performed on them?

A: Oracle specific: CDC opens a transactional read-only snapshot query on the
table being refreshed, which causes undo space in the source database to be
used for the duration of the query. The DBA should review this, as very large
tables (>50M rows) which are changed frequently (>100 changes per second) can
require larger amounts of undo space.

A: In general, CDC supports a capability known as "REFRESH while active". The
source database log position at the time of the source database REFRESH query
is stored as the "start point" in the CDC source metadata. When the REFRESH
completes for this table, the log position of the source database is stored as
the "end point" in the CDC source metadata. When mirroring is restarted for this
table, changes that were made between the start and end points are sent to the
target with an "in-doubt" flag. The CDC target applies changes during mirroring
as normal, but if an error occurs, it checks for the "in-doubt" flag and ignores
the error, because the row was already replicated during the refresh and the
change therefore caused a duplicate INSERT, a DELETE not found, or another
violation.
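
The "in-doubt" handling can be sketched as follows. This is an illustrative model only, not CDC's implementation: log positions are plain integers, the target table is a dict keyed by primary key, the start and end points are assumed to be inclusive, and all function and field names are hypothetical.

```python
class ApplyError(Exception):
    """Raised when a change conflicts with the current target contents."""


def flag_in_doubt(changes, start_point, end_point):
    # Source side: mark changes whose log position falls inside the
    # refresh window (bounds assumed inclusive for this sketch).
    return [
        dict(c, in_doubt=start_point <= c["log_pos"] <= end_point)
        for c in changes
    ]


def apply_change(table, change):
    op, key = change["op"], change["key"]
    if op == "INSERT":
        if key in table:
            raise ApplyError(f"duplicate key {key}")
        table[key] = change["row"]
    elif op == "UPDATE":
        if key not in table:
            raise ApplyError(f"row {key} not found")
        table[key] = change["row"]
    elif op == "DELETE":
        if key not in table:
            raise ApplyError(f"row {key} not found")
        del table[key]
    else:
        raise ValueError(f"unknown operation {op}")


def mirror(table, flagged_changes):
    # Target side: apply every change normally; tolerate errors only
    # for changes carrying the in-doubt flag.
    ignored = []
    for change in flagged_changes:
        try:
            apply_change(table, change)
        except ApplyError:
            if not change["in_doubt"]:
                raise
            ignored.append(change["log_pos"])
    return ignored
```

Changes outside the window still fail loudly, which is the point of recording the two log positions: only errors that the refresh itself could have caused are suppressed.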

Q: When does the target table get truncated for the refresh?

A: The CDC source sends a "START_REFRESH" message before selecting rows from
the table. The CDC target truncates the table when it receives the
"START_REFRESH" message. The CDC source then sends data records for the target
table, which are applied to completion or error.
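
The ordering described above (truncate on the start message, then apply rows) can be modeled with a short sketch. Only the "START_REFRESH" name comes from the answer; the "ROW" message kind, the tuple layout, and the function name are assumptions made for illustration.

```python
def apply_refresh_stream(target_rows, messages):
    # messages is an ordered iterable of (kind, payload) tuples.
    for kind, payload in messages:
        if kind == "START_REFRESH":
            # The target truncates only when told to, so the old rows
            # stay in place until the source is about to send new ones.
            target_rows.clear()
        elif kind == "ROW":
            target_rows.append(payload)
        else:
            raise ValueError(f"unexpected message kind: {kind}")
    return target_rows
```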

Q: Are there any recommendations for maintenance procedures or other
operations to be commenced before a REFRESH?

A: CDC will query the source database, leading to increased disk read I/O for
those tables which will be part of the REFRESH. CDC will apply the refreshed
rows to the target database tables. If you have database maintenance procedures
such as backup, re-index, or other disk intensive operations scheduled, these
may cause some level of disk I/O contention. The DBA team should review the
opportunity to schedule the REFRESH of large tables when it best suits the
source and target databases.

Q: How is data loaded into a DB2 UDB LUW database by the CDC DB2
target engine REFRESH?

A: CDC DB2 uses the DB2 bulk load utility to INSERT the refreshed rows into the
target database. This behavior can be changed to use a JDBC SQL based
INSERT operation if the customer decides not to use bulk load. Note however
that bulk load is by far the fastest method of loading refresh data into a target
database in most cases.

Q: How is data loaded into an Oracle database by the CDC Oracle target
replication engine during a REFRESH?

A: CDC Oracle uses the Oracle OCI DirectPathLoad bulk loader API to INSERT
the refreshed rows into the target database. The OCI DirectPathLoad API avoids
staging bulk load files on disk by utilizing in-memory loading. This behavior can
be changed to use a JDBC SQL based INSERT operation if the customer
decides not to use bulk load. Note however that bulk load is by far the fastest
method of loading refresh data into a target database in most cases.

Q: How is data loaded into a Teradata database by the CDC Teradata target
replication engine during a REFRESH?

A: CDC Teradata uses the Teradata FASTLOAD bulk load utility to INSERT the
refreshed rows into the target database. This behavior can be changed to use a
JDBC SQL based INSERT operation if the customer decides not to use bulk
load. Note however that bulk load is by far the fastest method of loading refresh
data into a target database in most cases.

Q: How is data loaded into DB2/z databases by the CDC target replication
engine during a REFRESH?

A: CDC uses the DB2 CLI API with SQL based batch INSERT operations to load the
target table.

Q: How is data loaded into DB2/400 (iSeries, IBM i) databases by the CDC
target replication engine during a REFRESH?

A: CDC DB2/400 populates the target table file directly using native DB2/400 I/O
operations which avoid the SQL libraries. This method has the highest
performance for loading data into the database.

Q: How is data loaded into other databases by the CDC target replication
engine during a REFRESH?

A: CDC uses the native database bulk load utility to INSERT the refreshed rows
into the target database. This behavior can be changed to use a JDBC SQL
based INSERT operation if the customer decides not to use bulk load. Note
however that bulk load is by far the fastest method of loading refresh data into a
target database in most cases.

Q: What order is used when performing a REFRESH with multiple tables?

A: The order in which individual tables are refreshed is based on the group
order, which is set via Management Console. If all tables have the same group
order, they are refreshed in the order in which they are stored in the CDC
metadata.
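
The ordering rule can be expressed as a stable sort. This is a sketch under two assumptions that the answer does not state: that a lower group order value is refreshed first, and that the input list is already in CDC metadata order. The function name and table names are hypothetical.

```python
def refresh_order(tables):
    # tables: list of (table_name, group_order) pairs in metadata order.
    # sorted() is stable, so tables sharing a group order keep their
    # metadata order relative to each other.
    return [name for name, _ in sorted(tables, key=lambda t: t[1])]
```

For example, giving parent tables a lower group order than their children yields a parent-first refresh sequence.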

Q: How does CDC support REFRESH for tables with referential integrity?

A: Use the Table Group order facility via Management Console to organize the
order of tables to REFRESH to keep within the constraints imposed on the
tables.

Note: at least with release 6.3, refresh of tables with referential integrity
(RI) is believed not to be supported with the default configuration. The user
would have to set some system parameters. See JIRA JUDB-1275 and JORA-1174.

Q: Does archive log only replication, read-only source database or other
CDC features affect the REFRESH operation?

A: There are no features of CDC which prevent restart of mirroring following a
REFRESH.

Q: What method is faster, non-logged bulk load or SQL INSERT based
using JDBC / CLI?

A: CDC will attempt to use the native bulk load interface supported by a
particular database platform and release. Bulk load operations are typically not
logged, and many databases have implemented shortcuts that load data into tables
faster than SQL based INSERT operations can.

Q: When does CDC switch from bulk load to the SQL method?

A: The answer depends on platform, configuration and settings:

The user can disable bulk loading through a system parameter.

When table mapping is set for Live Audit, CDC does not bulk load as the audit
table needs to be appended to and not re-loaded.

The presence of LOB columns will cause CDC target to use JDBC loader on
some platforms such as SQL Server where the bulk load interface does not
support LOB. Other platforms such as Oracle support LOB during OCI
DirectPathLoad.

The JDBC apply will be used if user exits are configured as CDC cannot know if
the user exit is referencing data in the target table which would necessitate that
rows be inserted in transactions and therefore immediately visible to the user exit
code.

The JDBC apply will be selected if target columns have non-ASCII character
names on platforms such as SQL Server where the bulk load interface has such
limitations.

CDC for Informix does not use a bulk loader.
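
The rules listed in this answer can be summarized as a small decision function. This is a sketch only: it encodes just the conditions named above, the platform strings and field names are hypothetical, and SQL Server is the only platform listed for the LOB and non-ASCII limitations because it is the only one the answer names.

```python
from dataclasses import dataclass

# Assumed platform identifiers; only the platforms named in the answer appear.
NO_BULK_LOADER = {"informix"}
BULK_WITHOUT_LOB_SUPPORT = {"sqlserver"}
BULK_WITHOUT_NON_ASCII_NAMES = {"sqlserver"}


@dataclass
class RefreshContext:
    platform: str
    bulk_load_disabled: bool = False      # system parameter set by the user
    live_audit: bool = False              # table mapping is Live Audit
    has_lob_columns: bool = False
    has_user_exits: bool = False
    has_non_ascii_column_names: bool = False


def choose_loader(ctx):
    # Returns "sql" when any listed condition rules out bulk load.
    if ctx.platform in NO_BULK_LOADER:
        return "sql"
    if ctx.bulk_load_disabled or ctx.live_audit or ctx.has_user_exits:
        return "sql"
    if ctx.has_lob_columns and ctx.platform in BULK_WITHOUT_LOB_SUPPORT:
        return "sql"
    if (ctx.has_non_ascii_column_names
            and ctx.platform in BULK_WITHOUT_NON_ASCII_NAMES):
        return "sql"
    return "bulk"
```

Platform defaults described elsewhere in this FAQ (for example the SQL based default on DB2/z) are deliberately not modeled here.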

Q: What is the preferred REFRESH loader method on DB2/z platform?

A: Typically the SQL based INSERT method of REFRESH is preferred because it
is easier to configure. The bulk load interface on DB2/z requires configuration
in order to utilize the LOAD utility. The DB2/z LOAD utility requires that CDC
stage the entire table to a disk file before starting LOAD.

Table sizes may affect load speed depending on the method chosen:

For small tables, the SQL method is typically faster, given the overhead
of staging bulk load files and calling the loader.

For medium tables which comfortably fit on DASD, the LOAD method is typically
more optimal than using SQL based INSERTS.

For very large tables, the significant disk resource requirement of staging the
entire LOAD file may not match existing resource availability, and may not
perform significantly better than the default SQL based REFRESH.

Q: Does CDC drop indexes on target tables before loading rows?

A: The answer depends on the platform:

CDC for DB2/z and CDC for DB2/400 do not drop indexes prior to load.

CDC for DB2 UDB by default does not drop indexes, but can be configured by
system parameter to optionally drop and re-create indexes when bulk loading.

CDC for Oracle will drop indexes prior to load and recreate them afterwards.

CDC for SQL Server and CDC for Sybase will drop indexes prior to load and
recreate them afterwards when using bulk loader.

Q: How can REFRESH time be reduced for tables with many indexes?

A: CDC may drop indexes on the target table prior to loading rows depending on
the load method available. When REFRESH completes, CDC will recreate any
indexes it had previously dropped one at a time until all indexes are recreated.
CDC then moves on to the next table to REFRESH. To optimize the loading of
multiple tables, and especially those with many indexes, manually drop indexes
except the primary key on the target tables prior to REFRESH. Once CDC reports
that the table has finished its REFRESH operation, manually re-create the
indexes outside of the CDC product. While some databases provide for
parallel index recreation, this may result in CPU and I/O bottlenecks on the target
database.

Q: How does recursion prevention affect REFRESH in a bi-directional
replication scenario?

A: The CDC target engine writes an entry to the DM_BOOKMARK table when
changing records in the target database. When the CDC recursion prevention
feature is enabled, the CDC log scraper detects when a transaction contains a
change to the DM_BOOKMARK table, and discards these transactions which
originated from CDC. The only time CDC does not write an entry to the
DM_BOOKMARK table during replication is during REFRESH bulk load
operations which are not logged and therefore would not be replicated by CDC.
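
The recursion prevention filter described above can be modeled in a few lines. This is an illustrative sketch, not the log scraper itself: a transaction is represented as a list of (table, operation) pairs, and the function name is hypothetical. Only the DM_BOOKMARK table name comes from the answer.

```python
BOOKMARK_TABLE = "DM_BOOKMARK"


def drop_cdc_transactions(transactions):
    # Keep only transactions that never touch the bookmark table; a
    # transaction that does touch it was written by the CDC target
    # engine and must not be replicated back.
    return [
        tx for tx in transactions
        if all(table != BOOKMARK_TABLE for table, _ in tx)
    ]
```

Rows loaded by a non-logged REFRESH bulk load never appear in the log at all, so in this model they never reach the filter.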

Q: What is "Differential Refresh" and how does it work?

A: The standard REFRESH method truncates the target table before bulk loading
rows. The Differential method, available as of CDC 6.2 and later, keeps the
target table online during the REFRESH operation.

Mirroring is stopped during refresh (same as standard refresh)
-Available for “standard replication” table mappings
-Target table remains online during refresh (for reporting / other uses)
-Requires same table structure for source and target
-Rows are compared by sort order of “key” columns
-Entire table is processed (no subset)
-All source table rows are sent to target

Repair differences between source and target
-Rows that are on source but not on target
-Rows that are on target but not on source
-Rows that differ in contents between source and target
-Differences can be logged to audit log table
-By default, CDC apply merges rows into the target table (can optionally be
configured to only audit the changes)
-Uses SQL apply (not bulk load)

User interface
-Management Console option on refresh
-Command line
-Table by table
-User initiated
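
The comparison of key-sorted rows described above can be sketched as a merge over two sorted streams. This is a minimal illustration of the general technique, not CDC's actual algorithm: rows are tuples, both inputs are assumed to be already sorted by the same key (as an ORDER BY on the key columns would give), and the function name is hypothetical.

```python
def diff_sorted(source_rows, target_rows, key):
    """Compare two key-sorted row sequences.

    Returns (inserts, deletes, updates): rows only on the source, rows
    only on the target, and source rows whose contents differ.
    """
    src, tgt = iter(source_rows), iter(target_rows)
    s, t = next(src, None), next(tgt, None)
    inserts, deletes, updates = [], [], []
    while s is not None or t is not None:
        if t is None or (s is not None and key(s) < key(t)):
            inserts.append(s)          # on source but not on target
            s = next(src, None)
        elif s is None or key(t) < key(s):
            deletes.append(t)          # on target but not on source
            t = next(tgt, None)
        else:
            if s != t:
                updates.append(s)      # same key, different contents
            s, t = next(src, None), next(tgt, None)
    return inserts, deletes, updates
```

Because each input is read once and in order, neither table needs to be held in memory, which is why the sort on the key columns is required on both sides.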

Q: What additional REFRESH capabilities may become available in future
CDC product releases?

A: CDC source and target (in a future release) may automatically detect and
support the case where during a REFRESH operation on a table, primary keys
are modified by an UPDATE statement on the table. The scenario of UPDATE
changing a primary key value is rare and has only ever been reported by
customers a few times. The workaround is to refresh such tables when the
affected database tables are idle, meaning that no UPDATE on keys is being
performed.

A: CDC source and target (in a future release) may support refreshing a subset
of the table via a WHERE clause that can be specified to reduce the amount of
data replicated during REFRESH. This is useful for tables with many partitions,
where only the newest data needs to be refreshed due to an operational issue
related to data which was newly replicated / in-scope.

A: CDC target (in a future release) may contain multiple parallel threads to
improve refresh performance for large tables.