1. Avoid SELECT *
The Data Flow Task (DFT) of SSIS uses a buffer-oriented architecture (a buffer is a chunk of memory) for data transfer and transformation. When data travels from the source to the destination, it first comes into a buffer, the required transformations are done in the buffer itself, and the data is then written to the destination.
The size of the buffer depends on several factors; one of them is the estimated row size, which is determined by summing the maximum sizes of all the columns in the row. The more columns in a row, the fewer rows fit in a buffer, and with more buffers required the result is performance degradation. Hence it is recommended to select only those columns which are required at the destination.
Even if you need all the columns from the source, you should name each column explicitly in the SELECT statement; otherwise, when you use SELECT *, the source takes an extra round trip to gather metadata about the columns.
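As a minimal sketch, assuming a hypothetical source table dbo.SalesOrder, the source query should name only the columns the destination actually needs:

```sql
-- Hypothetical source table; only the columns the destination needs are listed.
-- Avoid: SELECT * FROM dbo.SalesOrder
SELECT OrderID,
       CustomerID,
       OrderDate,
       TotalAmount
FROM dbo.SalesOrder;
```

Dropping unused wide columns (e.g. comment or description fields) from this list directly shrinks the estimated row size and lets more rows fit in each buffer.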
2. Avoid asynchronous transformation (such as Sort Transformation)
wherever possible
The number of buffers created depends on how many rows fit into a buffer, and how many rows fit into a buffer depends on a few other factors. The first
consideration is the estimated row size, which is the sum of the maximum sizes of
all the columns from the incoming records. The second consideration is the
DefaultBufferMaxSize property of the data flow task. This property specifies the
default maximum size of a buffer. The default value is 10 MB and its upper and
lower boundaries are constrained by two internal properties of SSIS which are
MaxBufferSize (100MB) and MinBufferSize (64 KB). It means the size of a
buffer can be as small as 64 KB and as large as 100 MB. The third factor is DefaultBufferMaxRows, another property of the data flow task, which specifies the default number of rows in a buffer. Its default value is 10,000.
Although SSIS does a good job of tuning these properties to create an optimum number of buffers, if the estimated size exceeds DefaultBufferMaxSize then it reduces the number of rows in the buffer. For better buffer performance you can do two things. First, remove unwanted columns from the source and set the data type of each column appropriately, especially if your source is a flat file; this lets you accommodate as many rows as possible in each buffer. Second, if your system has sufficient memory available, you can tune these properties to have a small number of large buffers, which could improve performance. Beware: if you change the values of these properties to a point where page spooling (see Best Practice #8) begins, it adversely impacts performance. So before you set a value for these properties, first test thoroughly in your environment and then set the values appropriately.
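As a rough sketch of the arithmetic involved, the engine caps the rows per buffer so the buffer never exceeds DefaultBufferMaxSize. The 200-byte row size below is an assumed example value, not a figure from this tip:

```sql
DECLARE @EstimatedRowSize     int = 200;       -- bytes: sum of max column sizes (assumed example)
DECLARE @DefaultBufferMaxSize int = 10485760;  -- 10 MB default
DECLARE @DefaultBufferMaxRows int = 10000;     -- default

-- Rows per buffer: DefaultBufferMaxRows, reduced if that many rows
-- would overflow DefaultBufferMaxSize.
SELECT CASE
         WHEN @EstimatedRowSize * @DefaultBufferMaxRows > @DefaultBufferMaxSize
           THEN @DefaultBufferMaxSize / @EstimatedRowSize
         ELSE @DefaultBufferMaxRows
       END AS RowsPerBuffer;  -- 10000 here, since 200 bytes * 10000 rows is under 10 MB
```

Doubling the row size to something well over 1,048 bytes is what pushes the engine into reducing rows per buffer.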
You can enable logging of the BufferSizeTuning event to learn how many rows a buffer contains, and you can monitor the "Buffers spooled" performance counter to see if SSIS has begun page spooling. I will talk more about event logging and performance counters in the next tips of this series.
4. How the DelayValidation property can help you
SSIS uses validation to determine whether the package could fail at runtime. It uses two types of validation. The first is package validation (early validation), which validates the package and all its components before starting the execution of the package. The second is component validation (late validation), which validates the components of the package once execution has started.
Let's consider a scenario where the first component of the package creates an object, e.g. a temporary table, which is referenced by the second component of the package. During package validation, the first component has not yet executed, so the object has not been created, causing a validation failure when the second component is validated. SSIS will throw a validation exception and will not start the package execution. So how will you get this package running in this common scenario?
To help you in this scenario, every component has a DelayValidation property (default = FALSE). If you set it to TRUE, early validation will be skipped and the component will be validated only at the component level (late validation), which happens during package execution.
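As a sketch of the scenario above (the table and column names are hypothetical), the first component could be an Execute SQL Task that creates a global temporary table which a later data flow reads; setting DelayValidation to TRUE on the data flow defers its validation until the table exists:

```sql
-- Executed by the first component (an Execute SQL Task):
CREATE TABLE ##StagingOrders
(
    OrderID   int,
    OrderDate datetime
);

-- Source query of the second component (an OLE DB Source inside a data
-- flow task with DelayValidation = TRUE). At package validation time
-- ##StagingOrders does not yet exist, so without the property early
-- validation of this component would fail.
SELECT OrderID, OrderDate FROM ##StagingOrders;
```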
5. Better performance with parallel execution
SSIS has been designed to achieve high performance by running the executables
of the package and data flow tasks in parallel. This parallel execution of the SSIS
package executables and data flow tasks can be controlled by two properties
provided by SSIS as discussed below.
MaxConcurrentExecutables - This is a property of the SSIS package and specifies the number of executables (the different tasks inside the package) that can run in parallel within a package; in other words, the number of threads the SSIS runtime engine can create to execute the executables of the package in parallel. As I discussed in Best Practice #6, the SSIS runtime engine executes the package, and every task defined in it (except the data flow task), in the defined workflow. So as long as your package has a sequential workflow (one task after another, with precedence defined by precedence constraints between tasks), this property makes no difference. But if your package workflow has parallel tasks, this property will make a difference. Its default value is -1, which means the total number of available processors + 2; if you have hyper-threading enabled, it is the total number of logical processors + 2.
EngineThreads - As I said, MaxConcurrentExecutables is a property of the package used by the SSIS runtime engine for parallel execution of package executables; likewise, data flow tasks have the EngineThreads property, which is used by the data flow pipeline engine and has a default value of 5 in SSIS 2005 and 10 in SSIS 2008. This property specifies the number of source threads (which pull data from the source) and worker threads (which do the transformations and load data into the destination) that can be created by the data flow pipeline engine to manage the flow and transformation of data inside a data flow task. It means that if EngineThreads has the value 5, then up to 5 source threads and up to 5 worker threads can be created. Please note, this property is just a suggestion to the data flow pipeline engine; the pipeline engine may create fewer or more threads if required.
Example - Let's consider a package with 5 data flow tasks in parallel and MaxConcurrentExecutables set to 3. When you start the package execution, the runtime will start executing 3 of the data flow tasks in parallel; the moment any executing data flow task completes, the next waiting data flow task starts, and so on. What happens inside each data flow task is controlled by the EngineThreads property. As I described in Best Practice #6, the work of a data flow task is broken down into one or more execution trees (you can see how many execution trees are created for a data flow task by turning on logging for its PipelineExecutionTrees event), and the data flow pipeline engine may create as many source and worker threads as the value of EngineThreads allows in order to execute one or more execution trees in parallel. Even if you have set EngineThreads to 5 and your data flow task is broken down into 5 execution trees, that does not mean all execution trees will run in parallel; to summarize, sequence matters here as well.
Be very careful when changing these properties; do thorough testing before final implementation. If properly configured within the constraints of available system resources, they improve performance by achieving parallelism; if poorly configured, they will hurt performance because of too much context switching from one thread to another. So, as a general rule of thumb, do not create and run more threads in parallel than the number of available processors.
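As a minimal sketch, assuming a server with 4 physical processors and no hyper-threading (these values are illustrative, not recommendations):

```
Package.MaxConcurrentExecutables = 6   -- explicit equivalent of the default -1 (4 processors + 2)
DataFlowTask.EngineThreads       = 4   -- a suggestion only; kept within the available processors
```

With these settings, at most 6 tasks of the package run concurrently, and each data flow task suggests up to 4 source threads and 4 worker threads to the pipeline engine.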
6. How the Checkpoint feature helps in package restarting
SSIS has a cool feature called Checkpoint, which allows your package to start from the last point of failure on the next execution. By enabling this feature you can skip the tasks that executed successfully and start the package execution from the task which failed in the last execution, saving a lot of time. You enable this feature for your package by setting three package properties (CheckpointFileName, CheckpointUsage and SaveCheckpoints). Apart from that, you need to set the FailPackageOnFailure property to TRUE for every task you want considered in restarting; with this property set, only a failure of that task fails the package, the information is captured in the checkpoint file, and on the subsequent execution, execution starts from that task.
So how does it work? When you enable checkpoints for a package, the execution status is written to the checkpoint file (the name and location of this file are specified with the CheckpointFileName property). On subsequent executions, the runtime engine refers to the checkpoint file to see the last execution status before starting the package execution; if it finds a failure in the last execution, it knows where that execution failed and starts from that task. So if you delete this file before the subsequent execution, the package will start from the beginning even though the last execution failed, as the engine has no way to identify this.
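Putting the properties together, a minimal sketch of the settings involved (the file path is hypothetical):

```
-- Package-level properties
CheckpointFileName = C:\SSIS\Checkpoints\MyPackage.chk   -- hypothetical path
CheckpointUsage    = IfExists
SaveCheckpoints    = True

-- On each task that should participate in restart
FailPackageOnFailure = True
```

With CheckpointUsage set to IfExists, the package restarts from the checkpoint file when one is present and runs from the beginning otherwise.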
By enabling this feature, you can save lots of time during subsequent executions (as data pulls or transformations on huge data volumes take much time) by skipping the tasks which executed successfully in the last run and starting from the task which failed. One very important point to note: you can enable a task, including a data flow task, to participate in checkpoints, but checkpoints do not apply inside the data flow task. In other words, you can enable checkpoints only at the data flow task level; you cannot set a checkpoint inside the data flow task, for example at the level of an individual transformation. Let's consider a scenario: you have a data flow task for which you have set FailPackageOnFailure to TRUE so it participates in checkpoints. Inside the data flow task you have five transformations in sequence, and execution fails at the 5th (the earlier 4 transformations completed successfully). On the subsequent execution, execution starts from the data flow task, and the first 4 transformations will run again before reaching the 5th one.
7. Lookup transformation consideration
In the data warehousing world, it's a frequent requirement to process records from a source by matching them against a lookup table. To perform this kind of transformation, SSIS provides a built-in Lookup transformation.
When you use the Flat File Connection Manager, it treats all columns as the string [DT_STR] data type. You should convert all numeric data to the appropriate data type, or it will slow down performance. Wondering how? SSIS uses a buffer-oriented architecture (refer to Best Practices #6 and #7 for more details): it pulls data from the source into buffers, does the transformations in the buffers, and passes the data to the destinations. The more rows SSIS can accommodate in a single buffer, the better the performance. By leaving all columns as the string data type, you force SSIS to reserve more space in the buffer for the numeric data as well (by treating it as string), and hence performance degrades.
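For example, assuming a flat file column named Quantity that arrives as DT_STR (the column name is an assumption for illustration), a Data Conversion transformation or a Derived Column expression can cast it to a four-byte signed integer so the buffer stores 4 bytes instead of a wide string:

```
-- SSIS Derived Column expression (hypothetical column name):
(DT_I4)Quantity

-- Equivalently, map Quantity -> four-byte signed integer [DT_I4]
-- in a Data Conversion transformation.
```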
Tip: Try to fit as many rows as you can into the buffer, which will reduce the number of buffers passing through the SSIS data flow pipeline engine and improve overall performance.