1. Avoid SELECT *
The Data Flow Task (DFT) of SSIS uses a buffer-oriented architecture (a buffer is a chunk of memory) for data transfer and transformation. When data travels from the source to the destination, it first comes into a buffer, the required transformations are done in the buffer itself, and the data is then written to the destination.
The size of the buffer depends on several factors; one of them is the estimated row size, which is determined by summing the maximum sizes of all the columns in the row. The more columns in a row, the fewer rows fit in a buffer, and with more buffers required the result is performance degradation. Hence it is recommended to select only those columns which are required at the destination.
Even if you need all the columns from the source, you should name each column explicitly in the SELECT statement; otherwise, when you use SELECT *, the source takes an extra round trip to gather metadata about the columns.
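As a minimal sketch, assuming a hypothetical source table dbo.SalesOrder, the source query should name only the columns the destination actually needs:

```sql
-- Hypothetical source table; only the columns the destination needs are listed.
-- Avoid: SELECT * FROM dbo.SalesOrder
SELECT OrderID,
       CustomerID,
       OrderDate,
       TotalAmount
FROM dbo.SalesOrder;
```

Dropping unused wide columns (e.g. comment or description fields) from this list directly shrinks the estimated row size and lets more rows fit in each buffer.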
2. Avoid asynchronous transformation (such as Sort Transformation)
wherever possible
The number of buffers created depends on how many rows fit into a buffer, and how many rows fit into a buffer depends on a few other factors. The first
consideration is the estimated row size, which is the sum of the maximum sizes of
all the columns from the incoming records. The second consideration is the
DefaultBufferMaxSize property of the data flow task. This property specifies the
default maximum size of a buffer. The default value is 10 MB and its upper and
lower boundaries are constrained by two internal properties of SSIS which are
MaxBufferSize (100MB) and MinBufferSize (64 KB). It means the size of a
buffer can be as small as 64 KB and as large as 100 MB. The third factor is DefaultBufferMaxRows, another property of the data flow task, which specifies the default number of rows in a buffer. Its default value is 10,000.
Although SSIS does a good job of tuning these properties to create an optimum number of buffers, if the estimated size exceeds DefaultBufferMaxSize then it reduces the number of rows in the buffer. For better buffer performance you can do two things. First, remove unwanted columns from the source and set the data type of each column appropriately, especially if your source is a flat file; this lets you accommodate as many rows as possible in each buffer. Second, if your system has sufficient memory available, you can tune these properties to have a small number of large buffers, which could improve performance. Beware: if you change the values of these properties to a point where page spooling (see Best Practice #8) begins, it adversely impacts performance. So before you set a value for these properties, first test thoroughly in your environment and then set the values appropriately.
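As a rough sketch of the arithmetic involved, the engine caps the rows per buffer so the buffer never exceeds DefaultBufferMaxSize. The 200-byte row size below is an assumed example value, not a figure from this tip:

```sql
DECLARE @EstimatedRowSize     int = 200;       -- bytes: sum of max column sizes (assumed example)
DECLARE @DefaultBufferMaxSize int = 10485760;  -- 10 MB default
DECLARE @DefaultBufferMaxRows int = 10000;     -- default

-- Rows per buffer: DefaultBufferMaxRows, reduced if that many rows
-- would overflow DefaultBufferMaxSize.
SELECT CASE
         WHEN @EstimatedRowSize * @DefaultBufferMaxRows > @DefaultBufferMaxSize
           THEN @DefaultBufferMaxSize / @EstimatedRowSize
         ELSE @DefaultBufferMaxRows
       END AS RowsPerBuffer;  -- 10000 here, since 200 bytes * 10000 rows is under 10 MB
```

Doubling the row size to something well over 1,048 bytes is what pushes the engine into reducing rows per buffer.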
You can enable logging of the BufferSizeTuning event to learn how many rows a buffer contains, and you can monitor the "Buffers spooled" performance counter to see if SSIS has begun page spooling. I will talk more about event logging and performance counters in the next tips of this series.
4. How the DelayValidation property can help you
SSIS uses validation to determine whether the package could fail at runtime. It uses two types of validation. The first is package validation (early validation), which validates the package and all its components before starting the execution of the package. The second is component validation (late validation), which validates the components of the package once execution has started.
Let's consider a scenario where the first component of the package creates an object, e.g. a temporary table, which is referenced by the second component of the package. During package validation, the first component has not yet executed, so the object has not been created, causing a validation failure when the second component is validated. SSIS will throw a validation exception and will not start the package execution. So how will you get this package running in this common scenario?
To help you in this scenario, every component has a DelayValidation property (default = FALSE). If you set it to TRUE, early validation will be skipped and the component will be validated only at the component level (late validation), which happens during package execution.
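As a sketch of the scenario above (the table and column names are hypothetical), the first component could be an Execute SQL Task that creates a global temporary table which a later data flow reads; setting DelayValidation to TRUE on the data flow defers its validation until the table exists:

```sql
-- Executed by the first component (an Execute SQL Task):
CREATE TABLE ##StagingOrders
(
    OrderID   int,
    OrderDate datetime
);

-- Source query of the second component (an OLE DB Source inside a data
-- flow task with DelayValidation = TRUE). At package validation time
-- ##StagingOrders does not yet exist, so without the property early
-- validation of this component would fail.
SELECT OrderID, OrderDate FROM ##StagingOrders;
```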
5. Better performance with parallel execution
SSIS has been designed to achieve high performance by running the executables
of the package and data flow tasks in parallel. This parallel execution of the SSIS
package executables and data flow tasks can be controlled by two properties
provided by SSIS as discussed below.
MaxConcurrentExecutables - This is a property of the SSIS package and specifies the number of executables (the different tasks inside the package) that can run in parallel within a package; in other words, the number of threads the SSIS runtime engine can create to execute the executables of the package in parallel. As I discussed in Best Practice #6, the SSIS runtime engine executes the package, and every task defined in it (except the data flow task), in the defined workflow. So as long as your package has a sequential workflow (one task after another, with precedence defined by precedence constraints between tasks), this property makes no difference. But if your package workflow has parallel tasks, this property will make a difference. Its default value is -1, which means the total number of available processors + 2; if you have hyper-threading enabled, it is the total number of logical processors + 2.
EngineThreads - As I said, MaxConcurrentExecutables is a property of the package used by the SSIS runtime engine for parallel execution of package executables; likewise, data flow tasks have the EngineThreads property, which is used by the data flow pipeline engine and has a default value of 5 in SSIS 2005 and 10 in SSIS 2008. This property specifies the number of source threads (which pull data from the source) and worker threads (which do the transformations and load data into the destination) that can be created by the data flow pipeline engine to manage the flow and transformation of data inside a data flow task. It means that if EngineThreads has the value 5, then up to 5 source threads and up to 5 worker threads can be created. Please note, this property is just a suggestion to the data flow pipeline engine; the pipeline engine may create fewer or more threads if required.
Example - Let's consider a package with 5 data flow tasks in parallel and MaxConcurrentExecutables set to 3. When you start the package execution, the runtime will start executing 3 of the data flow tasks in parallel; the moment any executing data flow task completes, the next waiting data flow task starts, and so on. What happens inside each data flow task is controlled by the EngineThreads property. As I described in Best Practice #6, the work of a data flow task is broken down into one or more execution trees (you can see how many execution trees are created for a data flow task by turning on logging for its PipelineExecutionTrees event), and the data flow pipeline engine may create as many source and worker threads as the value of EngineThreads allows in order to execute one or more execution trees in parallel. Even if you have set EngineThreads to 5 and your data flow task is broken down into 5 execution trees, that does not mean all execution trees will run in parallel; to summarize, sequence matters here as well.
Be very careful when changing these properties; do thorough testing before final implementation. If properly configured within the constraints of available system resources, they improve performance by achieving parallelism; if poorly configured, they will hurt performance because of too much context switching from one thread to another. So, as a general rule of thumb, do not create and run more threads in parallel than the number of available processors.
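As a minimal sketch, assuming a server with 4 physical processors and no hyper-threading (these values are illustrative, not recommendations):

```
Package.MaxConcurrentExecutables = 6   -- explicit equivalent of the default -1 (4 processors + 2)
DataFlowTask.EngineThreads       = 4   -- a suggestion only; kept within the available processors
```

With these settings, at most 6 tasks of the package run concurrently, and each data flow task suggests up to 4 source threads and 4 worker threads to the pipeline engine.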
6. How the Checkpoint feature helps in package restarting
SSIS has a cool feature called Checkpoint, which allows your package to start from the last point of failure on the next execution. By enabling this feature you can skip the tasks that executed successfully and start the package execution from the task which failed in the last execution, saving a lot of time. You enable this feature for your package by setting three package properties (CheckpointFileName, CheckpointUsage and SaveCheckpoints). Apart from that, you need to set the FailPackageOnFailure property to TRUE for every task you want considered in restarting; with this property set, only a failure of that task fails the package, the information is captured in the checkpoint file, and on the subsequent execution, execution starts from that task.
So how does it work? When you enable checkpoints for a package, the execution status is written to the checkpoint file (the name and location of this file are specified with the CheckpointFileName property). On subsequent executions, the runtime engine refers to the checkpoint file to see the last execution status before starting the package execution; if it finds a failure in the last execution, it knows where that execution failed and starts from that task. So if you delete this file before the subsequent execution, the package will start from the beginning even though the last execution failed, as the engine has no way to identify this.
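Putting the properties together, a minimal sketch of the settings involved (the file path is hypothetical):

```
-- Package-level properties
CheckpointFileName = C:\SSIS\Checkpoints\MyPackage.chk   -- hypothetical path
CheckpointUsage    = IfExists
SaveCheckpoints    = True

-- On each task that should participate in restart
FailPackageOnFailure = True
```

With CheckpointUsage set to IfExists, the package restarts from the checkpoint file when one is present and runs from the beginning otherwise.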
By enabling this feature, you can save lots of time during subsequent executions (as data pulls or transformations on huge data volumes take much time) by skipping the tasks which executed successfully in the last run and starting from the task which failed. One very important point to note: you can enable a task, including a data flow task, to participate in checkpoints, but checkpoints do not apply inside the data flow task. In other words, you can enable checkpoints only at the data flow task level; you cannot set a checkpoint inside the data flow task, for example at the level of an individual transformation. Let's consider a scenario: you have a data flow task for which you have set FailPackageOnFailure to TRUE so it participates in checkpoints. Inside the data flow task you have five transformations in sequence, and execution fails at the 5th (the earlier 4 transformations completed successfully). On the subsequent execution, execution starts from the data flow task, and the first 4 transformations will run again before reaching the 5th one.
7. Lookup transformation consideration
In the data warehousing world, it's a frequent requirement to process records from a source by matching them against a lookup table. To perform this kind of transformation, SSIS provides a built-in Lookup transformation.
When you use the Flat File Connection Manager, it treats all columns as the string [DT_STR] data type. You should convert all numeric data to the appropriate data type, or it will slow down performance. Wondering how? SSIS uses a buffer-oriented architecture (refer to Best Practices #6 and #7 for more details): it pulls data from the source into buffers, does the transformations in the buffers, and passes the data to the destinations. The more rows SSIS can accommodate in a single buffer, the better the performance. By leaving all columns as the string data type, you force SSIS to reserve more space in the buffer for the numeric data as well (by treating it as string), and hence performance degrades.
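For example, assuming a flat file column named Quantity that arrives as DT_STR (the column name is an assumption for illustration), a Data Conversion transformation or a Derived Column expression can cast it to a four-byte signed integer so the buffer stores 4 bytes instead of a wide string:

```
-- SSIS Derived Column expression (hypothetical column name):
(DT_I4)Quantity

-- Equivalently, map Quantity -> four-byte signed integer [DT_I4]
-- in a Data Conversion transformation.
```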
Tip: Try to fit as many rows as you can into the buffer, which will reduce the number of buffers passing through the SSIS data flow pipeline engine and improve overall performance.