
http://blog.iadvise.eu/tag/etl/

Talend: Schema compatibility check


Posted on October 8, 2014 by Jessica Smets

Most of the time when talking about Talend jobs, people think of standard
ETL (Extract, Transform, Load). But in some cases there's a need to check
the incoming data before loading it into the target rather than just
transforming it. We refer to this process as E-DQ-L
(Extract, Data Quality, Load).
One of the things that you might want to check before loading is schema
compatibility. For example: you expect to get a String with a length of 5. If,
for any reason, you receive a String that is longer than 5 characters, it will
generate an error. Or perhaps you expect a percentage (as a BigDecimal, like
0.19), but you receive it as a String ("19%"). This will result in a failing job
with an error saying "Type mismatch: cannot convert
from dataType to otherDataType".
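The two cases above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `SchemaChecks` and its methods are hypothetical helpers and not part of Talend; only the length of 5 and the "19%" versus 0.19 example come from the text.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of Talend: checks the two cases described above.
final class SchemaChecks {

    // True when the value is non-null and no longer than maxLength characters.
    static boolean fitsLength(String value, int maxLength) {
        return value != null && value.length() <= maxLength;
    }

    // Converts a text such as "19%" to 0.19; returns null when the text
    // is not a valid percentage.
    static BigDecimal parsePercent(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        if (!trimmed.endsWith("%")) {
            return null;
        }
        String digits = trimmed.substring(0, trimmed.length() - 1).trim();
        try {
            // Dividing by 100 is done by moving the decimal point two places.
            return new BigDecimal(digits).movePointLeft(2);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

A check like this lets you decide up front whether a value can be converted, instead of letting the job fail on the conversion itself.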
Before I continue this blog I would like to emphasize that all the solutions
below are possible with the Data Integration version of Talend, except for the
last one. The last option requires a Talend Data Quality license.
Let's create an example case: we want to extract data on a regular basis
from a third-party source which we cannot fully trust in terms of schema settings. We know how many columns to expect and we have a rough
idea of what they contain, but we do not trust the source to always deliver
compatible data. We want to load the records that are valid and we want to
store the corrupt data separately for logging purposes. I've gathered
several solutions for this problem:
1. Use rejected flow on an input-component
One thing you can do is reject the records as soon as you import them.
Disable "Die on error" on the Basic settings tab of your input-component,
then right-click it and select Row > Reject. The rows will be rejected based on
the schema of the file. In the example below we defined the phone number as an
integer and, as you can see, 1 record is being rejected. This is because the
phone number contains characters and therefore cannot be read as an integer. If
you did not disable the "Die on error" option, this component would
make the job fail.

2. In case of the target being a database: use rejected links


You can also choose to directly input the data into your database, but to
reject any rows that would create an error. You can then create a separate
flow to determine what to do with these rejected records.
In your database output component (for example tOracleOutput) change the
following:

Basic settings: Uncheck Die on error

Advanced settings: Uncheck Use batch size

Now, right-click on your component, select Row > Reject and connect it
to an output-component. The output you'll receive will be the rejected rows
and the error that would have been generated if you had tried inserting them,
as you can see in the picture below.
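The idea behind a reject link can be sketched as follows: rows are inserted one at a time, and a failing row is collected together with its error message instead of stopping the load. This is a simplified illustration and not the code Talend generates. `RejectingLoader`, `RowInserter` and `Reject` are hypothetical names, and the inserter stands in for a real single-row database insert.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a row-by-row load with a reject flow.
final class RejectingLoader {

    // Stand-in for a single-row INSERT; a real implementation would use JDBC.
    interface RowInserter {
        void insert(String row) throws Exception;
    }

    // A rejected row together with the error it caused.
    static final class Reject {
        final String row;
        final String errorMessage;

        Reject(String row, String errorMessage) {
            this.row = row;
            this.errorMessage = errorMessage;
        }
    }

    // Inserts rows one at a time; failures are collected instead of
    // aborting the whole load (the equivalent of unchecking "Die on error").
    static List<Reject> load(List<String> rows, RowInserter inserter) {
        List<Reject> rejects = new ArrayList<>();
        for (String row : rows) {
            try {
                inserter.insert(row);
            } catch (Exception e) {
                rejects.add(new Reject(row, e.getMessage()));
            }
        }
        return rejects;
    }
}
```

Handling rows individually is what makes it possible to attribute an error to one specific record, which is presumably why the batch option has to be switched off for the reject link.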

3. Use a tFilterRow-component
You can make the data go through a filter-component before inserting it into
your target. You can (manually) decide what's allowed to go through. This
can be useful when your destination is not a database, in which case option
2 is not available.

A tFilterRow-component can also output the rejected rows,
including the reason why they got rejected. You can enable this by
right-clicking on your filter and selecting Row > Reject. An example of
rows rejected by the filter:

Note: You can also use self-defined routines in the tFilterRow-component by
checking "Use advanced mode". This can be useful when you want to check
whether or not a conversion is possible. For example: you could define a
routine called isInteger that returns true if the conversion is valid and
false if it's impossible.
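Such a routine could look roughly like the sketch below. The class name `SchemaRoutines` is an assumption. In Talend Studio a user routine would normally be a public class in the `routines` package, which is left out here to keep the example standalone.

```java
// Hypothetical user routine. In Talend Studio it would be a public class
// in the "routines" package; both are omitted to keep the sketch standalone.
final class SchemaRoutines {

    // Returns true if the text can be converted to an int, false otherwise.
    static boolean isInteger(String text) {
        if (text == null) {
            return false;
        }
        try {
            Integer.parseInt(text.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

In the advanced mode of the filter, the condition might then read `SchemaRoutines.isInteger(input_row.phone)`, assuming the incoming schema has a String column called `phone`.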
4. Use a tSchemaComplianceCheck-component
Another way of making sure that your schema is compatible is by using the
tSchemaComplianceCheck-component. Unfortunately, this component is only
included in the Data Quality version of Talend.
It's a very easy component to use. The only thing you have to do is connect
the incoming data to the tSchemaComplianceCheck-component and then
continue its flow to the destination. You can get the rejected rows the
same way as before (by right-clicking on it and then selecting Row > Reject).

The rejected rows and their error message look like this:

That's it for now. There are probably a lot of other ways of checking schema
compatibility. Feel free to comment if you know any. Thank you for reading!
Posted in Talend | Tagged ETL, Talend

Talend: tips and tricks part 2


Posted on August 26, 2014 by Jessica Smets

In the first part of these entries we discussed how to test your expressions,
the importance of optimizing the appearance of a tLogRow component and
how to handle windows and views within Talend. This time around, we will be
talking about the different ways to get components into your job, how to
trace your dataflow and how to easily sync columns. As last time, this post
will be useful for both starting and experienced users.
4. Getting components into your job
There are many ways to get components into your job. Most people search
the palette (either with the search function or by manually exploring the
folders) and drag and drop the components into their job. You can achieve the
same thing by simply clicking anywhere in your job and then typing
the name of the component. Obviously this is only recommended once
you're familiar with the different components and their names.

When working with metadata, you can use certain shortcuts to save a bit of
time. Usually people just click on the metadata and then drop it onto their
job. This will pop up a window allowing you to choose which type of
component you want to use. Holding the Control-key while dragging the
component will directly create an Output-component. Holding Control+Shift
will result in an Input-component.
5. Syncing columns
Occasionally, you may have to change the schema of a certain component in
the middle of development. This might affect other components in your job.
In some cases, Talend asks if you want to propagate the changes you've
made (to the other components).

You may accidentally close this window, click No, or not get this message at
all, resulting in the following error: "The schema from the input
link "youroutputlink" is different from the schema defined in the component."

When this happens, you can go to the basic settings of the component that
has the error and click on Sync columns. The error should now be gone.

6. Tracing your dataflow (Debug Run)


Lastly, I would like to say a few words about the debug run. In some cases
we want to closely watch our dataflow in order to get a better understanding
of what exactly is happening. You can achieve this by running your job in
debug mode. This can be done by clicking on the Run-window, then clicking on
the Debug Run tab on the left side of the window and starting it by clicking on
Traces Debug.

The moment you open the Debug Run tab, you'll immediately see extra
icons in your job. These magnifying glass icons indicate that details will be
shown when you debug-run your job. The result should look something like
this:

You can Pause and Resume the run at any time. You can also add breakpoints
if you like. Do this by right-clicking on a dataflow and then selecting Show
Breakpoint Setup.

This brings you to the Breakpoint tab of the data flow you clicked on. You
can also go there by clicking on the specific flow and manually selecting
Breakpoint. Let's add a breakpoint to pause our run whenever we come
across a record with "Bloom" as last name. First, make sure to check the
"Activate conditional breakpoint" option. After that, click on the plus-icon
underneath the conditions. Then select the InputColumn we want to put our
condition on, in our case Last_name, and add a value ("Bloom" in
this example). The default Operation is Equals, which is the one we want.
You can also specify a different Operation if you need to, but this is
unnecessary in this case.

You can add multiple breakpoints if you like. Whenever you debug-run your
job now, it will stop at a record where the Last_name is "Bloom" (if any
exist).
That's it for now. Thank you for reading!
Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend

Talend: tips and tricks part 1


Posted on August 4, 2014 by Jessica Smets

This blog contains some convenient tips and tricks that will make working
with the open source tool Talend for data integration a lot more efficient. This
blogpost will be especially useful for people who are just discovering this
amazing tool, yet I am sure that people who have been using it for a while
will also find it very helpful. This series of tips will be spread over multiple
blog entries, so make sure to check back often for future tips!
1. Testing expressions in the tMap component
Using the tMap component, you have the possibility to test your expressions.
This way you can easily see whether or not the result is what you expected it
to be. You can also use this to determine whether or not your expression will
produce an error. Let's create an example.
We've got details of employees as input for our tMap. We would like the first
name to be shown in uppercase. First of all, go into the expression builder by
clicking the ellipsis next to your expression.

To convert the first name to uppercase, we have to use the StringHandling
function UPCASE. This will result in the following expression:
StringHandling.UPCASE(employee.First_name)
After you're done filling in test values, click on the Test! button and wait for
the result. If everything goes as expected, you should see your first name in
uppercase on the right side of the window.
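If you want to try the same kind of expression outside the Studio, a plain-Java stand-in is enough. The sketch below assumes that UPCASE is a null-safe uppercase conversion. It is not the source of Talend's StringHandling routine, and `UpcaseSketch` is a hypothetical name.

```java
import java.util.Locale;

// Plain-Java stand-in for the expression above; not Talend's own routine.
final class UpcaseSketch {

    // Null-safe uppercase conversion. Locale.ROOT keeps the result
    // independent of the machine's default locale.
    static String upcase(String value) {
        return value == null ? null : value.toUpperCase(Locale.ROOT);
    }
}
```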

2. Optimizing the appearance of the tLogRow component output

tLogRow is one of the most frequently used components. It is recommended
that you learn how to optimize its use. First, make sure that you always
have the right appearance selected for your output. You can find this
property in the basic settings of your tLogRow-component.

There are three types of Modes that you can choose between:

Basic

Basic will generate a new line for each record, with the fields separated by
the Field Separator you've chosen (see image above). When using basic mode, I
highly recommend checking the "Print header" option when working with
multi-column records or multiple outputs, purely for readability.

Table (print values in cells of a table)

The table mode shows the records and their headers in a table-format,
including the name of the component that generated this output (in our
case: tLogRow_1). This emphasizes the importance of properly naming
everything, especially when you have multiple components that generate
output. In this case, it would have been better to rename our component to
EMPLOYEES. Personally, I prefer this mode.

Vertical (each row is a key value/list)

Vertical mode will show a table for each one of your records.

The output mode you decide to use depends on what you're trying to
visualize. For example, when your goal is to show a single string, I would
recommend using the basic mode. But when you have multiple table outputs
(for example: departments, customers and employees in a single output), I'm
certain the table mode would be the best option.
Sometimes your data is spread over multiple lines, resulting in an unclear
output, as shown in the image below.

To force the output to put all the data on one single line, you can uncheck the
Wrap option. This option is located underneath your output and will enable
a horizontal scrollbar.

Do you also want to be able to get data regarding tweets using Talend, as
shown in the image above? Read my previous blogpost and find out how!
3. Resetting windows and maximizing/minimizing them
Sometimes you accidentally close a window and have a hard time finding a
way to get it back. You can very easily reset your environment by clicking on
Window > Reset Perspective.

You can see all of the views by clicking on Window > Show View >
Talend. Some of the views are not shown by default, such as Modules.
Modules can be used to import .jar files without having to restart your studio,
which will most likely save you some time.
Lastly, because Talend is Eclipse-based, you have the possibility to maximize
and minimize windows. I personally use this function when examining the
output of a tLogRow-component including a lot of data. You can achieve this
by either double-clicking on the window or by right-clicking on it and
selecting Minimize/Maximize.
That's it for now. I hope you enjoyed reading this blog. Make sure to
return soon for future blogs!
Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend

Use of contexts within Talend


Posted on May 27, 2014 by Dieter Van Ransbeek

When developing jobs in Talend, it's sometimes necessary to run them on
different environments. For other business cases, you need to pass values
between multiple sub-jobs in a project. To solve these kinds of issues, Talend
introduced the notion of contexts.
In this blogpost we elaborate on the usage of contexts for easily switching
between a development and a production environment by storing the
connection data in context variables. This allows you to determine on which
environment the job should run, at runtime, without having to recompile or
modify your project.
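Conceptually, a context group is a set of named variables with one value per context. The sketch below illustrates that idea in plain Java. It is not how Talend stores contexts internally, and the variable names and host names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a context group with two contexts.
// Variable names and values are invented.
final class ContextSketch {

    // One set of connection variables per context (environment).
    static Map<String, Map<String, String>> contexts() {
        Map<String, String> development = new HashMap<>();
        development.put("HR_Server", "dev-db.example.com");
        development.put("HR_Port", "1521");

        Map<String, String> production = new HashMap<>();
        production.put("HR_Server", "prod-db.example.com");
        production.put("HR_Port", "1521");

        Map<String, Map<String, String>> all = new HashMap<>();
        all.put("Development", development);
        all.put("Production", production);
        return all;
    }

    // Resolves a variable for the context chosen at runtime.
    static String resolve(String contextName, String variable) {
        Map<String, String> selected = contexts().get(contextName);
        if (selected == null) {
            throw new IllegalArgumentException("Unknown context: " + contextName);
        }
        return selected.get(variable);
    }
}
```

The job itself only refers to the variable name. Inside a Talend job, context variables are referenced in component expressions as `context.variableName`, and the context selected for the run determines which value is used.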
To start using contexts in Talend you have two possible scenarios:
1) you can create a new context group and its corresponding context
variables manually, or
2) you can export an existing connection as a context.
In this example we'll go over exporting an existing Oracle connection as a
context.
Double-click an existing database connection to edit it and click Next.
Click Export as context.

NOTE There are some connections that don't allow you to export them as a
context. In that case you'll have to create the context group and its variables
manually, add the group/variables to your job, and use the variables in the
properties of the components of your job.

After you've clicked the Export as context button you'll see the Create/Edit
context group screen. Enter a name, purpose and description and click Next.

Now you'll see all the context variables that belong to this context group.
Notice that Talend has already created all the context variables that are
needed for the HR connection. If you want to change their names you can
simply click them and they become editable.
Click the Values as table tab.

In the Values as table tab you can edit the values of the context variables by
simply clicking the value and changing it. To add a new context, click the
context symbol in the upper right corner.

The window that pops up is used to manage contexts. To create a new
context, click New, enter the name of the context (in our
example Production) and click Ok. To rename the Default context, select it,
click Edit, enter Development and click Ok. When you're done editing,
click Ok.

After the window closes, you'll see that an extra column has appeared. Enter the
connection data of the production environment in the Production column and
click Finish.

In the connection window it's possible to check the connection again, but this
time you'll be asked which connection you want to check.

Verify that both the connections work and click Finish.


Now that we've exported the connection as a context, it's possible to use it
in a job. Create a new job, use the connection that has been exported as a
context and connect it to a tLogRow component. Your job should look
something like this:

When using a connection that has been exported as a context in a job, you
have to include the context variables in order for your job to be able to run.
Go to the context tab and click the context button in the bottom left.
NOTE When using one of the newer versions, Talend proposes to add missing
context variables whenever you try to run a job. Because of this, you don't
need to add them manually as described in this example.

Select the context group that contains the context variables, in our case the
HR context group.

Select the contexts you want to include and click OK.

NOTE A context group can also be added to a job by simply selecting the
context from the repository, dragging it towards the context tab of the job,
and dropping it there.
Once you've added the context group to the job, it's possible to run the job
for both the development and production environment by selecting the
context in the dropdown menu of the Run tab.

Posted in data integration, ETL, Talend | Tagged Contexts, ETL, Talend

Connecting to Salesforce and Mailchimp using Talend
Posted on November 25, 2013 by liesbethvanraemdonck

A lot of companies use Salesforce to manage their customers and contacts.
In addition, Mailchimp can be used for sending out mailings to these
connections. Mailchimp also captures information about what people did with
these mails. This can be useful information for your CRM. A while ago, I was
asked to make a list of everyone who had opened their mails in
Mailchimp. Let me show you how easy it is to do something like that with
Talend.
In Talend:

we can get a list of email addresses from Mailchimp of recipients who
opened a mail

and we can ask Salesforce for the email addresses and names of all
our connections

and we can also use a mapping component to join these lists.

Talend has a standard interface with Salesforce. And Mailchimp offers lots of
RESTful web services, which we can make use of in our Talend job.
1. Connecting to Salesforce
Right click Salesforce under the Metadata and choose Create Salesforce
Connection.

After choosing a name for our connection, all we need to fill in is the
username and password for our Salesforce connection. The rest is already
filled in for us.

To enable the Finish button, we need to check our properties first, using the
button Check login.
Under Metadata, we can now browse through all our Salesforce-data.

Now you're probably wondering how to use this data in your ETL flow. Well,
that's even easier!
Simply drag one of the tables (with the blue icons) into your job and choose
the tSalesforceInput component from its 3 suggestions.

After specifying the necessary mappings you should get something like this:

We've used Contact and Account data from Salesforce for this.

In the next part, let's check out how we generated the list of email
addresses.
2. Connecting to Mailchimp

Accessing your Mailchimp data is a bit harder. We need two components
from the Talend palette: the tRest component, because we need to use a
RESTful web service for requesting our data from Mailchimp, and the
tExtractJSONFields component for interpreting the data we receive back.
After dragging the tRest component to your job, choose POST as the
method and fill in the URL, corresponding to the report you wish to receive.

If you want to receive your report in XML-format instead of JSON, just add
.xml at the end of the URL.
Here we needed the Mailchimp report that gives us information on opened
emails.
If you are interested in other kinds of reports, you can find the list here:
http://apidocs.mailchimp.com/api/2.0/#lists-methods
Every request needs certain parameters. We can specify them in the HTTP
body field, like this:
"{\"apikey\": \"your api key will be here\",\"cid\": \"put a campaign id here\"}"
The API key will always be needed as the first parameter. You can find it in
Mailchimp under your Account Settings > Extras.
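Because the HTTP body is a Java string, every double quote inside the JSON has to be escaped, which is easy to get wrong by hand. The following is a small sketch of a helper that builds the body from two values. `MailchimpBody` is a hypothetical name. The keys `apikey` and `cid` come from the example above, and the escaping is deliberately minimal.

```java
// Hypothetical helper for building the HTTP body shown above.
final class MailchimpBody {

    // Minimal escaping for backslashes and double quotes only;
    // control characters are not handled in this sketch.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds {"apikey": "...","cid": "..."} with the API key as first parameter.
    static String build(String apiKey, String campaignId) {
        return "{\"apikey\": \"" + escape(apiKey)
                + "\",\"cid\": \"" + escape(campaignId) + "\"}";
    }
}
```

Building the body from variables also makes it straightforward to request the same report for a different campaign id.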

The second component we need is called tExtractJSONFields. After dragging
it to our job, we link our first component to it.

We can use Edit schema, to define the data we want to extract.

Finally, all we need to do is specify the location of the data we are interested
in, for example the email field inside the member field.

Now that we're able to access our data from Mailchimp, let's take a look at
how we used it for generating the list of email addresses.

First we asked Mailchimp for all our campaigns, then we used the
tFlowToIterate component so we could ask Mailchimp for the email
addresses once for every campaign in the list:

Finally, all we had to do was put these two jobs together and press Run.
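The join between the two lists is what the mapping component does in the job. As a rough plain-Java equivalent, the sketch below looks up each Mailchimp address in the Salesforce contacts and keeps only the matches. `OpenedMailJoin` is a hypothetical name, and matching on the lower-cased email address is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the join done in the mapping component.
final class OpenedMailJoin {

    // Returns the names of contacts whose email address appears in the
    // list of addresses that opened a mail.
    static List<String> namesOfOpeners(Map<String, String> nameByEmail,
                                       List<String> openedEmails) {
        // Normalize the lookup keys so the match is case-insensitive.
        Map<String, String> lookup = new HashMap<>();
        for (Map.Entry<String, String> entry : nameByEmail.entrySet()) {
            lookup.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
        }
        List<String> names = new ArrayList<>();
        for (String email : openedEmails) {
            String name = lookup.get(email.toLowerCase(Locale.ROOT));
            if (name != null) {
                // Inner join: addresses without a matching contact are dropped.
                names.add(name);
            }
        }
        return names;
    }
}
```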
So, I hope you'll enjoy it as much as I did!
