
1) Build the right architecture

Every technology has limits, and HANA is constrained by the same laws of physics as
any other computer system. The major differences with HANA are:

a) Data is kept entirely in-memory and therefore accessing all data has the same
latency
b) Data is compressed in-memory and only uncompressed in the CPU itself
c) Data may be loaded into all cores (up to 80) simultaneously and a single instruction
executed against all at the same time (massively parallel)

Taken together, these mean that by far the biggest challenge with HANA is getting optimized
data in fast enough. Networks bottom out at around 300MB/sec, disks are often even slower,
so if you need to load data into HANA fast, you need to think carefully about your
design.

2) Optimizing the data model

HANA doesn't work like Oracle, so you need to adjust your thinking. Here are the
golden rules for HANA tables:

Keep tables as narrow as possible. Columnar inserts have a cost in HANA, so data load
performance drops as the number of columns increases. Don't build out fields you don't
need. Interestingly, very wide tables also have lower throughput in MB/sec than narrower
tables: once you get over 80-100 columns, you can see up to a 25% degradation in
throughput.

Consider your partitioning strategy. On a single-node appliance, 3-5 partitions is
normally a good number for optimal insert performance. On a scale-out appliance, you
can multiply this by the number of active nodes over which you distribute the tables.

Avoid keys if you don't need them. In fact, avoid using the HANA database for
referential integrity at all if you can - it is expensive.

Avoid VARCHARs if you don't need them. Fixed-width CHAR is better for smaller
fields, and INTEGER fields are better still. VARCHARs have a cost and don't
compress as well.

Good:

CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIME,
"FIELD1" CHAR(1),
"FIELD2" CHAR(16),
"FIELD3" CHAR(4),
"MEASURE1" INT,
"MEASURE2" DECIMAL(7,4))
PARTITION BY ROUNDROBIN PARTITIONS 12;

Bad:

CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIMESTAMP,
"FIELD1" VARCHAR(255),
"FIELD2" VARCHAR(255),
"FIELD3" VARCHAR(255),
"MEASURE1" BIGINT,
"MEASURE2" DECIMAL(15,7),
"UNUSED_FIELD" VARCHAR(2000));

Note that in my test example here, the Good table loads 6x faster. I use 12 partitions
because I have a 4-node environment with one master node, with the table partitions
distributed over the remaining nodes - 4 partitions on each of 3 nodes. This gives
maximum table distribution and therefore throughput.

Note that merge deltas also operate better with multiple partitions. But if you really want to
learn about table partitioning, read the "Table Partitioning in the SAP HANA Database"
section of the SAP HANA Administration Guide, here: SAP HANA Appliance Software
SAP Help Portal Page
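
To sanity-check how the partitions of a table are actually distributed across nodes, you can query the column store monitoring views. This is a minimal sketch, assuming the schema and table names from the example above; I believe M_CS_TABLES reports one row per column-store partition with its host and record count.

SELECT HOST, PART_ID, RECORD_COUNT
FROM M_CS_TABLES
WHERE SCHEMA_NAME = 'SCHEMA' AND TABLE_NAME = 'TABLE'
ORDER BY HOST, PART_ID;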

3) Get the upper bound by optimizing a control file load

I always test control file loads with HANA first, using CSV files. If your files are in the
right format then control files are always the fastest way to load into HANA. They were
written by the bulk loader team and you can get amazing load performance. This is how
I do it:

a) Turn off database logging

On a single-node appliance, or a scale-out appliance like IBM or Hitachi that uses
Fusion-IO for logs, you will get the same performance with and without logging enabled.
The only difference is that with logging enabled you will eventually fill up the log disk and
crash the database. Turn logging off until you do your final testing.

You do this by double-clicking your database node on the left-hand side of HANA
Studio, then going to Configuration -> global.ini -> persistence. Set enable_auto_log_backup to
no and log_mode to overwrite.
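
If you prefer SQL to clicking through HANA Studio, the same two parameters can be set with ALTER SYSTEM ALTER CONFIGURATION. This is a minimal sketch, assuming you have the privileges to change system-layer ini parameters:

ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM')
SET ('persistence', 'log_mode') = 'overwrite',
    ('persistence', 'enable_auto_log_backup') = 'no'
WITH RECONFIGURE;
-- note: a change of log_mode may only take effect after a database restart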

b) Get the file in CSV format with the minimum number of columns

I usually use a bash script to do this, with a combination of grep, awk, sed, cut and even
perl if it gets ugly. You can convert fixed format, or any format, into nice clean CSV files.
Perl even has a CSV library you can use for this. Here's a neat example that
sucks in your CSV and gives you the number of rows, the number of columns and the
maximum field length in each column.

#!/bin/bash
# Report the number of rows and columns in test.csv, plus the maximum
# field length (in characters) found in each column.
COLUMNS=$(awk 'BEGIN {FS=","} END {print NF}' test.csv)
ROWS=$(wc -l < test.csv)
echo "$ROWS rows"
echo "$COLUMNS columns"
a=1
while (( a <= COLUMNS )); do
  echo -n "$a "
  cut -f "$a" -d, < test.csv | awk '{ if (length($0) > max) max = length($0) } END { print max }'
  (( a++ ))
done

Quite often, I create demo data in my lab based on a customer's schema. I use one of a
few tricks to do this:

i) Excel

Excel is great for this and you can create a million rows at a time using tricky functions
like this (put your values, weighted, in Sheet 2 Column A):

=INDEX(Sheet2!$A:$A,RANDBETWEEN(1,COUNTA(Sheet2!$A:$A)),1)

Then I copy the CSV file to the HANA appliance and run a bunch of commands like:

for a in {1..9}; do cat test.csv | sed "s/2013-02-04/2013-02-0$a/"; done >> testbig.csv
for a in {10..28}; do cat test.csv | sed "s/2013-02-04/2013-02-$a/"; done >> testbig.csv

This allows me to explode a 1m row file to as many rows as I like.

ii) Data Generator

For more complex files I use a tool called DTM Data Generator. This allows me to
generate complex files at about 10,000 rows/sec to a CSV file (don't use it to generate
data directly into HANA, performance sucks). Overnight I can generate about 360m
rows, which is enough for testing. It costs $149 for the Standard Edition, which allows
10m rows per job. So I create the job I want and duplicate it 50 times. If I wasn't so
cheap, I'd buy the Enterprise Edition at $349, which allows 2bn rows per job and
multithreads.

c) Time the load of the file to /dev/null

# du -m table.csv
30358 table.csv

# time cat table.csv > /dev/null
real 0m9.667s

So I've put this 346m row, 30GB file on a Fusion-IO disk and I can read it at
3140MB/sec (or 35.8m rows/sec). I now know that this is the absolute maximum rate at
which I could possibly ingest it into HANA. Bear in mind that in the real world it's really
hard to move data around at more than 100MB/sec - networks are slow.
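
If you want to reproduce that arithmetic, it is just file size over read time; a throwaway query against HANA's DUMMY table does the job. A trivial sketch using the figures above (30358MB and 346m rows read in 9.667 seconds):

SELECT ROUND(30358 / 9.667, 0) AS mb_per_sec,
       ROUND(346000000 / 9.667, 0) AS rows_per_sec
FROM DUMMY;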

d) Build a control file

Now I create a control file, like this:

cat > /sapmnt/log/test.ctl
import data
into table SCHEMA.TABLE
from '/sapmnt/log/test.csv'
record delimited by '\n'
field delimited by ','
optionally enclosed by '"'
error log /sapmnt/log/test.bad

And what's great is that we can iterate over a) the number of partitions, b) the number of
threads and c) the batch size.
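
The control file is then executed with an IMPORT FROM statement, which is where the thread count and batch size live (the partition count is set on the table itself); the full load script is shown in section f) below.

IMPORT FROM '/sapmnt/log/test.ctl' WITH THREADS 120 BATCH 200000;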

e) Make sure the load runs well

Make sure you aren't generating lines in the BAD file, because this slows down loading
substantially. The data needs to be clean. Fix it with the UNIX tools above (usually sed,
cut, awk and grep) if necessary.

f) Iterate, iterate, iterate

I make sure my CSV file has enough data to make HANA think (20-30GB is a good
amount) and I iterate. For my example, my 346m row file is 30358MB (30GB). With the
default options (1 partition, 1 thread, default batch size), the load happens in 1 hour, 35
minutes 30 seconds. This is not HANA speed!

So I start with a rough guess at the parameters:

5 partitions per active table node (15 partitions in my example)
The same number of threads as CPUs x active nodes (120 threads in my example)
A batch size that depends on the width of the table. For a simple table like this, I run with
200000; for much wider tables with 50-100 columns, I start with 20000.

With the SQL below, I get the load time down from 5730s to 88s. That's
more like it, and now we start tweaking to find the optimal settings for those 3
parameters via iteration. I expect this to improve performance by a further 30% in most
cases. You may not think it's worth it, but that extra 30% matters to me.

DROP TABLE "SCHEMA"."TABLE";
CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIME,
"FIELD1" CHAR(1),
"FIELD2" CHAR(16),
"FIELD3" CHAR(4),
"MEASURE1" INT,
"MEASURE2" DECIMAL(7,4))
PARTITION BY ROUNDROBIN PARTITIONS 15;
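-- Disable automerge during the bulk load; a single manual merge delta afterwards is cheaper than repeated merges mid-load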
ALTER TABLE "SCHEMA"."TABLE" DISABLE AUTOMERGE;
IMPORT FROM '/sapmnt/log/test.ctl' WITH THREADS 120 BATCH 200000;
MERGE DELTA OF "SCHEMA"."TABLE";
ALTER TABLE "SCHEMA"."TABLE" ENABLE AUTOMERGE;

And I run it lots of times. I iterate on threads, then partitions, then batch size, and sometimes
round again. I build up a table of load times and tune until I feel like I've got HANA speed.
The key is to copy and paste this statement 50 times into a notepad file and change the
parameters for each load. Execute it as one long SQL script and leave it running
overnight, loading, dropping and reloading 50 times or more.
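
As an illustration of the pattern, each iteration in the overnight script is just the block above with different parameter values. A sketch of one follow-up iteration (the 80 threads and 100000 batch size here are arbitrary examples, not recommendations):

DROP TABLE "SCHEMA"."TABLE";
CREATE COLUMN TABLE "SCHEMA"."TABLE" (
"TIME" TIME,
"FIELD1" CHAR(1),
"FIELD2" CHAR(16),
"FIELD3" CHAR(4),
"MEASURE1" INT,
"MEASURE2" DECIMAL(7,4))
PARTITION BY ROUNDROBIN PARTITIONS 15;
ALTER TABLE "SCHEMA"."TABLE" DISABLE AUTOMERGE;
IMPORT FROM '/sapmnt/log/test.ctl' WITH THREADS 80 BATCH 100000;
MERGE DELTA OF "SCHEMA"."TABLE";
ALTER TABLE "SCHEMA"."TABLE" ENABLE AUTOMERGE;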

What's HANA speed? In my example, down from 5730 seconds to 60 seconds - a 95x
performance increase. That's nearly 6m fully committed rows/second and 500MB/sec of
throughput. On a single-node Medium appliance with 40 cores, you should expect
around 275MB/sec of throughput into HANA. Either way, we are around the practical
throughput of 10Gb Ethernet.

Anyhow in most cases, control file loads aren't a good way to load into HANA long term,
because they lack flexibility and the ability to process bad data. But now you have an
upper bound for how fast you can ingest. The Data Services team tells me they look to
get performance that approaches the HANA bulk loader, for example.

4) Design the right Architecture

Now you know enough about your data to design the right loading architecture for
HANA. Note one really important thing: I like to think of HANA as a hungry caterpillar
(reminds me of The Cure song Lullaby, but anyhow). In a well-designed environment, how
fast you get data into HANA is almost always limited by how fast you can supply it.

This means that you need fast storage where the data to be loaded resides, fast
application server threads to process it, and fast plumbing - always 10Gb Ethernet
- to move the data around. If you don't have all of this, your HANA in-memory appliance will
go to waste. Don't think you can run Data Services on a VMware farm, for example,
and expect to move data around at more than 80MB/sec.

From there, the choice of tool, or tools, is critical. Use this table as a rule of thumb.

Use Case                 Control File   Data Services   LT   SRS   ESP
Benchmarking and PoCs    X              O               O    O     O
Batch Loads              O              X               -    -     O
Transformations          -              X               O    O     X
Fixing Data Quality      -              X               -    -     O
Streaming Data           -              -               -    -     X
Real-Time Replication    -              O               X    O     O
Replication From SAP     -              O               X    O     O

Tell me what else you want to see here!

X - core purpose
O - offers the functionality
"-" - does not offer the functionality

a) Control Files

As I've said before, Control Files are great for benchmarking and PoCs. And for beating
up other ETL teams to get near their performance. After that, you aren't likely to use
them much in a productive system.

b) Data Services

Data Services is a great general purpose batch tool. It offers real-time web services but
there are better options if you want real-time. It also allows great data transformation,
matching and enriching, but every additional transform you add to your batch job will
slow it down, so be careful.

In addition, if you are loading files in weird formats, like fixed-format files, then Data
Services is great. There are lots of tips and tricks for SAP HANA to be found at
http://wiki.sdn.sap.com/wiki/display/EIM/Data+Services but there are 3 golden rules I
apply:

i) Do major transforms on the way into SAP HANA, not inside SAP HANA. This advice will
change as the Data Services ETL engine is integrated into SAP HANA,
but for now, get your data formats right (e.g. combine date and time into a timestamp) on
the way in, rather than using materialized views in SAP HANA, which is expensive.

ii) Dial in the main parameters - Parallel Process threads in the file loader, bulk loading
on the target table, and the commit size and number of loaders in the HANA table
loader - for your scenario. I start with 40 file loader threads, a 10,000 commit size and 4
HANA loaders, and tweak from there. Data Services takes care of optimized merge
deltas, which is very cool.

iii) Make sure you have plenty of power on the Data Services box - you can easily
consume 16-20 x86 cores - make sure the threads are fast, and make sure you
have 10Gb Ethernet between Data Services and HANA.

c) SAP Landscape Transformation (LT)

LT is a very neat solution for replicating tables from SAP systems. It is trigger-based, so
there is a small overhead, and it is therefore suited to low- to mid-volume transactional
tables like sales orders. It can also be used to replicate tables from non-SAP databases like
Oracle, which makes sense in a SAP shop. If you're not a SAP shop then you wouldn't
use LT, because it requires a SAP NetWeaver platform (either standalone for LT, or built
into another SAP product like the Business Suite).

In these scenarios, LT is easy to set up to replicate tables, and this is especially cool
with SAP's Rapid Deployment Solutions (RDS), which provide pre-packaged HANA
content that consumes database tables replicated by LT.

The main things to note with LT are the need for a Unicode LT system (it can be used
against a non-Unicode source ERP) and the need to patch your ERP system to a recent
kernel, which may trigger a regression test for your production environment. Also note
that ERP, LT and HANA must be on a dedicated 10Gb Ethernet network for
acceptable latency.

d) Sybase Replication Server (SRS)

SRS was originally the replication technology of choice for SAP ERP scenarios, before
the advent of LT. It has several benefits over LT because it scrapes database logs,
which gives it lower latency and overhead than trigger-based LT. Stock exchanges in
capital markets use SRS to replicate trading data between continents - it's that fast.

You do, however, need to have a full Enterprise license for your source DB, and many
SAP customers buy a restricted runtime license via the Software Application Value
(SAV) program, which allows them to rent their Oracle or IBM database for 8-15% of
their Application Value. If you fall into this category, then you cannot legally use SRS.

In addition, because it scrapes database logs, SRS has some technical dependencies,
such as Unicode and support for only certain database versions and platforms.

But if what you want is a high-performance sidecar of an existing transactional database
into SAP HANA, then SRS is a very interesting, if niche, product.

e) Sybase Event Stream Processing (ESP)

ESP is the dark horse of the ETL family because it can do a bit of everything, and some
things it does very well. Its core purpose is stream processing, and you can see the
number of different streams it supports here: SyBooks Online - including Tibco
Rendezvous, Capital Markets feeds like Wombat and FIX, and flat files.

Its flexibility is excellent: you can read fixed-format, CSV, FIX, XML or almost any other
kind of data and output it to a stream, or to a database or file adapter. What's more,
ESP is being integrated into the HANA database, so you will be able to do all of this in-
memory. ESP is low-latency and can shift a very large number of messages per second (I
hear millions), depending on the hardware and network you have at your disposal.
You can also keep a buffer of data in memory for time-based stream processing.

As with any stream processor, when you hit saturation you will start to buffer incoming
or outbound messages in RAM, then, when RAM fills up, spill to disk, and then bad
things happen. As with Data Services, you can adjust the number of inbound and
outbound feeds to get the performance you need, and adjust the output batch size
accordingly. Note that ESP is very well integrated with the SAP HANA ODBC interface,
and you can expect excellent performance.

5) Benchmark and iterate

Now that you have chosen a product for your loading, you need to benchmark and iterate to
get as close to the control file performance as possible.

In my tests, I found that the bulk loader performed about 2x as well as anything else, but
it shares CPU capacity with SAP HANA. I expect that by putting Data Services or
ESP onto a separate system with a low-latency, high-bandwidth 10Gb Ethernet network, you can
get close to 70-80% of bulk loading performance.

Conclusions

A number of people I spoke to suggested that a set of best practices for data loads
wasn't possible, because every customer is different. I'd counter that a set of best
practices is exactly what's required to understand which tool should be used and when,
depending on a potentially complex set of criteria. My first conclusion is that, with some
thought, it is simple to select the right tool.

The second learning is that each data set performs slightly differently with HANA,
because of the complexities of massively parallel processing, columnar compression,
column widths and complexity, plus the need to transform data. Once you have
selected the correct tool, it is necessary to iterate to find the optimal settings - and iterating
is well worth it.

The third learning is not to compare HANA bulk loading to Oracle bulk loading. HANA
bulk loading produces fully report-ready data. Oracle bulk loading literally dumps data
into database rows, with no indexes, aggregates or savepoint, and it breaks the ACID
properties of the database until a backup is made. There is no point in making this
comparison.

And a big shout out to:

The IBM X5 appliance that I use for testing. It is a 4x512GB = 2TB, 160 core appliance
in a 22U rack.

All my contacts at SAP who have assisted in this, including Werner Daehn, David
Rosenblum, Werner Steyn, Stefan Sigg, Keith Wood and Tom Ziek.

Lloyd Palfrey at Bluefin who helped out on the HANA appliance side and Kevin Turnbull
and team for creating virtual machines at short notice and generally putting up with me.
