
BI Forum 2013 Master Class

Introduction to Oracle Data Integrator (ODI)
Mark Rittman, Technical Director, Rittman Mead

T : +1 (888) 631-1410 E : enquiries@rittmanmead.com W : www.rittmanmead.com

Big Data, Hadoop and Unstructured Data Sources

Big data is the hot topic in BI, DW and Analytics circles
The ability to harness vast datasets, at a highly-granular level, using massively-parallel computing
Crunching loosely-structured and modelled datasets using simple algorithms: Map (project) + Reduce (aggregate)
Largely based around open-source projects and non-relational technologies
Apache Hadoop
MapReduce
Hadoop Distributed File System
Apache Hive, Sqoop, HBase etc
Emerging commercial vendors
Cloudera
Hortonworks etc
Can be used standalone, or linked to an enterprise DW/BI architecture


Oracle's Strategy for Business Analytics

Connect to all of your data, from all your sources
Subject it to the full range of possible inquiry
Package solutions for known problems and fixed sources
Deploy to PCs and mobile devices, on premise or in the cloud

(Slide graphic: four pillars - Any Data, Any Source; Full Range of Analytics; Integrated Analytic Apps; On Premise, On Cloud, On Mobile)

Connect to All of Your Data, From All of Your Sources

As well as traditional application and database file sources, unstructured and big data sources are within scope for business decision-making
Data of great volume, great velocity and great variety

(Slide graphic: "Any Data, Any Source" - Your Data : decisions based on your data; Big Data : decisions based on all data relevant to you; source types span transactions, documents & social data, and machine-generated data)

Oracle's Big Data Products

Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing
Cloudera Distribution of Hadoop
Cloudera Manager
Open-source R
Oracle NoSQL Database Community Edition
Oracle Enterprise Linux + Oracle JVM
Oracle Big Data Connectors
Oracle Loader for Hadoop (Hadoop > Oracle RDBMS)
Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS)
Oracle Data Integration Adapter for Hadoop
Oracle R Connector for Hadoop
Oracle NoSQL Database (key/value store DB based on BerkeleyDB)


ODI as Part of Oracle's Big Data Strategy

ODI is the data integration tool for extracting data from Hadoop/MapReduce, and loading it into Oracle Big Data Appliance, Oracle Exadata and Oracle Exalytics
Oracle Application Adapter for Hadoop provides the required data adapters
Load data into Hadoop from the local filesystem, or HDFS (Hadoop clustered filesystem)
Read data from Hadoop/MapReduce using Apache Hive (JDBC) and HiveQL, then load into the Oracle RDBMS using Oracle Loader for Hadoop
Supported by Oracle's Engineered Systems
Exadata
Exalytics
Big Data Appliance (w/Cloudera Hadoop Distribution)


How ODI Accesses Hadoop and MapReduce

ODI accesses data in Hadoop clusters through Apache Hive
A metadata and query layer over MapReduce
Provides a SQL-like language (HiveQL) and a metadata store (data dictionary)
Provides a means to define "tables", into which file data is loaded, and then queried via MapReduce
Accessed via the Hive JDBC driver (separate Hadoop install required on the ODI server, for client libs)
Additional access through Oracle Direct Connector for HDFS and Oracle Loader for Hadoop
Direct-path loads using Oracle Loader for Hadoop, with transformation logic in MapReduce

(Slide diagram: ODI 11g sends HiveQL to the Hive Server on the Hadoop cluster, which runs MapReduce jobs; results are loaded into the Oracle RDBMS)

Oracle Business Analytics and Big Data Sources

OBIEE 11g, and other Oracle Business Analytics tools, can also make use of big data sources
Oracle Exalytics, through in-memory aggregates and an InfiniBand connection to Exadata, can analyze vast (structured) datasets held in relational and OLAP databases
Endeca Information Discovery can analyze unstructured and semi-structured sources
InfiniBand connection to Big Data Appliance + the Hadoop connector in OBIEE supports analysis via MapReduce
Oracle R distribution + Oracle R Enterprise support SAS-style statistical analysis of large datasets, as part of the Oracle Advanced Analytics Option
OBIEE can access Hadoop data sources through another Apache technology called Hive


Opportunities for OBIEE and ODI with Big Data Sources and Tools

Load data from a Hadoop/HDFS/NoSQL environment into a structured DW for analysis
Provide OBIEE as an alternative to Java coding or HiveQL for analysts
Leverage Hadoop & HDFS for massively-parallel staging-layer number crunching
Make use of low-cost, fault-tolerant hardware for parts of your BI platform
Provide the reporting and analysis for customers who have bought Oracle Big Data Appliance


What is Hadoop?

Apache Hadoop is one of the most well-known Big Data technologies
A family of open-source products used to store and analyze distributed datasets
Hadoop is the enabling framework; it automatically parallelises and co-ordinates jobs
Moves the compute to the data
MapReduce is the programming framework for filtering, sorting and aggregating data
Map : filter and interpret input data, creating key/value pairs
Reduce : summarise and aggregate
MapReduce jobs can be written in any language (Java etc), but it is complicated


What is HDFS?

The filesystem behind Hadoop, used to store data for Hadoop analysis
Unix-like, uses commands such as ls, mkdir, chown, chmod
Fault-tolerant, with rapid fault detection and recovery
High-throughput, with streaming data access and large block sizes
Designed for data-locality, placing data close to where it is processed
Accessed from the command-line, via internet (hdfs://), GUI tools etc

[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -ls
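The other commands mentioned above work the same way; a short sketch (the paths here are assumptions):

[oracle@bigdatalite mapreduce]$ hadoop fs -chown oracle /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -chmod 755 /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -cat hdfs://localhost.localdomain/user/oracle/my_stuff/u.data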


Hive as the Hadoop Data Warehouse

MapReduce jobs are typically written in Java, but Hive can make this simpler
Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
The Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, and automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables
Approach used by ODI and OBIEE to gain access to Hadoop data
Allows Hadoop data to be accessed just like any other data source (sort of...) - see the sketch below
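A minimal HiveQL sketch of the idea, using the MovieLens ratings data that appears later in these slides (the column names are assumptions):

CREATE TABLE movie_ratings (user_id INT, movie_id INT, rating INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/oracle/movielens_src/u.data' INTO TABLE movie_ratings;

SELECT rating, count(*) FROM movie_ratings GROUP BY rating;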


Hive Data and Metadata

Hive uses an RDBMS metastore to hold table and column definitions in schemas
Hive tables then map onto HDFS-stored files:
Managed tables
External tables
Oracle-like query optimizer, compiler and executor
JDBC and ODBC drivers, plus CLI etc

(Slide diagram: the Hive driver (compile, optimize, execute) sits over the metastore and HDFS)

Managed tables : HDFS or local files are created and loaded into the Hive HDFS area (/user/hive/warehouse/) via HiveQL (CREATE TABLE, LOAD DATA)
External tables : HDFS files are loaded into HDFS (e.g. /user/oracle/, /user/movies/data/) using an external process, then mapped into Hive using the CREATE EXTERNAL TABLE command - see the HiveQL sketch below
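A short HiveQL sketch of the two table types, reusing table names from the demo environment (the columns and delimiters are assumptions):

-- Managed table: Hive owns the data under /user/hive/warehouse/
CREATE TABLE src_customer (cust_id INT, cust_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive maps over files that stay where they were put
CREATE EXTERNAL TABLE movie_data (movie_id INT, title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/movies/data/';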

Transforming HiveQL Queries into MapReduce Jobs

HiveQL queries are automatically translated into Java MapReduce jobs
The selection and filtering part becomes the Map tasks
The aggregation part becomes the Reduce tasks

SELECT a, sum(b)
FROM myTable
WHERE a<100
GROUP BY a

(Slide diagram: the SELECT/WHERE clauses fan out across parallel Map tasks, the GROUP BY aggregation runs in the Reduce tasks, and the outputs are combined into the result)


An example Hive Query Session: Connect and Display Table List


[oracle@bigdatalite ~]$ hive
Hive history file=/tmp/oracle/hive_job_log_oracle_201304170403_1991392312.txt
hive> show tables;
OK
dwh_customer
dwh_customer_tmp
i_dwh_customer
ratings
src_customer
src_sales_person
weblog
weblog_preprocessed
weblog_sessionized
Time taken: 2.925 seconds

The Hive server lists out all tables that have been defined within the Hive environment


An example Hive Query Session: Display Table Row Count


hive> select count(*) from src_customer;

Request count(*) from the table; the Hive server generates a MapReduce job to map the table into key/value pairs, and then reduce the results down to the table count

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201303171815_0003, Tracking URL =
http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0003
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0003
OK
25
Time taken: 22.21 seconds

The MapReduce job is run automatically by the Hive server, and the results are returned to the user

OBIEE and ODI Access to Hive, Leveraging MapReduce with no Java Coding

Requests in HiveQL arrive via HiveODBC, HiveJDBC or through the Hive command shell
JDBC and ODBC access requires the Thrift server
Provides an RPC call interface over Hive for external processes
All queries then get parsed, optimized and compiled, then sent to the Hadoop NameNode and Job Tracker
Hadoop then processes the query, generating MapReduce jobs and distributing them to run in parallel across all data nodes
Hadoop access can still be performed procedurally if needed, typically coded by hand in Java, or through Pig etc
The equivalent of PL/SQL compared to SQL
But Hive works well with the OBIEE/ODI paradigm


Complementary Technologies: HDFS, Cloudera Manager, Hue, Beeswax etc

You can download your own Hive binaries, libraries etc from the Apache Hadoop website
Or use pre-built VMs and distributions from the likes of Cloudera
Cloudera CDH3/4 is used on Oracle Big Data Appliance
Open-source + proprietary tools (Cloudera Manager)
Other tools for managing Hive, HDFS etc include
Hue (HDFS file browser + management)
Beeswax (Hive administration + querying)
Other complementary/required Hadoop tools (a Sqoop import is sketched below)
Sqoop
HDFS
Thrift
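As a pointer, a typical Sqoop import from an Oracle database into Hive looks like the following (the connection details and table name are assumptions):

$ sqoop import \
    --connect jdbc:oracle:thin:@//dbserver:1521/orcl \
    --username SCOTT \
    --table SRC_CUSTOMER \
    --hive-import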


Demonstration
Simple Data Selection and Querying using Hive on Cloudera CDH3


ODI + Big Data Examples : Providing the Bridge Between Hadoop + OBIEE

OBIEE now has the ability to report against Hadoop data, via Hive
Assumes that data is already loaded into the Hive warehouse tables
ODI therefore can be used to load the Hive tables, through either:
Loading Hive from files
Joining and loading from Hive to Hive
Loading and transforming via shell scripts (Python, Perl etc)
ODI could also extract the Hive data and load it into Oracle, if more appropriate


Configuring ODI 11.1.1.6+ for Hadoop Connectivity

Obtain an installation of Hadoop/Hive from somewhere (Cloudera CDH3/4, for example)
Copy the following files into a temp directory, archive them, and transfer the archive to the ODI environment (see the sketch below):
$HIVE_HOME/lib/*.jar
$HADOOP_HOME/hadoop-*-core*.jar
$HADOOP_HOME/hadoop-*-tools*.jar

for example...
/usr/lib/hive/lib/*.jar
/usr/lib/hadoop-0.20/hadoop-*-core*.jar
/usr/lib/hadoop-0.20/hadoop-*-tools*.jar

Copy the JAR files into the userlib directory and the (standalone) agent lib directory
c:\Users\Administrator\AppData\Roaming\odi\oracledi\userlib

Restart ODI Studio
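A minimal sketch of the gather-and-archive step on the Hadoop node, assuming the CDH3 paths above (the temp directory and archive name are assumptions):

$ mkdir /tmp/odi_hadoop_jars
$ cp /usr/lib/hive/lib/*.jar /tmp/odi_hadoop_jars/
$ cp /usr/lib/hadoop-0.20/hadoop-*-core*.jar /usr/lib/hadoop-0.20/hadoop-*-tools*.jar /tmp/odi_hadoop_jars/
$ tar czf odi_hadoop_jars.tgz -C /tmp odi_hadoop_jars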



Registering HDFS and Hive Sources and Targets in the ODI Topology

For Hive sources and targets, use the Hive technology
JDBC Driver : Apache Hive JDBC Driver
JDBC URL : jdbc:hive://[server_name]:10000/default
(Flexfield Name) Hive Metastore URIs : thrift://[server_name]:10000

For HDFS sources, use the File technology
JDBC URL : hdfs://[server_name]:port
A special HDFS trick to use the File technology (there is no specific HDFS technology)


Reverse-Engineering Hive, HDFS and Local File Datastores + Models

Hive tables reverse-engineer just like regular tables
Define the model in the Designer navigator; it uses the RKM Hive to retrieve table metadata
Information on Hive-specific metadata is stored in flexfields:
Hive Buckets
Hive Partition Column
Hive Cluster Column
Hive Sort Column


Demonstration
ODI 11.1.1.6 Configured for Hadoop Access, with Hive/HDFS sources and targets registered


ODI Application Adapter for Hadoop KMs

The Application Adapter for Hadoop connectivity is a pay-extra option
Works for both Windows and Linux installs of ODI Studio
Need to source the Hive JDBC drivers and JARs from a separate Hadoop install
Provides six new knowledge modules:
IKM File to Hive (Load Data)
IKM Hive Control Append
IKM Hive Transform
IKM File-Hive to Oracle (OLH)
CKM Hive
RKM Hive


Oracle Loader for Hadoop

Oracle technology for accessing Hadoop data, and loading it into an Oracle database
Pushes data transformation and heavy lifting to the Hadoop cluster, using MapReduce
Direct-path loads into the Oracle Database, both partitioned and non-partitioned
Online and offline loads
A key technology for fast loading of Hadoop results into the Oracle DB


IKM File to Hive (Load Data) : Loading Hive Tables from Local Files or HDFS

Uses the Hive LOAD DATA command to load from local or HDFS files
Calls Hadoop FS commands for simple copy/move into/around HDFS
Commands are generated by ODI through IKM File to Hive (Load Data)

hive> load data inpath '/user/oracle/movielens_src/u.data'
    > overwrite into table movie_ratings;
Loading data to table default.movie_ratings
Deleted hdfs://localhost.localdomain/user/hive/warehouse/movie_ratings
OK
Time taken: 0.341 seconds


IKM File to Hive (Load Data) : Loading Hive Tables from Local Files or HDFS

IKM File to Hive (Load Data) generates the required HiveQL commands using a script template
Executed over the Hive JDBC interface
Success/Failure/Warning is returned to ODI


Load Data and Hadoop SerDe (Serializer-Deserializer) Transformations

Hadoop SerDe transformations can be accessed, for example to transform weblogs
A Hadoop interface that contains:
Deserializer - converts incoming data into Java objects for Hive manipulation
Serializer - takes Hive Java objects and converts them to output for HDFS
A library of SerDe transformations is readily available for use with Hive
Use the OVERRIDE_ROW_FORMAT option in the IKM to override the regular column mappings in the Mapping tab - see the sketch below
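For reference, the kind of SerDe declaration this enables - here the Hive contrib RegexSerDe parsing a simplified weblog format (the table columns and regex are assumptions):

-- Each capture group in input.regex maps to one table column
CREATE TABLE weblog (host STRING, request STRING, status STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) \"([^\"]*)\" ([0-9]*)"
);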


IKM Hive Control Append : Loading, Joining & Filtering Between Hive Tables

Hive source and target, with transformations according to HiveQL functionality (aggregations, functions etc)
Ability to join data sources
Other data sources can be used, but will involve staging tables and additional KMs (as per any multi-source join) - a sketch of the generated HiveQL follows
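A sketch of the style of HiveQL such a mapping generates, reusing table names from the demo environment (the join and filter columns are assumptions):

INSERT INTO TABLE dwh_customer
SELECT c.cust_id, c.cust_name, s.sales_person_name
FROM src_customer c
JOIN src_sales_person s ON (c.sales_person_id = s.sales_person_id)
WHERE c.cust_id IS NOT NULL;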


IKM Hive Transform : Use Custom Shell Scripts to Integrate into Hive Tables

Gives the developer the ability to transform data programmatically, using Python, Perl etc scripts
Options to map the output of the script to columns in the Hive table
Useful for more programmatic and complex data transformations - a sketch follows
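Under the covers this builds on Hive's TRANSFORM clause; a sketch with a hypothetical Python script that sessionizes weblog rows (the script name and columns are assumptions):

-- Ship the script to the cluster, then pipe rows through it
ADD FILE /home/oracle/scripts/sessionize.py;

INSERT OVERWRITE TABLE weblog_sessionized
SELECT TRANSFORM (host, request)
USING 'python sessionize.py'
AS (host, request, session_id)
FROM weblog_preprocessed;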


IKM File-Hive to Oracle : Extract from Hive into Oracle Tables

Uses Oracle Loader for Hadoop (OLH) to process any filtering, aggregation and transformation in Hadoop, using MapReduce
OLH is part of Oracle Big Data Connectors (additional cost)
High-performance loader into the Oracle DB
Optional sort by primary key, and pre-partitioning of data
Can utilise the two OLH loading modes (a standalone OLH invocation is sketched below):
JDBC or OCI direct load into Oracle
Unload to files, then Oracle Data Pump into the Oracle DB
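For reference, OLH runs as a MapReduce job submitted from the command line; a sketch, where the configuration file holding the connection and mapping details is an assumption:

$ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    -conf movie_ratings_olh_conf.xml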


Demonstration
Data Integration Tasks using ODIAAH Hadoop KMs


NoSQL Data Sources and Targets with ODI 11g

There is no specific technology or driver for NoSQL databases, but Hive external tables can be used
Requires a specific Hive storage handler for key/value store sources
A Hive feature for accessing data from other DB systems, for example MongoDB, Cassandra
For example, https://github.com/vilcek/HiveKVStorageHandler
Additionally needs the Hive collect_set aggregation method to aggregate results
This has to be defined in the Languages panel in the Topology - a sketch of the pattern follows
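A sketch of the pattern; the storage handler class would come from the project linked above, and the class name, table and columns here are assumptions:

CREATE EXTERNAL TABLE kv_ratings (user_id STRING, rating STRING)
STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler';

-- collect_set aggregates the multiple values held against each key
SELECT user_id, collect_set(rating) FROM kv_ratings GROUP BY user_id;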


Pig, Sqoop and other Hadoop Technologies, and Hive

Future versions of ODI might use other Hadoop technologies
Apache Sqoop for bulk transfer between Hadoop and RDBMSs
Other technologies are not such an obvious fit
Apache Pig - the equivalent of PL/SQL for Hive's SQL
Commercial vendors may produce better versions of Hive, MapReduce etc
Cloudera Impala - a more real-time version of Hive
MapR - solves many current issues with MapReduce, with 100% Hadoop API compatibility
Watch this space...!


BI Forum 2013 Master Class


Introduction to Oracle Data Integrator (ODI)
Mark Rittman, Technical Director, Rittman Mead

T : +1 (888) 631-1410 E : enquiries@rittmanmead.com W : www.rittmanmead.com
