
BI Forum 2013 Master Class

Introduction to Oracle Data Integrator (ODI)
Mark Rittman, Technical Director, Rittman Mead

T : +1 (888) 631-1410 E : enquiries@rittmanmead.com W : www.rittmanmead.com

Big Data, Hadoop and Unstructured Data Sources

Big data is the hot topic in BI, DW and Analytics circles
The ability to harness vast datasets, at a highly-granular level, using massively-parallel computing
Crunching loosely-structured and modelled datasets using simple algorithms: Map (project) + Reduce (aggregate)
Largely based around open-source projects and non-relational technologies
Apache Hadoop
MapReduce
Hadoop Distributed File System
Apache Hive, Sqoop, HBase etc
Emerging commercial vendors
Cloudera
Hortonworks etc
Can be used standalone, or linked to an enterprise DW/BI architecture


Oracle's Strategy for Business Analytics

Connect to all of your data, from all your sources
Subject it to the full range of possible inquiry
Package solutions for known problems and fixed sources
Deploy to PCs and mobile devices, on premise or in the cloud

(Slide graphic: four pillars - Any Data, Any Source; Full Range of Analytics; Integrated Analytic Apps; On Premise, On Cloud, On Mobile)

Connect to All of Your Data, From All of Your Sources

As well as traditional application and database file sources, unstructured and big data sources are within scope for business decision-making
Data of great volume, great velocity and great variety

(Slide graphic: "Any Data, Any Source" - Your Data : decisions based on your data; Big Data : decisions based on all data relevant to you; source types span transactions, documents & social data, and machine-generated data)

Oracle's Big Data Products

Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing
Cloudera Distribution of Hadoop
Cloudera Manager
Open-source R
Oracle NoSQL Database Community Edition
Oracle Enterprise Linux + Oracle JVM
Oracle Big Data Connectors
Oracle Loader for Hadoop (Hadoop > Oracle RDBMS)
Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS)
Oracle Data Integration Adapter for Hadoop
Oracle R Connector for Hadoop
Oracle NoSQL Database (key/value store DB based on BerkeleyDB)


ODI as Part of Oracle's Big Data Strategy

ODI is the data integration tool for extracting data from Hadoop/MapReduce, and loading it into Oracle Big Data Appliance, Oracle Exadata and Oracle Exalytics
Oracle Application Adapter for Hadoop provides the required data adapters
Load data into Hadoop from the local filesystem, or HDFS (Hadoop clustered filesystem)
Read data from Hadoop/MapReduce using Apache Hive (JDBC) and HiveQL, then load into the Oracle RDBMS using Oracle Loader for Hadoop
Supported by Oracle's Engineered Systems
Exadata
Exalytics
Big Data Appliance (w/Cloudera Hadoop Distribution)


How ODI Accesses Hadoop and MapReduce

ODI accesses data in Hadoop clusters through Apache Hive
A metadata and query layer over MapReduce
Provides a SQL-like language (HiveQL) and a metadata store (data dictionary)
Provides a means to define "tables", into which file data is loaded, and then queried via MapReduce
Accessed via the Hive JDBC driver (separate Hadoop install required on the ODI server, for client libs)
Additional access through Oracle Direct Connector for HDFS and Oracle Loader for Hadoop
Direct-path loads using Oracle Loader for Hadoop, with transformation logic in MapReduce

(Slide diagram: ODI 11g sends HiveQL to the Hive Server on the Hadoop cluster, which runs MapReduce jobs; results are loaded into the Oracle RDBMS)

Oracle Business Analytics and Big Data Sources

OBIEE 11g, and other Oracle Business Analytics tools, can also make use of big data sources
Oracle Exalytics, through in-memory aggregates and an InfiniBand connection to Exadata, can analyze vast (structured) datasets held in relational and OLAP databases
Endeca Information Discovery can analyze unstructured and semi-structured sources
InfiniBand connection to Big Data Appliance + the Hadoop connector in OBIEE supports analysis via MapReduce
Oracle R distribution + Oracle R Enterprise support SAS-style statistical analysis of large datasets, as part of the Oracle Advanced Analytics Option
OBIEE can access Hadoop data sources through another Apache technology called Hive


Opportunities for OBIEE and ODI with Big Data Sources and Tools

Load data from a Hadoop/HDFS/NoSQL environment into a structured DW for analysis
Provide OBIEE as an alternative to Java coding or HiveQL for analysts
Leverage Hadoop & HDFS for massively-parallel staging-layer number crunching
Make use of low-cost, fault-tolerant hardware for parts of your BI platform
Provide the reporting and analysis for customers who have bought Oracle Big Data Appliance


What is Hadoop?

Apache Hadoop is one of the most well-known Big Data technologies
A family of open-source products used to store and analyze distributed datasets
Hadoop is the enabling framework; it automatically parallelises and co-ordinates jobs
Moves the compute to the data
MapReduce is the programming framework for filtering, sorting and aggregating data
Map : filter and interpret input data, creating key/value pairs
Reduce : summarise and aggregate
MapReduce jobs can be written in any language (Java etc), but it is complicated


What is HDFS?

The filesystem behind Hadoop, used to store data for Hadoop analysis
Unix-like, uses commands such as ls, mkdir, chown, chmod
Fault-tolerant, with rapid fault detection and recovery
High-throughput, with streaming data access and large block sizes
Designed for data-locality, placing data close to where it is processed
Accessed from the command-line, via internet (hdfs://), GUI tools etc

[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -ls
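The other commands mentioned above work the same way; a short sketch (the paths here are assumptions):

[oracle@bigdatalite mapreduce]$ hadoop fs -chown oracle /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -chmod 755 /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -cat hdfs://localhost.localdomain/user/oracle/my_stuff/u.data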


Hive as the Hadoop Data Warehouse

MapReduce jobs are typically written in Java, but Hive can make this simpler
Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
The Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, and automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables
Approach used by ODI and OBIEE to gain access to Hadoop data
Allows Hadoop data to be accessed just like any other data source (sort of...) - see the sketch below
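A minimal HiveQL sketch of the idea, using the MovieLens ratings data that appears later in these slides (the column names are assumptions):

CREATE TABLE movie_ratings (user_id INT, movie_id INT, rating INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/oracle/movielens_src/u.data' INTO TABLE movie_ratings;

SELECT rating, count(*) FROM movie_ratings GROUP BY rating;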


Hive Data and Metadata

Hive uses an RDBMS metastore to hold table and column definitions in schemas
Hive tables then map onto HDFS-stored files:
Managed tables
External tables
Oracle-like query optimizer, compiler and executor
JDBC and ODBC drivers, plus CLI etc

(Slide diagram: the Hive driver (compile, optimize, execute) sits over the metastore and HDFS)

Managed tables : HDFS or local files are created and loaded into the Hive HDFS area (/user/hive/warehouse/) via HiveQL (CREATE TABLE, LOAD DATA)
External tables : HDFS files are loaded into HDFS (e.g. /user/oracle/, /user/movies/data/) using an external process, then mapped into Hive using the CREATE EXTERNAL TABLE command - see the HiveQL sketch below
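A short HiveQL sketch of the two table types, reusing table names from the demo environment (the columns and delimiters are assumptions):

-- Managed table: Hive owns the data under /user/hive/warehouse/
CREATE TABLE src_customer (cust_id INT, cust_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive maps over files that stay where they were put
CREATE EXTERNAL TABLE movie_data (movie_id INT, title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/movies/data/';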

Transforming HiveQL Queries into MapReduce Jobs

HiveQL queries are automatically translated into Java MapReduce jobs
The selection and filtering part becomes the Map tasks
The aggregation part becomes the Reduce tasks

SELECT a, sum(b)
FROM myTable
WHERE a<100
GROUP BY a

(Slide diagram: the SELECT/WHERE clauses fan out across parallel Map tasks, the GROUP BY aggregation runs in the Reduce tasks, and the outputs are combined into the result)


An example Hive Query Session: Connect and Display Table List


[oracle@bigdatalite ~]$ hive
Hive history file=/tmp/oracle/hive_job_log_oracle_201304170403_1991392312.txt
hive> show tables;
OK
dwh_customer
dwh_customer_tmp
i_dwh_customer
ratings
src_customer
src_sales_person
weblog
weblog_preprocessed
weblog_sessionized
Time taken: 2.925 seconds

The Hive server lists out all tables that have been defined within the Hive environment


An example Hive Query Session: Display Table Row Count


hive> select count(*) from src_customer;

Request count(*) from the table; the Hive server generates a MapReduce job to map the table into key/value pairs, and then reduce the results down to the table count

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201303171815_0003, Tracking URL =
http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0003
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0003
OK
25
Time taken: 22.21 seconds

The MapReduce job is run automatically by the Hive server, and the results are returned to the user

OBIEE and ODI Access to Hive, Leveraging MapReduce with no Java Coding

Requests in HiveQL arrive via HiveODBC, HiveJDBC or through the Hive command shell
JDBC and ODBC access requires the Thrift server
Provides an RPC call interface over Hive for external processes
All queries then get parsed, optimized and compiled, then sent to the Hadoop NameNode and Job Tracker
Hadoop then processes the query, generating MapReduce jobs and distributing them to run in parallel across all data nodes
Hadoop access can still be performed procedurally if needed, typically coded by hand in Java, or through Pig etc
The equivalent of PL/SQL compared to SQL
But Hive works well with the OBIEE/ODI paradigm


Complementary Technologies: HDFS, Cloudera Manager, Hue, Beeswax etc

You can download your own Hive binaries, libraries etc from the Apache Hadoop website
Or use pre-built VMs and distributions from the likes of Cloudera
Cloudera CDH3/4 is used on Oracle Big Data Appliance
Open-source + proprietary tools (Cloudera Manager)
Other tools for managing Hive, HDFS etc include
Hue (HDFS file browser + management)
Beeswax (Hive administration + querying)
Other complementary/required Hadoop tools (a Sqoop import is sketched below)
Sqoop
HDFS
Thrift
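As a pointer, a typical Sqoop import from an Oracle database into Hive looks like the following (the connection details and table name are assumptions):

$ sqoop import \
    --connect jdbc:oracle:thin:@//dbserver:1521/orcl \
    --username SCOTT \
    --table SRC_CUSTOMER \
    --hive-import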


Demonstration
Simple Data Selection and Querying using Hive on Cloudera CDH3


ODI + Big Data Examples : Providing the Bridge Between Hadoop + OBIEE

OBIEE now has the ability to report against Hadoop data, via Hive
Assumes that data is already loaded into the Hive warehouse tables
ODI therefore can be used to load the Hive tables, through either:
Loading Hive from files
Joining and loading from Hive to Hive
Loading and transforming via shell scripts (Python, Perl etc)
ODI could also extract the Hive data and load it into Oracle, if more appropriate


Configuring ODI 11.1.1.6+ for Hadoop Connectivity

Obtain an installation of Hadoop/Hive from somewhere (Cloudera CDH3/4, for example)
Copy the following files into a temp directory, archive them, and transfer the archive to the ODI environment (see the sketch below):
$HIVE_HOME/lib/*.jar
$HADOOP_HOME/hadoop-*-core*.jar
$HADOOP_HOME/hadoop-*-tools*.jar

for example...
/usr/lib/hive/lib/*.jar
/usr/lib/hadoop-0.20/hadoop-*-core*.jar
/usr/lib/hadoop-0.20/hadoop-*-tools*.jar

Copy the JAR files into the userlib directory and the (standalone) agent lib directory
c:\Users\Administrator\AppData\Roaming\odi\oracledi\userlib

Restart ODI Studio
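A minimal sketch of the gather-and-archive step on the Hadoop node, assuming the CDH3 paths above (the temp directory and archive name are assumptions):

$ mkdir /tmp/odi_hadoop_jars
$ cp /usr/lib/hive/lib/*.jar /tmp/odi_hadoop_jars/
$ cp /usr/lib/hadoop-0.20/hadoop-*-core*.jar /usr/lib/hadoop-0.20/hadoop-*-tools*.jar /tmp/odi_hadoop_jars/
$ tar czf odi_hadoop_jars.tgz -C /tmp odi_hadoop_jars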



Registering HDFS and Hive Sources and Targets in the ODI Topology

For Hive sources and targets, use the Hive technology
JDBC Driver : Apache Hive JDBC Driver
JDBC URL : jdbc:hive://[server_name]:10000/default
(Flexfield Name) Hive Metastore URIs : thrift://[server_name]:10000

For HDFS sources, use the File technology
JDBC URL : hdfs://[server_name]:port
A special HDFS trick to use the File technology (there is no specific HDFS technology)


Reverse-Engineering Hive, HDFS and Local File Datastores + Models

Hive tables reverse-engineer just like regular tables
Define the model in the Designer navigator; it uses the RKM Hive to retrieve table metadata
Information on Hive-specific metadata is stored in flexfields:
Hive Buckets
Hive Partition Column
Hive Cluster Column
Hive Sort Column


Demonstration
ODI 11.1.1.6 Configured for Hadoop Access, with Hive/HDFS sources and targets registered


ODI Application Adapter for Hadoop KMs

The Application Adapter for Hadoop connectivity is a pay-extra option
Works for both Windows and Linux installs of ODI Studio
Need to source the Hive JDBC drivers and JARs from a separate Hadoop install
Provides six new knowledge modules:
IKM File to Hive (Load Data)
IKM Hive Control Append
IKM Hive Transform
IKM File-Hive to Oracle (OLH)
CKM Hive
RKM Hive


Oracle Loader for Hadoop

Oracle technology for accessing Hadoop data, and loading it into an Oracle database
Pushes data transformation and heavy lifting to the Hadoop cluster, using MapReduce
Direct-path loads into the Oracle Database, both partitioned and non-partitioned
Online and offline loads
A key technology for fast loading of Hadoop results into the Oracle DB


IKM File to Hive (Load Data) : Loading Hive Tables from Local Files or HDFS

Uses the Hive LOAD DATA command to load from local or HDFS files
Calls Hadoop FS commands for simple copy/move into/around HDFS
Commands are generated by ODI through IKM File to Hive (Load Data)

hive> load data inpath '/user/oracle/movielens_src/u.data'
    > overwrite into table movie_ratings;
Loading data to table default.movie_ratings
Deleted hdfs://localhost.localdomain/user/hive/warehouse/movie_ratings
OK
Time taken: 0.341 seconds


IKM File to Hive (Load Data) : Loading Hive Tables from Local Files or HDFS

IKM File to Hive (Load Data) generates the required HiveQL commands using a script template
Executed over the Hive JDBC interface
Success/Failure/Warning is returned to ODI


Load Data and Hadoop SerDe (Serializer-Deserializer) Transformations

Hadoop SerDe transformations can be accessed, for example to transform weblogs
A Hadoop interface that contains:
Deserializer - converts incoming data into Java objects for Hive manipulation
Serializer - takes Hive Java objects and converts them to output for HDFS
A library of SerDe transformations is readily available for use with Hive
Use the OVERRIDE_ROW_FORMAT option in the IKM to override the regular column mappings in the Mapping tab - see the sketch below
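For reference, the kind of SerDe declaration this enables - here the Hive contrib RegexSerDe parsing a simplified weblog format (the table columns and regex are assumptions):

-- Each capture group in input.regex maps to one table column
CREATE TABLE weblog (host STRING, request STRING, status STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) \"([^\"]*)\" ([0-9]*)"
);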


IKM Hive Control Append : Loading, Joining & Filtering Between Hive Tables

Hive source and target, with transformations according to HiveQL functionality (aggregations, functions etc)
Ability to join data sources
Other data sources can be used, but will involve staging tables and additional KMs (as per any multi-source join) - a sketch of the generated HiveQL follows
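A sketch of the style of HiveQL such a mapping generates, reusing table names from the demo environment (the join and filter columns are assumptions):

INSERT INTO TABLE dwh_customer
SELECT c.cust_id, c.cust_name, s.sales_person_name
FROM src_customer c
JOIN src_sales_person s ON (c.sales_person_id = s.sales_person_id)
WHERE c.cust_id IS NOT NULL;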


IKM Hive Transform : Use Custom Shell Scripts to Integrate into Hive Tables

Gives the developer the ability to transform data programmatically, using Python, Perl etc scripts
Options to map the output of the script to columns in the Hive table
Useful for more programmatic and complex data transformations - a sketch follows
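Under the covers this builds on Hive's TRANSFORM clause; a sketch with a hypothetical Python script that sessionizes weblog rows (the script name and columns are assumptions):

-- Ship the script to the cluster, then pipe rows through it
ADD FILE /home/oracle/scripts/sessionize.py;

INSERT OVERWRITE TABLE weblog_sessionized
SELECT TRANSFORM (host, request)
USING 'python sessionize.py'
AS (host, request, session_id)
FROM weblog_preprocessed;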


IKM File-Hive to Oracle : Extract from Hive into Oracle Tables

Uses Oracle Loader for Hadoop (OLH) to process any filtering, aggregation and transformation in Hadoop, using MapReduce
OLH is part of Oracle Big Data Connectors (additional cost)
High-performance loader into the Oracle DB
Optional sort by primary key, and pre-partitioning of data
Can utilise the two OLH loading modes (a standalone OLH invocation is sketched below):
JDBC or OCI direct load into Oracle
Unload to files, then Oracle Data Pump into the Oracle DB
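For reference, OLH runs as a MapReduce job submitted from the command line; a sketch, where the configuration file holding the connection and mapping details is an assumption:

$ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    -conf movie_ratings_olh_conf.xml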


Demonstration
Data Integration Tasks using ODIAAH Hadoop KMs


NoSQL Data Sources and Targets with ODI 11g

There is no specific technology or driver for NoSQL databases, but Hive external tables can be used
Requires a specific Hive storage handler for key/value store sources
A Hive feature for accessing data from other DB systems, for example MongoDB, Cassandra
For example, https://github.com/vilcek/HiveKVStorageHandler
Additionally needs the Hive collect_set aggregation method to aggregate results
This has to be defined in the Languages panel in the Topology - a sketch of the pattern follows
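A sketch of the pattern; the storage handler class would come from the project linked above, and the class name, table and columns here are assumptions:

CREATE EXTERNAL TABLE kv_ratings (user_id STRING, rating STRING)
STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler';

-- collect_set aggregates the multiple values held against each key
SELECT user_id, collect_set(rating) FROM kv_ratings GROUP BY user_id;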


Pig, Sqoop and other Hadoop Technologies, and Hive

Future versions of ODI might use other Hadoop technologies
Apache Sqoop for bulk transfer between Hadoop and RDBMSs
Other technologies are not such an obvious fit
Apache Pig - the equivalent of PL/SQL for Hive's SQL
Commercial vendors may produce better versions of Hive, MapReduce etc
Cloudera Impala - a more real-time version of Hive
MapR - solves many current issues with MapReduce, with 100% Hadoop API compatibility
Watch this space...!


BI Forum 2013 Master Class


Introduction to Oracle Data Integrator (ODI)
Mark Rittman, Technical Director, Rittman Mead

T : +1 (888) 631-1410 E : enquiries@rittmanmead.com W : www.rittmanmead.com
