
Journey of Schema into Big Data World

Schemas have a long history in the software world. Most of their popularity can be
attributed to relational databases like Oracle.
In the RDBMS world, the schema is your best friend. You create a schema, perform operations
through that schema, and hardly care about the underlying storage (unless you are a DBA). In
fact, the schema has to be created before even a single byte of data is written, which is why
the relational approach is called schema-on-write.
In the Big Data world, specifically Hadoop, this changed. Hadoop is primarily a storage
platform, and with the diminishing popularity of MapReduce it has become even more so. Hadoop
uses a schema type called schema-on-read, in which you do not bother about the schema until
you are ready to read the data. When the time comes to read the data, you need to figure out a
schema, because you have to give some structure to even the most unstructured data before you
can make any use of it.
In Hadoop, data is stored in the form of blocks on slave machines. This block, which is
64 MB by default and can be as big as 1 GB in a lot of deployments, needs to be sliced into
digestible pieces. Each digestible piece is called a record, and the format you use to slice
the data is called an InputFormat. The image below gives a good idea of this using pizza
eating as an analogy.
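To make this concrete, here is a minimal sketch (assuming a Spark shell where sc is the SparkContext and the HDFS path is a hypothetical example file) that asks Hadoop's default TextInputFormat to slice the blocks of a file into records, one record per line:

scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> import org.apache.hadoop.mapred.TextInputFormat
scala> val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://localhost:9000/user/hduser/person")
scala> records.map(_._2.toString).take(2).foreach(println) // the first two records (lines) of the file

sc.textFile is simply a shorthand for the same call.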

This InputFormat, which by default is TextInputFormat, can be thought of as the earliest form
of schema in Hadoop MapReduce. Users were not very interested in writing MapReduce code
directly, and that led to the creation of Hive. Hive had its own schema-on-read, but it was a
real relational schema, i.e. metadata. Hive stores this metadata in a separate relational
store called the metastore.
Hive definitely made work easier, but managing the schema separately was still not a big data
developer's favorite.
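To see what "separately" means, here is a minimal sketch (assuming a Spark shell built with Hive support; the table name, columns, and HDFS location are hypothetical). The CREATE statement only writes metadata into the metastore, while the data itself stays behind as plain files at the given location:

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS person (first_name STRING, last_name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs://localhost:9000/user/hduser/person_csv'")
scala> hiveContext.sql("SELECT first_name FROM person").show() // the schema comes from the metastore, not from the files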
Slowly came formats like Avro and Parquet, which embed the schema in the same files that hold
the data. These formats bring plenty of other benefits, but having the schema embedded in them
is what set them apart from earlier formats. JSON, which became immensely popular on its own
(mostly as a replacement for XML), brings the same benefit of schema combined with data and is
also used extensively in big data systems.
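As a quick sketch in the Spark shell (the JSON record below is made up for illustration), Spark can recover the schema directly from such self-describing data, with no metastore involved:

scala> val json = sc.parallelize(Seq("""{"first_name":"John","last_name":"Doe"}"""))
scala> val people = sqlContext.jsonRDD(json)
scala> people.printSchema() // column names and types inferred from the JSON itself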
This trend is also reflected in the evolution of Apache Spark, the big data compute framework
InfoObjects has a specific focus on. Spark started with the Resilient Distributed Dataset,
i.e. the RDD, as its unit of compute as well as its unit of in-memory storage. Spark had to
address the need to work with relational queries early on. Initially it came up with Shark,
which was Hive on Spark. Shark had its challenges, and that led to the creation of Spark SQL
a year back.
Spark SQL initially had a unit of compute called SchemaRDD, which was essentially an RDD with
a schema put on top of it. From Spark 1.3 onwards, SchemaRDDs evolved into DataFrames. There
is now a lot of work being done in Spark to treat DataFrames as first-class objects, the
latest being the Spark ML module.
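As an illustration, a minimal sketch in the Spark 1.3+ shell (Person is a hypothetical case class) shows that a DataFrame really is just an RDD with a schema laid on top of it:

scala> case class Person(first_name: String, last_name: String)
scala> import sqlContext.implicits._
scala> val df = sc.parallelize(Seq(Person("John", "Doe"))).toDF() // RDD[Person] + schema = DataFrame
scala> df.printSchema()
scala> df.rdd // the underlying RDD is still there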
In summary, the evolution of storage toward Parquet-type formats and the evolution of compute
toward DataFrames have made the life of a big data developer very easy. Now, with one command
a JSON file can be loaded into a DataFrame, and with another command it can be saved in
Parquet format.

scala> val df = sqlContext.load("hdfs://localhost:9000/user/hduser/person", "json")
scala> df.select("first_name", "last_name").save("person.parquet", "parquet")
