
Projects other than Hadoop!

Created By:-Samarjit Mahapatra


Samarjit.mahapatra@accenture.com

Mostly compatible with Hadoop/HDFS

Apache Drill - provides low latency ad-hoc queries to many different data sources,
including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000
servers and query petabytes of data in seconds.
Apache Hama - is a pure BSP (Bulk Synchronous Parallel) computing framework on top
of HDFS for massive scientific computations such as matrix, graph and network algorithms.
Akka - a toolkit and runtime for building highly concurrent, distributed, and fault
tolerant event-driven applications on the JVM.
ML-Hadoop - a Hadoop implementation of machine learning algorithms.
Shark - is a large-scale data warehouse system for Spark designed to be compatible with
Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any
modification to the existing data or queries. Shark supports Hive's query language,
metastore, serialization formats, and user-defined functions, providing seamless integration
with existing Hive deployments and a familiar, more powerful option for new ones.
Apache Crunch - a Java library that provides a framework for writing, testing, and running
MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined
functions simple to write, easy to test, and efficient to run.
Azkaban - a batch workflow job scheduler created at LinkedIn to run their Hadoop jobs.
Apache Mesos - is a cluster manager that provides efficient resource isolation and sharing
across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark,
and other applications on a dynamically shared pool of nodes.
Druid - is open source infrastructure for real-time exploratory analytics on large
datasets. The system uses an always-on, distributed, shared-nothing architecture designed
for real-time querying and data ingestion. It leverages column-orientation and advanced
indexing structures to allow for cost effective, arbitrary exploration of multi-billion-row
tables with sub-second latencies.
Apache MRUnit - a Java library that helps developers unit test Apache Hadoop MapReduce
jobs.
hiho - Hadoop data integration with various databases, FTP servers, and Salesforce.
Supports incremental update, dedup, append, and merge of your data on Hadoop.
white-elephant - a Hadoop log aggregator and dashboard which enables visualization of
Hadoop cluster utilization across users.
Tachyon - a fault-tolerant distributed file system enabling reliable file sharing at memory
speed across cluster frameworks, such as Spark and MapReduce.
HIPI - is a library for Hadoop's MapReduce framework that provides an API for
performing image processing tasks in a distributed computing environment
Cassovary - a simple big graph processing library for the JVM.
Apache Helix - is a generic cluster management framework used for the automatic
management of partitioned, replicated and distributed resources hosted on a cluster of nodes
Summingbird - streaming MapReduce with Scalding and Storm.
MongoDB - an open-source document database, and the leading NoSQL database. Written in C++.
Katta - is a scalable, failure-tolerant, distributed data store for real-time access.
Kiji - a framework for building real-time Big Data applications on Apache HBase.
MLBase - a platform addressing the issues of ML developers and end users, which
consists of three components: MLlib, MLI, and ML Optimizer.
cloud9 - is a collection of Hadoop tools that tries to make working with big data a bit
easier.
elasticsearch - a flexible and powerful open source, distributed, real-time
search and analytics engine for the cloud.
Apache Curator - is a set of Java libraries that make using Apache ZooKeeper much
easier.
Parquet - is a columnar storage format for Hadoop.
OpenTSDB - is a distributed, scalable Time Series Database (TSDB) written on top
of HBase. OpenTSDB was written to address a common need: store, index and serve metrics
collected from computer systems (network gear, operating systems, applications) at a large
scale, and make this data easily accessible and graphable.
Giraph - is an iterative graph processing system built for high scalability. For example, it
is currently used at Facebook to analyze the social graph formed by users and their
connections. Giraph originated as the open-source counterpart to Pregel, the graph
processing architecture developed at Google and described in a 2010 paper
CouchDB - is a database that uses JSON for
documents, JavaScript for MapReduce queries, and regular HTTP for an API.
DataFu - is a collection of user-defined functions for working with large-scale data in
Hadoop and Pig. This library was born out of the need for a stable, well-tested library of
UDFs for data mining and statistics.
Norbert - is a cluster manager and networking layer built on top of ZooKeeper.
Apache Samza - is a distributed stream processing framework. It uses Apache Kafka for
messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation,
security, and resource management.
Apache Kafka - is publish-subscribe messaging rethought as a distributed commit log.
Apache Whirr - is a set of libraries for running cloud services.
HUE - a File Browser for HDFS, a Job Browser for MapReduce/YARN, an HBase
Browser, query editors for Hive, Pig, Cloudera Impala and Sqoop2.
Nagios offers complete monitoring and alerting for servers, switches, applications, and
services.
Ganglia - is a scalable distributed monitoring system for high-performance computing
systems such as clusters and Grids
Apache Thrift - is a software framework for scalable cross-language services
development. It combines a software stack with a code generation engine to build services
that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl,
Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.
Prediction.io - is an open source machine learning server for software developers to
create predictive features, such as personalization, recommendation and content discovery.
CloudMapReduce - A MapReduce implementation on Amazon Cloud OS
Titan - is a scalable graph database optimized for storing and querying graphs containing
hundreds of billions of vertices and edges distributed across a multi-machine cluster.
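
The Apache Kafka entry above describes publish-subscribe messaging "rethought as a
distributed commit log". A minimal single-process sketch of that idea follows. It is an
illustration only and does not use or mirror Kafka's real API: CommitLog, append,
read_from, and poll are hypothetical names, and distribution, partitioning, and
persistence are left out entirely.

```python
class CommitLog:
    """Append-only message log (hypothetical illustration, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Store a message and return its offset (its position in the log)."""
        self._records.append(message)
        return len(self._records) - 1

    def read_from(self, offset, max_records=10):
        """Return up to max_records messages starting at the given offset."""
        return self._records[offset:offset + max_records]


def poll(log, offsets, consumer):
    """Fetch unread messages for one consumer and advance its stored offset."""
    start = offsets.get(consumer, 0)
    batch = log.read_from(start)
    offsets[consumer] = start + len(batch)
    return batch


if __name__ == "__main__":
    log = CommitLog()
    for event in ["signup", "login", "purchase"]:
        log.append(event)

    offsets = {}                           # each consumer tracks its own position
    print(poll(log, offsets, "billing"))   # reads everything written so far
    log.append("logout")
    print(poll(log, offsets, "billing"))   # reads only the new message
    print(poll(log, offsets, "audit"))     # a new consumer replays from offset 0
```

Because messages are never removed on read, each subscriber only needs to remember an
offset, which is what lets independent consumers read the same log at their own pace.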

Hadoop Alternatives

Apache Spark - an open source cluster computing system that aims to make data
analytics fast: both fast to run and fast to write.
GraphLab - a graph-based distributed computation framework with a redesigned, fully
distributed API, HDFS integration, and a wide range of new machine learning toolkits.
HPCC Systems - (High Performance Computing Cluster) is a massively parallel processing
computing platform that solves Big Data problems.
Dryad - is a Microsoft Research project investigating programming models for writing
parallel and distributed programs to scale from a small cluster to a large data center.
Stratosphere - a parallel data processing platform; its tagline is "above the cloud".
Storm - is a free and open source distributed realtime computation system. Storm makes
it easy to reliably process unbounded streams of data, doing for realtime processing what
Hadoop did for batch processing. Storm is simple, can be used with any programming
language, and is a lot of fun to use!
R3 - is a MapReduce engine written in Python using a Redis backend.
Disco - is a lightweight, open-source framework for distributed computing based on
the MapReduce paradigm.
Phoenix - is a shared-memory implementation of Google's MapReduce model for
data-intensive processing tasks.
Plasma - PlasmaFS is a distributed filesystem for large files, implemented in user space.
Plasma Map/Reduce runs the famous algorithm scheme for mapping and rearranging
large files. Plasma KV is a key/value database on top of PlasmaFS
Peregrine - is a map reduce framework designed for running iterative jobs across
partitions of data.
httpmr - A scalable data processing framework for people with web clusters.
Sector/Sphere - Sector is a high performance, scalable, and secure distributed file system.
Sphere is a high performance parallel data processing engine that can process Sector data
files on the storage nodes with very simple programming interfaces.
Filemap - is a lightweight system for applying Unix-style file processing tools to large
amounts of data stored in files.
misco - is a distributed computing framework designed for mobile devices
MR-MPI - is an open-source library implementing MapReduce for distributed-memory
parallel machines on top of standard MPI message passing.
GridGain - an in-memory computing platform.

MapReduce Alternatives

Octopy - is a fast-n-easy MapReduce implementation for Python.
Cascalog - is a fully-featured data processing and querying library for Clojure or Java.
The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing
analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and
Cascading and operates at a significantly higher level of abstraction than those tools.
Cascading is an application framework for Java developers to simply develop robust
Data Analytics and Data Management applications on Apache Hadoop
MySpace Qizmt - is a MapReduce framework for executing and developing distributed
computation applications on large clusters of Windows servers.
bashreduce - MapReduce in bash
Meguro - a simple JavaScript Map/Reduce framework
mincemeatpy - lightweight MapReduce in Python
skynet - a Ruby MapReduce framework
mapredus - a simple MapReduce framework using Redis and Resque
starfish - is a utility to make distributed programming ridiculously easy.
GPMR is a MapReduce library that leverages the power of GPU clusters for large-scale
computing.
Elastic Phoenix - An elastic MapReduce framework based on Phoenix
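
The projects in this list all implement some form of the same map, group-by-key, reduce
pattern. A minimal single-process sketch in Python is given here for orientation. It is
not taken from any of the libraries above; map_reduce, word_mapper, and count_reducer
are hypothetical names, and a real framework would distribute each phase across
machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit (key, value) pairs
            groups[key].append(value)       # shuffle phase: group values by key
    # reduce phase: collapse each key's list of values into one result
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the per-occurrence counts for one word."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["Hadoop runs MapReduce", "Spark runs fast"]
    print(map_reduce(lines, word_mapper, count_reducer))
```

Keeping the mapper and reducer as plain functions is what makes the pattern easy to
parallelize: neither depends on shared state, so records and keys can be split across
workers.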

Hadoop Eco-System

ZooKeeper - a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
Avro - a data serialization system
HBase - is the Hadoop database, a distributed, scalable, big data store.
Sqoop - is a tool designed for efficiently transferring bulk data between Apache
Hadoop and structured datastores such as relational databases.
Hive - a data warehouse system for Hadoop that facilitates easy data summarization,
ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
Hive provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL. At the same time this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL
Pig - is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these
programs
Chukwa - is an open source data collection system for monitoring large distributed
systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and
Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also
includes a flexible and powerful toolkit for displaying, monitoring and analyzing results
to make the best use of the collected data.
Flume - is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms. It uses a simple
extensible data model that allows for online analytic application.
Oozie - is a workflow scheduler system to manage Apache Hadoop jobs.
Mahout - scalable machine learning libraries.
Ambari is aimed at making Hadoop management simpler by developing software for
provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an
intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
HCatalog - is a set of interfaces that open up access to Hive's metastore for tools inside
and outside of the Hadoop grid.
Cassandra - is a database for cases where you need scalability and high availability
without compromising performance. Linear scalability and proven fault-tolerance on
commodity hardware or cloud infrastructure make it a strong platform for
mission-critical data.
Hadoop - is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high availability, the library itself is
designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may be prone to
failures.
Bigtop - is a project for the development of packaging and tests of the Apache Hadoop
ecosystem.
