
Experience Report: Processing 6 Billion CDRs/day - from Research to Production
Eric Bouillet1 Ravi Kothari3 Vibhore Kumar2 Laurent Mignet3 Senthil Nathan2
Anand Ranganathan2 Deepak S. Turaga2 Octavian Udrea2 Olivier Verscheure1
IBM Technology Campus, Damastown Industrial Estate, Mulhuddart, Dublin 15, Ireland1
Thomas J. Watson Research Center, IBM Research, 19 Skyline Drive, Hawthorne, NY 10532, USA2
IBM Research - India, 4 Block C, Institutional Area, Vasant Kunj, New Delhi - 110070, India3

{bouillet,verscheure}@ie.ibm.com
{vibhorek,sen,arangana,turaga,oudrea}@us.ibm.com {rkothari,lamignet}@in.ibm.com
ABSTRACT

A call detail record (CDR) is a data record produced by a telephone exchange or other telecommunications equipment, documenting the details of a phone call that passed through the exchange or equipment. Telecommunications companies (or telcos) use CDRs for purposes of billing, extracting business intelligence, fraud detection, etc. However, they face a Big Data challenge: many telcos receive billions of CDRs per day and are unable to keep up with these data rates. In this paper, we describe a stream processing solution for processing CDRs that allows scaling the processing to handle 6 billion CDRs per day for a certain telco. We describe the stream processing application (running on the IBM InfoSphere Streams platform) that performs CDR mediation and analysis in real-time. We also describe various business and operational constraints and the legacy software ecosystem - seldom discussed in academic gatherings - that make the problem more challenging than meets the eye. The outcome of our work is a highly configurable and scalable CDR processing stream with several functional and performance capabilities that are a first for the telecommunication industry.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

Keywords

Call Detail Records, Mediation, Real-Time Analytics, IBM InfoSphere Streams

1. INTRODUCTION

In today's competitive market, telecommunications companies are in a race to differentiate themselves from the competition and are striving hard to maintain their profit margins. The key to staying ahead of the competition, as is becoming evident in several other domains as well, lies in the ability to derive timely, actionable insights from the massive amounts of customer and operational data available within the organization. For telecommunications companies that predominantly offer wireless services, a large volume of their data takes the form of call detail records (CDRs), and an event processing system that is capable of ingesting and analyzing these CDRs in real-time, as they are generated by the network equipment, can provide valuable insights. These insights range from real-time billing, to location-dependent marketing offers, to detecting, in real-time, the issues being faced by subscribers (e.g. dropped calls), to expediting the detection and diagnosis of issues with the network infrastructure.

This paper describes our experience with implementing and deploying a novel CDR processing application using the IBM InfoSphere Streams [2] middleware. The deployed real-time CDR processing application is a mediation and analytics solution capable of ingesting CDRs from a variety of network elements, transforming and enriching such CDRs in real-time, performing on-the-fly aggregation and analytics on the CDR stream, and finally loading the CDRs into a warehouse for archival and for performing deeper analytics. This paper attempts to capture the challenges and the specific solutions that were implemented for the real-time CDR processing application.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DEBS '12, July 16-20, 2012, Berlin, Germany.
Copyright 2012 ACM 978-1-4503-1315-5 ...$10.00.

2. SOLUTION OVERVIEW

Call Detail Records are structured event records generated within a wireless telecommunication network, by network switches and elements, to summarize various aspects of individual connections for different types of services, including voice, Short Message Service, Multimedia Message Service, etc. Typical CDRs contain information about the call origin, call destination, timestamp, call duration and sequence number, as well as additional information such as the call status (busy, dropped, connected), fault conditions, the number to be charged, etc. Capturing and processing all generated CDRs is central to supporting critical telecommunication service provider applications, including billing, revenue assurance (RA), and fraud management (FM). Additionally, analysis of CDRs can provide several insights into the state of the network, call distribution, and user behavior - all of which are necessary for several business intelligence (BI) applications. These applications range from long-term provisioning, load-balancing, and system design all the way to several real-time services

Figure 1: Solution Architecture. Components: File Ingest & Parallelization; File Parsing & Error Handling; Rules: Lookups & Transforms; Rules Compiler; De-Duplication; Checkpoint Controller; Parallel or Serial Write; Master Script; Config Files; External Data; CDR Statistics; CDR Repository; Real-Time Aggregates & Dashboards.


for fault recovery, customer experience management, content- and location-driven advertising, e-commerce applications, and several other novel applications (e.g. social networking, real-time transportation services, etc.).

Mediation is the first step in processing these CDRs, and involves capturing CDRs from upstream network systems and making them ready for processing by downstream applications (RA, BI, FM, warehousing). This is a complex task composed of several steps that include:

Collection: capture and ingest CDRs from various source sub-systems
Validation and Filtering: identify relevant CDRs and discard invalid or corrupted CDRs
Collation: correlate and aggregate all CDRs corresponding to one call
Format Conversion and Normalization: parse binary and proprietary formats to extract fields of interest
Enrichment and Transformation: apply business rules to enrich and transform CDRs
De-duplication: filter duplicate CDRs that may have been injected into the data-flow by source sub-systems
Analysis and Summarization: compute aggregates and summaries of CDR data, and visualize results
Distribution: transmit CDRs for further downstream processing

Additionally, given that this is a business-critical task, there are several requirements on the performance and the fault-tolerance of a mediation system.

In this paper we describe the mediation and analysis system we built and deployed (in production at a large telecommunication provider) using the InfoSphere Streams platform. Specifically, we provide an overview of the architecture and components developed for these different functions (Figure 1), as well as the appropriate systems support for the required performance and guarantees on failure recovery. We also describe our tooling for user interaction and management, including support for application development and extension, monitoring and end-to-end provenance, and finally for real-time result visualization and validation. We emphasize the key technical challenges associated with each of these tasks, which made this a complex, multi-disciplinary, multi-month effort. We provide details of these challenges, our design decisions, and our implementation and results in the following sections.

3. REQUIREMENTS AND CHALLENGES

The key challenges in the implementation of the CDR mediation and analysis solution using IBM InfoSphere Streams were around performance, scalability, latency, ease-of-use and fault-tolerance. The following sections briefly describe these challenges.

3.1 Performance

The key requirement was the ability to process 6 billion CDRs per day, which translates to around 70,000 CDRs per second. However, various operational constraints meant that the desired throughput was about three times that value (i.e. around 220,000 CDRs per second). This is because of frequent power and infrastructure outages, sporadic human operator errors in configuring the system, delays in getting source data from switches, etc. These issues often result in backlogs of unprocessed CDRs that need to be processed as soon as possible. Hence, the system had to support a higher throughput in order to overcome these issues.

3.2 Scalability

Since the subscriber base of telcos is growing rapidly, another key requirement was the ability to scale easily in the future as the number of CDRs per day increases. For example, telcos frequently see growth rates of 10-20% every year, and they would like their applications to scale up seamlessly when the rate increases, potentially by just adding more hardware.
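The throughput and growth figures in Sections 3.1 and 3.2 are easy to sanity-check. The short sketch below is our own back-of-the-envelope arithmetic (not part of the deployed system); it reproduces the quoted per-second rates and projects volume under the stated yearly growth:

```python
# Back-of-the-envelope check of the targets in Sections 3.1 and 3.2.
# Figures from the paper: 6 billion CDRs/day, a ~220,000 CDRs/s
# catch-up target, and 10-20% yearly subscriber growth.

CDRS_PER_DAY = 6_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

steady_rate = CDRS_PER_DAY / SECONDS_PER_DAY
print(f"steady-state rate: {steady_rate:,.0f} CDRs/s")  # ~69,444, i.e. ~70,000

catchup_rate = 220_000  # quoted in Section 3.1
print(f"headroom factor: {catchup_rate / steady_rate:.1f}x")  # ~3.2x

# Section 3.2: projected daily volume after n years at 20% growth.
for years in (1, 3, 5):
    projected = CDRS_PER_DAY * 1.20 ** years
    print(f"after {years} yr @ 20%/yr: {projected / 1e9:.2f} B CDRs/day")
```

The headroom factor of roughly three is what lets the system drain a day-long backlog in well under a day while still keeping up with live traffic.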

3.3 Latency

While most of the applications downstream of a mediation system are not very sensitive to small delays

Figure 2: Snapshot of deployed application showing parallel chains (region-wise parallel processing of CDRs) and intra-region parallelism
in the arrival of data, most of them can benefit from a mediation system that can bring the latency down from a few hours/days to a few seconds/minutes. These include applications like fraud management, customer experience management and some early fault-detection applications. When implementing a mediation solution using InfoSphere Streams, the latency introduced by the ingest and processing of CDRs was virtually eliminated; however, detecting duplicate CDRs in real-time over large windows of time (e.g. 15 days = 90 billion CDRs) while maintaining high throughput and low latency for inserts into the warehouse was a challenge that we had to address.
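To get a feel for why duplicate detection over a 15-day window is hard (and why Section 4.3 adopts a probabilistic structure for it), a rough sizing estimate helps. The sketch below is our own illustration; the false-positive rates and the per-key byte cost are assumed values, not the deployed parameters:

```python
import math

# Rough sizing for duplicate detection over the 15-day window from
# Section 3.3: n = 90 billion CDRs. Per-key cost and false-positive
# rates below are illustrative assumptions.

n = 90_000_000_000

# An exact in-memory hash set: even at a lean ~16 bytes/key this is
# on the order of terabytes of RAM.
print(f"exact hash set: ~{n * 16 / 1e12:.2f} TB")  # ~1.44 TB

# Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits,
# with k = (m/n) * ln 2 hash functions.
for p in (0.01, 0.001):
    m_bits = -n * math.log(p) / math.log(2) ** 2
    k = round(m_bits / n * math.log(2))
    print(f"p={p}: ~{m_bits / 8 / 1e9:.0f} GB, k={k} hashes")
```

At a 1% false-positive rate the filter needs roughly 108 GB (about 9.6 bits per element), which is large but feasible to hold in memory across parallel chains, whereas an exact set is not.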

3.4 Ease of Use

Telcos in many countries face the problem of not necessarily having highly skilled programmers or researchers to develop and manage highly performant applications for them. They also often face the problem of employee churn. Hence, a key requirement is that the application must be very easy to deploy and manage by human operators, who may not be familiar with high-performance computing or stream processing. It should also be easy for them to change certain portions of the application as required, particularly the enrichment, transformation and lookup rules.

3.5 Fault-Tolerance

Telcos operate huge computing infrastructures and often face problems such as frequent power and infrastructure outages and human operator errors in configuring the system. Hence, there is a good chance that a long-running stream processing application will fail at some point, and a key requirement is that the system must recover from different kinds of failures and come back up. In addition, no data should be lost when the system fails. There is a need to provide transactional guarantees for processing CDRs (i.e. a CDR must be processed completely or not at all). This means that the system should keep track of which CDRs are at different stages of processing. If the system fails, it should come back up and reprocess partially processed CDRs.

4. KEY FEATURES OF THE SYSTEM

In this section, we describe some of the key features of the CDR processing application that were developed to address the challenges described above.

4.1 Region-Based & Intra-Region Parallelism

In order to achieve the requirements of high performance and scalability, one of the key features of the system is parallelized CDR processing. A natural way of parallelizing the processing is by region, which represents a location-based zone of operations for a telco. All the processing for each region (CDR transformation, enrichment, de-duplication, aggregates, etc.) can happen in parallel. One of the challenges, though, is that different regions have different loads at different times. Hence, our application has various strategies to balance the processing across different parallel chains based on current loads. Figure 2 shows a snapshot of one version of the application where the incoming CDRs are split across 15 parallel chains of processing (one corresponding to each region). The degree of parallelism, based on the number of regions, can be easily modified by means of a configuration parameter. To address issues that may arise due to load disparity between the various regions, the application exposes a configuration parameter that allows one to select the number of parallel sub-chains for each region. While such a parallelization strategy seems obvious, the fault-tolerance issues and the need to maintain throughput lead to several interesting challenges.
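The paper does not show the routing logic itself; the following minimal Python sketch illustrates what region-keyed dispatch with a configurable number of sub-chains per region might look like. All names, the region-numbering scheme, and the choice of hash are our own assumptions, not the deployed SPL code:

```python
import zlib

# Sketch of region-based routing (Section 4.1): each CDR goes to its
# region's parallel chain, and a per-region sub-chain count (set in a
# config file) absorbs load disparity between regions.

NUM_REGIONS = 15                    # one parallel chain per region
SUBCHAINS = {"R03": 4}              # busier regions get more sub-chains
DEFAULT_SUBCHAINS = 2

def route(cdr: dict) -> tuple:
    """Return (chain, sub_chain) for a CDR, keyed by its region."""
    region = cdr["region"]          # e.g. "R03"
    chain = int(region[1:]) % NUM_REGIONS
    width = SUBCHAINS.get(region, DEFAULT_SUBCHAINS)
    # Hash the calling number so a given subscriber's CDRs always land
    # on the same sub-chain, keeping per-subscriber state local.
    sub = zlib.crc32(cdr["caller"].encode()) % width
    return chain, sub

print(route({"region": "R03", "caller": "+911234567890"}))
```

Keying the sub-chain on a stable attribute of the CDR (rather than round-robin) is what makes stateful steps like de-duplication and aggregation safe to parallelize, since all records for one key meet the same state.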

Figure 3: Real-time visualization of aggregates determined from CDRs. (a) The number of system- and user-terminated calls for enterprise and non-enterprise customers in the last hour. (b) A drill-down showing the system call termination reason for enterprise customers for cell-site 201 in the last hour.

4.2 In-Memory Processing

In order to improve throughput, an absolute requirement is to avoid referring to any files or databases during processing, as far as possible. Hence, all lookup tables, de-duplication information, aggregates, etc. are maintained in memory. This does make things interesting for fault tolerance, though. We also implemented a capability that allows operators to hot-swap the lookup tables in a running application.

4.3 De-duplication Using Bloom Filters

In order to handle the potential scale of de-duplication, we chose to use Bloom filters [1] for detecting duplicates. Bloom filters have excellent scaling and memory properties. The main choice is in setting an appropriate false-positive threshold. Our system also has mechanisms for periodically checkpointing the Bloom filters to recover from any software or hardware failures.

4.4 Log-Replay Based Fault-Tolerance

The primary interface for receiving CDRs from network elements is files, whose sizes can vary widely (from a few KB to GBs). The log-replay based approach to fault-tolerance exploits the fact that CDRs arrive in files. A checkpointing mechanism keeps track of the CDRs that have been processed through the Bloom filters and the ones that have been committed to the warehouse. In case of a failure, the application is restarted, the appropriate checkpoints are loaded, and the required files are replayed to bring the system back to a stable state. An interesting optimization that we are currently implementing is the ability to restart only parts of the application (e.g. the parallel chain that contains the data-flow operator that failed).

4.5 Parallel Insertion into Database

The processed CDRs finally need to be inserted into a database at very high rates. DB2 has a very useful partitioning feature that makes it possible to insert CDRs into the same database table in parallel. In some instantiations of the application, there are 216 parallel DB2 insertion operations running simultaneously.

4.6 Real-Time Aggregates & Analytics

An interesting ability offered by Streams is the ability to analyze data on-the-fly; this includes the calculation of aggregates over varying time windows (e.g. dropped calls in the last hour). The CDR processing application has access to enriched records even before they are inserted into the warehouse, which allows the application to calculate and maintain several pre-configured aggregates in real-time. An adapter to a dashboard is then used to visualize such aggregates and monitor activity as it happens. Such aggregates not only reduce the load on the warehouse but also enable early detection of anomalies. Figure 3 shows a sample dashboard that makes use of aggregates from the CDR application.

4.7 Rules Language

To simplify the specification and modification of rules by operators who may not be highly skilled, the CDR processing application comes with its own domain-specific rules language. The rules, along with enrichment and lookups, are specified in a separate file, and this file is the interface exposed to operators for modifying the business logic that runs as part of the CDR processing application.

4.8 Master Script and Configuration Files

In order to improve the usability and manageability of the application, we developed a master script that allows one-command operations to orchestrate all the moving parts of the system. Also, all configuration parameters of the application are exposed in a single file. Hence, the operators never need to change the actual SPL (Stream Processing Language) application.

5. CONCLUSION

In this paper, we briefly presented a stream-processing based system for processing CDRs. We discussed some of the requirements, challenges and design decisions in the building of the application and the supporting infrastructure.

6. REFERENCES

[1] Bloom Filter. http://en.wikipedia.org/wiki/Bloom_filter, 2012. [Online; accessed 30-May-2012].
[2] IBM InfoSphere Streams. http://www-01.ibm.com/software/data/infosphere/streams/, 2012. [Online; accessed 30-May-2012].
