Académique Documents
Professionnel Documents
Culture Documents
Aggregates
Posted by Jens Krueger (https://blogs.saphana.com/author/jenskrueger/) on January 14, 2015
More by this author (https://blogs.saphana.com/author/jenskrueger/)
Tags: sap hana (https://blogs.saphana.com/tag/saphana/) sap simple finance (https://blogs.saphana.com/tag/sapsimplefinance/)
Imagine you calculate the sum of a long series of numbers in your head, for example, the sales of the current month. For you as a human
being, it probably feels natural to write down the final result as soon as you have finished the calculation. You do this because you are afraid
to forget the result and to have to start all over.
For a computer, the situation is different. A computer is blazingly fast at doing such simple calculations. There is no need to physically store
the aggregate – the result of such an aggregation – in a database. The computer can always quickly calculate the result again and is not
afraid of the work which would feel tedious for a human being. In fact, it would be much more tedious for the computer (and any program
running on it), to update the stored aggregate figure every time the base data changes. For example, if a new sales order comes on, he
would not only need to record the new item in the database, but also update the stored aggregate, requiring additional expensive modifying
operations.
If this scenario sounds familiar to you, that may be from the first two parts of this series on SAP Simple Finance, where we discussed the
removal of redundancy in SAP Simple Finance (https://blogs.saphana.com/2014/09/30/howsimplefinanceremovesredundancy2/) and in
particular the case of materialized views (https://blogs.saphana.com/2014/10/27/howsimplefinanceremovesredundancythecaseof
materializedviews/), physical database tables that store query results. Materialized aggregates lead to similar issues as those mentioned
previously and have nevertheless similarly been used in the past for performance reasons. As materialized aggregates have some distinct
characteristics, let’s take a closer look in this article, which again is accompanied by a longer version that goes into more detail than the
summary.
Compared to materialized views in general, materialized aggregates are a special case due to key differences in several areas:
1. Underlying query: Their query involves aggregation (groupby clause in the query) and calculation, not only a selection and projection
of the base tuples of some database relation.
2. Structure: The tuples stored in the materialized aggregate are not identical replicas of some other data in the database, but have a
different structure. Typically, some identifying attributes, the dimensions, form the key and other attributes represent the aggregated
values. In the example from the beginning, the month dimension would be part of the key and the calculated sum of sales the
aggregate value, or fact.
3. Data distribution: Typically, the table of the materialized aggregate contains fewer tuples than the base table because of the
aggregation.
For a base table such as the table of accounting document line items (BSEG), many attributes are possible dimension, which may be applied
to an aggregation in any combination. For example, for different analysis questions the line items may be grouped by profit center, period,
and account or only by period. Figure 1 displays the resulting hypercube of different materialized aggregates originating from BSEG and
these three dimensions.
(http://1h9mo84f79012h5z12640ehswpengine.netdnassl.com/wpcontent/uploads/2015/01/blog_simplefinance_v2.1_figure4.png)
Figure 1: Example hypercube over accounting document line items
SAP HANA makes it feasible to get rid of materialized aggregates without compromising performance, as we show in the detailed calculations
below. Comparing different materialization strategies and onthefly calculation, it becomes apparent that even with one billion of line items
we can dispense of materialization altogether. To stay in the picture of the opening statements: writing down the result of each calculation is
not necessary. In the example introduced above in Figure 8, an inmemory database is able to calculate onthefly aggregate values by profit
center and period in less than half a second. As you increase the number of attributes to be considered as possible dimensions beyond the
simple example with three attributes, the tradeoff between performance and materialization shifts even more in the favor of onthefly
calculation. More possible combinations due to more dimensions mean that the materialized aggregate tables will contain more tuples,
diminishing the gain of fewer tuples to scan in case of queries.
An interesting situation highlighting the benefit of onthefly calculation occurs if you include more and more dimensions in aggregations to
remain flexible with regard to possible analysis queries by any reasonable combination of dimensions. If you or your traditional database
management system decide to materialize those (which would not be necessary inmemory, as outlined above), you may quickly reach a
point where the number of tuples stored overall in different materialized aggregates of the same hypercube exceeds the size of the base
table. This scenario wipes out any performance impact of the materialization, because scanning the base table will be faster than looking for
the particular materialized aggregate value within the larger number of tuples across different vertices of the hypercube.
In addition to the performance, you should also consider the significant gain in flexibility of onthefly calculation. Instead of being limited to
predefined aggregates, you can freely explore according to the analysis needs, thanks to the performance of SAP HANA.
In that spirit, the series of blog posts on SAP Simple Finance will continue with and overview and subsequent exploration of the business
value of SAP Simple Finance.
This blog post continues after the break with an indetail look at the points that we have summarized so far. First, we introduce the concept of
materialized aggregates and hypercubes. We then look at general strategies for materializing aggregates and the role of aggregation in
financial accounting.
The Concept of Materialized Aggregates and Hypercubes
A materialized aggregate is a materialized view with aggregation (Müller, Butzmann, Klauck, and Plattner: “An Adaptive Aggregate
Maintenance Approach for Mixed Workloads in Columnar InMemory Databases”; in: Proceedings of ACSC 2014). It is the materialization of
a query that aggregates the tuples of its base relations, grouping them by certain attributes and using an aggregation function such as COUNT
or SUM. Thus, it provides access to calculated and continuously synchronized values that aggregate individual items according to specific
criteria.
The following SQL template describes the general structure of an underlying simple query, where AGGx represents any aggregation function.
SELECT A_1, A_2, …, A_k, AGG1(V_1), AGG2(V_2), …
FROM baseTable
WHERE C_1=’…’ AND …
GROUP BY A_1, A_2, …, A_k
The set of attributes used in grouping will vary depending on the analysis question. In general, only a subset of all columns of the base table
will be at all relevant for being part of the grouping statement of any query. These columns form the analysis dimensions. However, in order
to not restrict the analysis possibilities when thinking of materialization, the dimensions should be an intentionally broad set. Out of the
dimensions, a specific analysis question may require grouping by any subset of dimensions. In case of a separate data warehouse system or
a separate materialization mechanism, this system needs to consider all dimensions and therefore store the data grouped at least on the
lowest level of granularity.
If we ignore hierarchies within a dimension (for example, different levels of time or customer groups), the possible analysis queries over a
specific base relation with k relevant attributes form a hypercube. Each possible view is a vertex of this kdimensional cube. The view
grouping by all relevant attributes A1, …, Ak is the most detailed vertex (lowest granularity of grouping). Each nextlower level of detail
consists of all views that leave out one more attribute than the higher level.
An example threedimensional hypercube over the accounting document line items has been displayed in the Figure 1 above. For reasons of
simplicity and to not overcomplicate the explanations, we initially only use period ( ), G/L account ( ) and profit center ( ) as dimensions.
We will later extend the example to include more attributes as dimensions. For our discussion, we assume the following cardinalities in the
system:
1 billion line items,
1000 accounts,
100 profit centers, and
100 periods.
To put the number of periods into comparison: 5 years with 12 months per year amount to 60 periods (months). If data were aggregated by
week, the same time span of 5 years would already lead to more than 260 periods (weeks). A more finegranular aggregation on daily basis
would lead to at least 1825 periods (days).
In this example, the highestdetail vertex is the view grouping by all dimensions < , , >. Assuming no correlation between dimensions, it
could have values for up to 10 million cells (100 periods × 1000 accounts × 100 profit centers). Let us assume that half of the combinations
do not occur in the base relation, so that the top view contains 5 million tuples (the numbers for the other vertices are similarly based on
assumptions and displayed on the side of each box in Figure 1). Each tuple could have one or several aggregate values, for example, the
number of items and the sum of the debit amounts. On the next lower level of detail, each of the vertices represents a view whose grouping
leaves out one attribute. For example, the view < , > on the second level leaves out the account ( ) from the grouping. Each lower level
of detail consists of all permutations with one attribute less than the level above. The lowest level of detail consists of the single vertex that
groups by no attribute and thus sums over all entries.
Materialization Strategies and Performance
It would be possible to materialize all or some of the views represented by vertices in the hypercube. If it is being discussed to materialize
some aggregates, the highestdetail vertex would be an obvious candidate. Since each vertex can be calculated from the vertices of higher
levels that group by a superset of its attributes (for example, < > can be calculated from < , > by grouping on ), all other vertices could
be calculated based on this materialized aggregate. Further vertices could be selected for materialization based on the benefit incurred by
having it materialized. This benefit is proportional to the difference in the number of tuples compared to the higherlevel vertices, as this
difference represents the aggregation effort: a materialized view that contains 2.5 million tuples when a materialized view of one level above
contains 5 million brings less benefit compared to a materialized view where only 5000 entries remain. A query on the first materialized view
would still have to scan through 2.5 million tuples (that is, half of the higherlevel view) to find the answer for a given query on the aggregate,
so that the materialization brings almost no performance gain. In contrast, having only 5000 entries to scan for a given question considerably
improves the performance and may be worthwhile, depending on the requirements.
We can ignore the effort required for the actual calculation in all of these estimates, because the I/O activities while scanning the column
actually dominate the CPU time required for the (simple) calculations. The CPU remains mostly idle while data is continuously loaded from
memory into the cache (at a processing speed of 4 MB/ms/core).
Let us consider the query for the sum of sales for profit center 123 in January 2014 under three different scenarios of materialization. The
attribute vector for profit center has a width of 10 bit to be able to represent all 1000 profit centers (period: 7 bit for 100 periods).
Number of entries to
Scenario scan Scanning time (independent scans)
Period: 834 MB / 4MB/ms/core = 209 ms / core
Period: 4.3 KB, 1 μs
As evident from these backoftheenvelope calculations, materializing the highestdetail vertex is not necessary for performance reasons,
but would at least in this example be noticeable for users – although they probably wouldn’t mind the extra waiting time of less than half a
second. Already the subsequent reduction achieved from materializing the vertex of the lower level of detail is negligible under most workload
patterns. Hence, at most materializing the highestdetail vertex may be reasonable in this simple example if performance expectations
require it.
In the example above, the highestdetail vertex stood for a significant reduction in the number of entries. However, this came at the cost of
only considering three attributes (out of close to 400 of the table BSEG in SAP ERP Financials) as interesting for grouping. As outlined above,
a lot more would be possible dimensions in reality. The question is: at what point does it make sense to materialize at least the highestdetail
vertex? As this trades flexibility and database size on the one hand with query performance on the other hand, the answer depends on the
preferences at hand. Furthermore, the disadvantage of materialization with regard to throughput has to be taken into account as well. In fact,
the impact of materialization on the complexity of data modifications is even more pronounced for materialized aggregates, because
identifying all the materialized aggregate values to be updated after modifications in the base tables requires more consideration. Additionally,
the same materialized aggregate will need to be updated for different underlying changes, because it aggregates more than one line item.
For example, a total balance for a certain account will be synchronized whenever a line item for that account is posted. Thus, contention
through locking will happen more frequently, further reducing the throughput of a system with materialization in place. In effect, materializing
aggregates trades query performance for modification performance and transactional throughput. Taking into account that the impact on
query performance is diminishing thanks to inmemory speed, only the disadvantages of materialized aggregates especially in case of data
modifications remain.
In addition, there are certain cutoff points at which it does not make sense to materialize an aggregate irrespective of the individual
preferences. To be of any benefit, each materialized aggregate should contain fewer tuples than the base relation. Otherwise, scanning the
materialized view for the answer is not significantly faster than accessing the same number of entries directly in the base table. As a lot of
attributes may be of potential interest, it is not given that the highestlevel vertex contains significantly fewer entries than the base relation, as
the grouping will be rather finegranular. The same condition needs to hold true for all lowerlevel aggregates – their size in terms of tuples
needs to be smaller than any other materialized view from which it could be derived.
In addition to these hard constraints on each individual materialized view, the complete set of materialized aggregates needs to be
significantly smaller in terms of the number of tuples than the base table. Otherwise, the same performance situation as described above
applies as the database system has to access the corresponding materialized aggregate data: if the hypercube over all vertices contains
more tuples than the base table, the system in effect has to consider more tuples in order to access a specific aggregate value than a full
scan of the base table would require.
In a scenario of k dimensions, there are 2k vertices in the hypercube. If the materialization strategy is too liberal and there are many
dimensions, it is easily possible to end up with more tuples in the set of all materialized views than there are in the base table. Imagine again
a base table with 1 billion tuples. Already with only 30 dimensions, there will be more than 1 billion vertices in the hypercube (230 ≈ 1.07
billion). Even if only every thousandth vertex is being materialized, each of the resulting vertices must not contain more than 1,000 entries on
average – otherwise, the total number of tuples in materialized views will already be greater than the number of entries in the base table. If,
for example, the average number of tuples per vertex was 1,500, the system would have to consider more than 1.6 billion aggregate tuples. It
is easy to imagine based on the performance calculations above that this may take longer than just scanning the base table and calculating
the aggregate on the fly – not to speak of the associated storage costs. This also holds in view of the fact that each query will only access a
specific vertex: the system still has to look for the tuples belonging to that specific combination of dimensions.
A direct access to the tuples of that specific vertex would be comparable to an index and has conceptually no significant impact in an in
memory system. Thanks to the sequential scanning speed in memory, an index has considerably small impact on performance.
Only relying on a possibly fixed set of materialized aggregates for analysis purposes is dangerous due to the ensuing loss of information. If
the performance of the underlying (diskbased) database system does in contrast to SAP HANA not allow free exploration of aggregates
(calculated onthefly) and drill down to the original line items themselves, relevant information will be hidden from the user. Only with flexible
instead of fixed aggregates based on inmemory technology is it possible to spot the following situations:
The revenue of a company seems to have reassuringly increased over the last quarter. However, when looking deeper at the individual
sales themselves, one single close “saved” the total numbers, while, overall, business has been declining.
In another company, profit has been stable over the last years on group level as well as for individual products. In fact, however, all
profit essentially comes from sales to a single customer. The predefined materialized aggregates do not allow fast access to cover all
possible scenarios.
Trends that are visible on the lowest level get hidden as managers only have access to certain highlevel aggregations.
Aggregation in Financial Accounting
So far, we have looked at a simple example to explain the basic concepts. Although the materialized aggregates in this example significantly
reduce the number of tuples to scan for queries, materialization is at best borderlinebeneficial. Its benefits appear even more doubtful if you
take into account that the example dramatically reduces the flexibility by restricting the aggregation to only three dimensions. Based on this,
reports could only answer those queries that aggregate by one of the options in the hypercube. In reality, there will be a lot more dimensions
of interest to analyses. Hence, we now look at a more complex example.
The following table lists a set of fields from BSEG that are the absolute minimum for any proper analysis source on accounting documents. The
last column gives an estimate of the number of unique dimension values per field (the numbers are a rough approximation and meant to give
an impression of the order of magnitude – they will of course differ between systems). The product of these cardinalities would be the
maximum number of cells in the highestdetail vertex of the hypercube. Even with this still rather small selection of fields, this amounts to 2
octillion cells (that’s a two with 27 zeroes)! Obviously, only a small subset will actually be filled, as for many combinations of dimension values
there will be no entries in the base table. If there are 1 billion line items, only every twoquintillionth cell of the highestdetail view can be filled
at all. To reach a substantially smaller highestdetail vertex, compared to the base relation, this factor has to be still larger, meaning a
substantial amount of correlation between dimensions.
Company (code) BUKRS 50
Fiscal Year GJAHR 5
Fiscal Period (e.g., month) H_MONAT 16
Posting key (determines BSCHL 50
account type, debit/credit, …)
Status (open / cleared) AUGBL=’’ 2
Functional area FKBER 10
The numbers highlight: the more fields there are that may be interesting to include in an analysis, the more likely it is that the highestlevel
vertex of the hypercube will not be significantly smaller in the number of tuples. The number of entries that can actually be grouped together
on this level will be small because there will be few entries with matching values in all dimensions. This means that materializing the highest
detail vertex will not significantly reduce the number of tuples to scan.
When creating a materialized aggregate, it is not possible to foresee all analysis questions of interest in the future. In order to not restrict the
flexibility to certain predefined exploration paths, it is advisable to opt for a broad set of dimensions to include. As outlined above, the most
relevant factor when deciding whether to materialize aggregates is the reduction in the number of tuples. This should be estimated based on
sample data to get a feeling for how sparse the highestdetail vertex is. The savings in the number of entries will only linearly influence the
performance of aggregate queries. Whether a factor of 10 will be worth the effort and costs of a materialized aggregate depends on the
performance without materialization compared to expected response times – which inmemory speed will help to achieve more easily also
without materialization.
On a side note: there may be extreme cases where it nevertheless may be reasonable to materialize certain vertices of the hypercube, if the
reduction in the number of tuples to scan that is achieved by such a materialization is significant. In any case, this should be handled
dynamically by the database itself based on an analysis of queries and response times. It should not have to rely on a fixed materialization
strategy set beforehand by the application and, more importantly, keeping the redundant data in sync must not be the task of the application
itself. In order to not increase application complexity and to integrate better with the database, the database needs to handle identification
and update of materialized aggregates by itself. To this end, the concept of aggregate caching has been proposed for inmemory databases
utilizing the maindelta architecture (Müller and Plattner: “Aggregates Caching in Columnar InMemory Databases”; Proceedings of the 1st
IMDM at VLDB 2013).
Deep knowledge of the statistical distribution of the data is necessary for deciding on a materialization strategy. When trying to thus optimize
the queries by materializing select aggregates, it has to be kept in mind that slight changes in the statistical distribution of the underlying data
may require reoptimization. Contrast this with the fact that a column store exhibits a stable query performance which is less affected by the
data characteristics.
Materialized aggregates also restrict the flexibility with regards to applying filters or performing calculations before aggregation. For example,
an analysis that only wants to look at sales of more than one million Euros cannot operate on already aggregated data, because the
aggregate values have lost the necessary information. Similarly, the loss of information from aggregation means that certain (socalled non
distributive) functions cannot be part of analysis questions, if the materialized aggregate table does not store additional information
beforehand. For example, calculating the average sales amount (assuming the average itself is not part of the materialized aggregate table)
needs both the sum of sales and the number of sales to be stored in the materialized aggregate table. Other functions such as minimum or
median even need more finegranular additional information (these are called nonalgebraic or holistic). These different classes of
aggregation functions also have an effect on whether it is easily possible to update materialized aggregates in the event of inserts, updates,
or deletes of the line items.
Conclusion
As explained and demonstrated in this and the preceding blog posts, not having to rely on materialization has substantial benefits. Onthefly
calculation offers more flexibility and lowers the total cost of ownership compared to materialized views and materialized aggregates (thanks
to higher throughput and lower database footprint).
The redundancy of materialized views and materialized aggregates was necessary in traditional database systems to achieve adequate
performance. With the inmemory database SAP HANA, it is indeed feasible to eliminate redundancy and have a highly performant system.
We have demonstrated both theoretically (https://blogs.saphana.com/2014/10/27/howsimplefinanceremovesredundancythecaseof
materializedviews/) and at the example of SAP Simple Finance (https://blogs.saphana.com/2014/09/30/howsimplefinanceremoves
redundancy2/) how getting rid of redundancy is possible and beneficial. SAP Simple Finance means Zero Cost for Redundant Data and
unprecedented levels of analysis flexibility, all in one system.
Throughout this series, we thus have so far explained the benefits that arise from the new technical possibilities. Most importantly, these are
increased flexibility,
higher throughput, and
lower database footprint.