When In-Memory Computing Is Slower Than Heavy Disk Usage

Kamran Karimi1, Diwakar Krishnamurthy2, Parissa Mirjafari3

Dept of Biological Sciences1,
Dept of Electrical and Computer Engineering2
University of Calgary
Calgary, Alberta, Canada
{kkarimi, dkrishna}@ucalgary.ca
Dept of Chemical and Biological Engineering3
University of British Columbia
Vancouver, British Columbia, Canada
parissa.mirjafari@alumni.ubc.ca
Abstract
Disk access latency and transfer times are often considered to have a major, detrimental impact on
the running time of software. Developers are often advised to favour in-memory operations and
minimise disk access. Furthermore, diskless computer architectures are being studied and designed to
remove this bottleneck altogether, to improve application performance in areas such as High
Performance Computing, Big Data, and Business Intelligence. In this paper we use code inspired by real
production software to show that in-memory operations do not always guarantee high
performance, and may actually cause a considerable slow-down. We also show how small code changes
can have dramatic effects on running times. We argue that a combination of system-level improvements
and better developer awareness and coding practices is necessary to ensure in-memory computing can
achieve its full potential.
1. Introduction
The prevalence of application domains such as High Performance Computing, Big Data, and Business
Intelligence has drawn special attention to reducing software running times. However, high software
performance is of interest in nearly all application domains. Many factors determine
software performance, and disk access is among them. Traditional software development wisdom has
considered frequent disk access to be a source of performance loss. Disks, whether mechanical or SSD,
have orders of magnitude higher latency and transfer times than main memory (RAM). Even casual
computer users know that a noticeable slow-down will happen if disk swapping is triggered on their
computers. As a result, considerable effort has been made to minimize disk access, using methods such
as caching [2]. Falling main memory prices have allowed moving further in this direction. Work is in
progress on devising algorithms that perform only, or mainly, in-memory operations [1,5], and some
databases store their data in the main memory as much as possible, either as an option or by default
[3,6].
In most cases an in-memory operation is indeed faster than an equivalent one involving disk access.
Progress in disk manufacturing and software management has mitigated the problem to a certain
extent, but has not completely removed it. We expect clever algorithms to continue to appear to lessen
the reliance on disk access. As the price of RAM drops, we see such algorithms applied to bigger datasets.
This avenue is so promising that there are major efforts to design computing systems, such as HP's The
Machine [7], that have no disks and rely only on volatile or non-volatile [4] main memory for all their
needs.
In this paper we show that removing or lessening disk access does not necessarily result in increased
software performance, which we simply define as the amount of time it takes a piece of software to
finish running. In fact, the simple examples we use in Section 2 run much faster when frequent disk
access is performed than when running in-memory. In Section 3 we argue that achieving the
performance goals promised by in-memory computing and diskless computers may require re-examination of relevant system-level algorithms, as well as better training of software practitioners so
they are aware of potential pitfalls. We conclude the paper in Section 4.
2. In-memory vs. disk-only content creation

In this section we write a simple program in Java and Python to generate some data and save them to a
file. First, we follow the usual wisdom and avoid disk access as much as possible. The contents are
generated in-memory, and after that a single write is used to output them to disk. Although the resulting
code, shown in Appendix 1 in Java and Appendix 2 in Python, was developed specifically for this paper, the
inspiration for it came from examples of real-life, production code.
After this in-memory content creation, we continue by going against the recommendations, and write
code to perform the same operation using very frequent disk access. In this second phase we generate
data in smaller chunks and save them to disk immediately. In both cases we measure the time it takes to
complete the operation. In the in-memory case, we measure the time it takes to prepare the contents,
and also the time it takes to save the contents to disk. In the disk-only approach, there is no in-memory
operation and we only measure the time it takes to perform the disk operations. In both cases we flush
the disk file before closing it, so that the measured times include the pending disk writes.
Constructing the file content has a major impact on the code's performance. For the in-memory case, we
define a string to contain the file contents, and use a loop to concatenate another string to it, until a
predetermined file size is reached. This size limit was arbitrarily set to 1,000,000 bytes (less than 1 MB),
which is a small fraction of the RAM available in most current computers, so the situation described in the
paper happens even for small data sets. We start by adding 1 character (byte) at a time to the content,
so in the in-memory case, the string containing the file will be concatenated 1,000,000 times. In the disk-only case, 1,000,000 disk operations are issued. We then repeat the experiment by adding 10, 1,000, and
1,000,000 bytes at a time.
We expect the results to converge as the size of the string being added to the in-memory file contents
(or saved to disk) increases, since fewer concatenation and disk operations will be performed. At the limit,
the in-memory file will be a pre-built string of size 1,000,000 which will be saved to disk once. Similarly,
for the disk-only method, we will save that string to disk only once.
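The number of operations implied by each setup follows directly from the sizes quoted above. The short Python 3 sketch below is an illustration only; the names TOTAL_BYTES, CHUNK_SIZES and operations_needed are ours and do not appear in the appendices.

```python
# Target file size and added-string lengths, as quoted in the text.
TOTAL_BYTES = 1000000
CHUNK_SIZES = [1, 10, 1000, 1000000]


def operations_needed(total_bytes, chunk_size):
    """Return how many chunks of chunk_size bytes make up total_bytes.

    This is the number of concatenations in the in-memory case and the
    number of write calls in the disk-only case.
    """
    return total_bytes // chunk_size


operation_counts = {size: operations_needed(TOTAL_BYTES, size)
                    for size in CHUNK_SIZES}

for size in CHUNK_SIZES:
    print("chunk of %d byte(s): %d operations" % (size, operation_counts[size]))
```

The last entry is the limit case described above, where both approaches perform a single operation.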
2.1 Java Experiment

Table 1 shows the Java results when adding to the file contents 1 byte, 10 bytes, 1,000 bytes, and
1,000,000 bytes at a time. The test machine was running Red Hat Enterprise Linux 6.5, with 20 GB of
memory. Java 6 was used to compile the test code, and a single core was used to execute it. We ran each
experiment 10 times and computed the average running times, as displayed.
Added string length   In-memory: string            In-memory: single write   Disk-only: writes
(bytes)               concatenation time (ST1)     to disk time (DT1)        to disk time (DT2)
1                     274.9121                     0.0055                    0.0293
10                    26.850                       0.0051                    0.0073
1,000                 0.8559                       0.0053                    0.0040
1,000,000             0.0                          0.0048                    0.0038
Table 1. Average Java running times in seconds, measured under Linux.
The running times for Windows appear in Table 2. The test machine was running Windows 7
Professional, with 16 GB of memory. Java 8 was used to compile the test code, and a single core was
used to execute it. As with Linux, each experiment was run 10 times and average times are reported.
Added string length   In-memory: string            In-memory: single write   Disk-only: writes
(bytes)               concatenation time (ST1)     to disk time (DT1)        to disk time (DT2)
1                     274.626                      0.008                     0.0295
10                    28.783                       0.0095                    0.0078
1,000                 0.713                        0.0063                    0.0106
1,000,000             0.0                          0.0109                    0.0094
Table 2. Average Java running times in seconds, measured under Windows.
The absolute running times reported in the above tables are mainly determined by the specific hardware
and system software used, so we are more interested in the relative differences in execution times. The
graph in Figure 1 shows the total speedup of the disk-only versus in-memory approaches for both Linux
and Windows, calculated as the sum of the in-memory string and disk times, divided by the disk-only time.
Both axes are in logarithmic scale. As can be seen, with single-byte increments the disk-only approach is
about 9,000 times faster than the in-memory approach for both operating systems. In other words,
issuing a million disk operations is nearly four orders of magnitude faster than in-memory
string concatenation, which is contrary to current code development wisdom. Both the Linux and Windows
graph lines are nearly linear, confirming that reducing string operations reduces running times. As
expected, the in-memory approach catches up to the disk-only version at the end, when the two
algorithms are basically doing the same thing.
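[Plot: total speedup (logarithmic vertical axis, 1 to 10000) against added string length in bytes (logarithmic horizontal axis, 1 to 1000000). Series: Total speedup (Linux), Total speedup (Windows).]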
Figure 1. Disk-only vs. in-memory speedup for Java: (ST1+DT1)/DT2
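The ratio plotted in Figure 1 can be recomputed from the values in Tables 1 and 2. The following Python 3 sketch does so; the dictionary and function names are ours, chosen for illustration, and the numbers are copied from the two tables.

```python
# (ST1, DT1, DT2) in seconds, copied from Tables 1 and 2,
# keyed by the added string length in bytes.
JAVA_LINUX = {
    1: (274.9121, 0.0055, 0.0293),
    10: (26.850, 0.0051, 0.0073),
    1000: (0.8559, 0.0053, 0.0040),
    1000000: (0.0, 0.0048, 0.0038),
}
JAVA_WINDOWS = {
    1: (274.626, 0.008, 0.0295),
    10: (28.783, 0.0095, 0.0078),
    1000: (0.713, 0.0063, 0.0106),
    1000000: (0.0, 0.0109, 0.0094),
}


def total_speedup(st1, dt1, dt2):
    """Speedup of disk-only over in-memory: (ST1 + DT1) / DT2."""
    return (st1 + dt1) / dt2


def speedups(table):
    """Map each added string length to its total speedup."""
    return {size: total_speedup(*times) for size, times in table.items()}


linux_speedups = speedups(JAVA_LINUX)
windows_speedups = speedups(JAVA_WINDOWS)

for size in sorted(JAVA_LINUX):
    print("%9d bytes: Linux %.1f, Windows %.1f"
          % (size, linux_speedups[size], windows_speedups[size]))
```

For single-byte increments both ratios come out above 9,000, and for 1,000,000-byte increments both are close to one, in line with the discussion of Figure 1.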

Figure 2 shows the speedup in disk operations for the in-memory case (a single access) vs. the disk-only
case (many disk accesses), calculated as the disk-only time divided by the in-memory disk time. As expected,
less time is spent on disk operations in the in-memory case, but this saving is insignificant compared to
the total running time of the code, so the advantage is of no practical use. Theoretically, both times should be equal
at the end, with the ratio approaching one. Normal system performance variations beyond user control
have caused the ratio to get close to one, but not exactly reach it. These variations are more significant
when running times are shorter, as happens at the right-hand side of the graph.
[Plot: disk speedup (vertical axis, 0 to 6) against added string length in bytes (logarithmic horizontal axis, 1 to 1000000). Series: Disk speedup (Linux), Disk speedup (Windows).]
Figure 2. In-memory vs. disk-only speedup for Java: DT2/DT1
As can be seen from Figures 1 and 2, the same phenomenon is observed in both Windows and Linux. This
could be influenced by the Java Virtual Machine, which creates a layer of uniformity in both cases.
2.2 Python Experiment
To test a case where the Java Virtual Machine is not involved, we tried the Python code in Appendix 2. Table 3
shows the Linux results, where Python 2.6.6 was used to run the code.
Added string length   In-memory: string            In-memory: single write   Disk-only: writes
(bytes)               concatenation time (ST1)     to disk time (DT1)        to disk time (DT2)
1                     78.057                       0.00179                   0.393
10                    20.480                       0.00077                   0.0379
1,000                 0.343                        0.00145                   0.0041
1,000,000             0.0                          0.00168                   0.0022
Table 3. Average Python running times in seconds, measured under Linux.
The Windows results appear in Table 4, and were generated using Python 2.7.6.
Added string length   In-memory: string            In-memory: single write   Disk-only: writes
(bytes)               concatenation time (ST1)     to disk time (DT1)        to disk time (DT2)
1                     113.537                      0.00547                   0.205
10                    11.925                       0.00565                   0.0254
1,000                 0.1238                       0.0126                    0.0055
1,000,000             0.0                          0.00587                   0.00489
Table 4. Average Python running times in seconds, measured under Windows.
The change in relative performance is not as linear as in the Java case, but with Python we observe the
same phenomenon under both Windows and Linux: for shorter concatenated strings the in-memory computation is hundreds of times slower than the disk-only case. As a side note, Python seems
to be more efficient than Java in our string tests, but we are only interested in the relative performance
of the in-memory and disk-only approaches. The point to consider is that even though the two languages
and run time systems are different, the general performance trend is comparable between Java and
Python, pointing to a systemic and language-independent phenomenon.
2.3 Explaining the Results
The results are easy to explain. In high-level languages such as Java and Python, a seemingly benign
statement such as concatString += addString may actually involve executing many extra cycles behind
the scenes. To concatenate two strings in a language such as C, if there is not enough space to expand
concatString to the size it needs to hold the additional bytes from addString, then the
developer has to explicitly allocate new space with enough storage for the sum of the sizes of the two
strings, copy concatString to the new location, and then finally perform the concatenation. In Java
and Python strings are immutable, and any assignment will result in the creation of a new object and
possibly copy operations, hence the overhead of the string operations. The disk-only code, although
apparently writing to the disk excessively, only triggers an actual write when the intermediate buffers,
such as those kept by the operating system, are full. In other words, the operating system already lessens disk access times. A developer
familiar with the language and system internals readily notices the causes of this observed behaviour,
but the behaviour is easily missed, as indicated by similar cases we have seen in production code.
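The effect of buffering on the number of actual writes can be illustrated with a small Python 3 sketch. It is not the measurement code of this paper: CountingRawWriter is a hypothetical stand-in for a disk file that only counts the low-level write calls it receives, and the 8192-byte buffer size is an assumption made for the example.

```python
import io


class CountingRawWriter(io.RawIOBase):
    """Hypothetical stand-in for a disk file; counts low-level writes."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0
        self.bytes_written = 0

    def writable(self):
        return True

    def write(self, data):
        # Called by the buffered layer only when it hands data downwards.
        self.write_calls += 1
        self.bytes_written += len(data)
        return len(data)


def count_low_level_writes(num_writes, chunk=b"1", buffer_size=8192):
    """Issue num_writes small writes through a buffer.

    Returns (low-level write calls, total bytes that reached the raw writer).
    """
    raw = CountingRawWriter()
    buffered = io.BufferedWriter(raw, buffer_size=buffer_size)
    for _ in range(num_writes):
        buffered.write(chunk)
    buffered.flush()  # push out whatever is still held in the buffer
    return raw.write_calls, raw.bytes_written


calls, total = count_low_level_writes(100000)
print("application-level writes: 100000")
print("low-level writes: %d, bytes delivered: %d" % (calls, total))
```

The application issues many small writes, but only a small number of them reach the layer below, which is the mechanism that keeps the disk-only code fast.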
The above explanation applies to any data structure that has to be stored contiguously and increases in
size, or is immutable. It is possible to improve on the slow execution described above. For a mutable string, one can
allocate more storage than is immediately needed for a concatenation operation. Also, if the data
structure, a string in this case, is stored on the stack, then it may be possible to perform concatenation
very efficiently, by placing the added string at the top of the stack and simply adjusting the stack pointer.
Since the stack size can be increased very easily, and no copying of the whole string is required, in this
case an in-memory operation will be efficient.
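The benefit of allocating more storage than is immediately needed can be estimated with a simple cost model. The Python 3 sketch below counts how many bytes of existing content are copied under two hypothetical growth policies: an exact-fit policy that reallocates on every append, and a policy that doubles the capacity when it runs out. It is a model under these stated assumptions, not a measurement of any particular run time; the function names and the initial capacity of 16 bytes are ours.

```python
def bytes_copied_exact_fit(total_bytes, chunk_size):
    """Bytes of existing content copied when every append reallocates."""
    copied = 0
    length = 0
    while length < total_bytes:
        copied += length      # existing content moves to the new buffer
        length += chunk_size
    return copied


def bytes_copied_doubling(total_bytes, chunk_size, initial_capacity=16):
    """Bytes of existing content copied when capacity doubles on overflow."""
    copied = 0
    length = 0
    capacity = initial_capacity
    while length < total_bytes:
        if length + chunk_size > capacity:
            while capacity < length + chunk_size:
                capacity *= 2
            copied += length  # one copy per reallocation
        length += chunk_size
    return copied


exact = bytes_copied_exact_fit(1000000, 1)
doubling = bytes_copied_doubling(1000000, 1)
print("exact fit: %d bytes copied" % exact)
print("doubling:  %d bytes copied" % doubling)
```

For 1,000,000 single-byte appends, the exact-fit model copies roughly 5 x 10^11 bytes in total, while the doubling model copies about 10^6 bytes, which is why over-allocation removes most of the overhead.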
With the code in Appendix 2, both Python implementations were slow in string concatenation, but that does
not necessarily mean they store the strings in the land-locked heap space. This is because the
concatenation statement in Appendix 2, concatString = addString + concatString, places the addString
contents at the beginning of the target string (vs. the end), so moving concatString is required, and a
stack allocation does not help with performance. To verify this statement, we re-ordered the
concatenation statement to concatString = concatString + addString, and ran the code on both Windows
and Linux again. The slow Windows concatenation results did not change, while for Linux the modified in-memory version was faster than the disk-only version, hinting at a stack allocation scheme. Table 5
provides the Linux running times for the modified code, where the in-memory version's performance
dramatically improved when string concatenations were not accompanied by copy operations.
Added string length   In-memory: string            In-memory: single write   Disk-only: writes
(bytes)               concatenation time (ST1)     to disk time (DT1)        to disk time (DT2)
1                     0.284                        0.00155                   0.423
10                    0.0276                       0.000135                  0.0405
1,000                 0.00109                      0.00144                   0.00283
1,000,000             0.0                          0.00160                   0.0023
Table 5. Average modified Python running times in seconds, measured under Linux.
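Readers who wish to repeat the comparison between the two concatenation orders on their own system can use a sketch along the following lines. It is written for Python 3, uses a smaller number of iterations than the paper so that it finishes quickly, and makes no claim about which order is faster on a given interpreter or platform; the function names are ours.

```python
import timeit


def build_by_prepending(chunk, count):
    """Original order from Appendix 2: new content goes in front."""
    result = ""
    for _ in range(count):
        result = chunk + result
    return result


def build_by_appending(chunk, count):
    """Modified order: new content goes at the end."""
    result = ""
    for _ in range(count):
        result = result + chunk
    return result


def time_builder(builder, chunk="1", count=20000):
    """Return (elapsed seconds, length of the string that was built)."""
    start = timeit.default_timer()
    text = builder(chunk, count)
    elapsed = timeit.default_timer() - start
    return elapsed, len(text)


prepend_time, prepend_len = time_builder(build_by_prepending)
append_time, append_len = time_builder(build_by_appending)
print("prepending: %.6f s for %d bytes" % (prepend_time, prepend_len))
print("appending:  %.6f s for %d bytes" % (append_time, append_len))
```

Both functions produce the same content here, since every added chunk is identical; only the running times may differ.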
As a last test, we declared concatString to be a global variable, which sets the allocation scheme to
heap, so the variable is accessible from other name scopes. In the modified concatenation Python code,
the in-memory execution times increased dramatically to the values reported in Table 3, confirming our
explanation. In other words, a global scope guarantees slow in-memory execution of both the original
and modified Python code.
Java performance numbers did not change when the concatenation order was reversed in the code in
Appendix 1. However, using a mutable data type such as StringBuilder or StringBuffer dramatically
improved the results.
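Python offers comparable ways to accumulate content without building a new string on every step. The sketch below, written for Python 3, shows two common ones: collecting the pieces in a list and joining them once, and writing into an in-memory io.StringIO buffer. These are illustrations of the general idea and were not part of the experiments reported here; the function names are ours.

```python
import io


def build_with_join(chunk, count):
    """Collect the pieces in a list and join them in a single step."""
    parts = []
    for _ in range(count):
        parts.append(chunk)
    return "".join(parts)


def build_with_stringio(chunk, count):
    """Write the pieces into a mutable in-memory text buffer."""
    buffer = io.StringIO()
    for _ in range(count):
        buffer.write(chunk)
    return buffer.getvalue()


joined = build_with_join("1", 1000000)
buffered = build_with_stringio("1", 1000000)
print(len(joined), len(buffered))
```

In both variants the final string is created once, at the end, instead of once per added chunk.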
It is important to emphasize that the in-memory performance problem is not caused by heap versus
stack memory allocation, as evidenced by the poor in-memory results of the Python code in Appendix 2. The
problem is caused by data copy operations in the main memory, whether in the heap or stack space. We
also see that immutable strings are not inherently a problem, as evidenced by Python's much better
performance with the modified code.
3. Discussion
These widely varying performance results were obtained by small changes in the code, and
understanding the reasons required a closer look at the system-level execution environment. Most recent
code development efforts concern higher-level domains such as web development. Software
practitioners at this level often work with multiple layers of abstraction, well away from the operating
system and language run time levels. In the examples of inefficient code that inspired this paper, and we
suspect in many other similar cases, the developers have done what they have been trained to do,
carefully reducing disk access, but the approach is clearly failing.
We feel that with the push towards computing systems with huge amounts of memory, and ultimately to
diskless systems which rely only on main memory, there is a necessity to have a holistic approach to
ensuring high software performance. To fully utilize the emerging hardware, we need to re-examine how
operating systems and language run times manage and utilize memory. We also need to make sure
current and future developers are familiar with the fundamental concepts and principles that impact
software performance.
4. Conclusion
Although in numerous cases in-memory computing is faster than an equivalent algorithm that accesses
disk, the real-life-inspired counterexamples presented in this paper show that this is not always the case.
We argued that in-memory computation cannot guarantee high software performance, and that careful
examination of the code, along with knowledge of hidden factors such as system and language library
routines and operating system internals, has an important role in the achieved performance, or lack of
it. More specifically, in our case memory management caused a significant slow-down for in-memory
computation. Many of the factors affecting performance are outside developers' care or control, and
developers may not even be aware of the underlying algorithms and their implications. This justifies our emphasis on
1) re-examining system-level algorithms with in-memory operations in mind, and 2) better training to
make developers familiar with system-level software intricacies. Doing so would help in-memory
computing better deliver on its potential and promise.
References
[1] Gill, J., Shifting the BI Paradigm with In-Memory Database Technologies, Business Intelligence Journal,
volume 12 (2): 58-62, 2007
[2] Karedla, R., Love, J.S., and Wherry, B.G., Caching strategies to improve disk system performance,
Computer, volume 27 (3): 38-46, 1994
[3] Kreibich, J.A., Using SQLite, O'Reilly Media, Inc., 2010
[4] Lacaze, P.C. and Lacroix, J.C., Non-volatile Memories, John Wiley & Sons, 2014
[5] Narayanan, D. and Hodson, O., Whole-system Persistence with Non-volatile Memories, Seventeenth
International Conference on Architectural Support for Programming Languages and Operating Systems
(ASPLOS 2012), London, England, UK, March 3-7, 2012
[6] Tiwari, S., Professional NoSQL, John Wiley & Sons, 2011
[7] http://www.businessinsider.com/hp-shows-off-new-kind-of-computer-2014-6
Appendix 1. Java code

// By Kamran Karimi (kkarimi@ucalgary.ca)
// Complete version. Runs each experiment 10 times
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

class Test {
    public static void main(String[] args) {
        // Number of bytes added to the file content in each step.
        int numAdd = 1; // Additional changes are needed when numAdd = 1e6
        int NUM_ITERATIONS = 10;
        long totalMemory = 1000000L; // total amount of memory in bytes

        // Build the string that is added in each step.
        String addString = "";
        for (int i = 0; i < numAdd; i++) {
            addString += "1";
        }

        double[] stringTimes = new double[NUM_ITERATIONS];
        double[] fileTimes1 = new double[NUM_ITERATIONS];
        double[] fileTimes2 = new double[NUM_ITERATIONS];

        for (int count = 0; count < NUM_ITERATIONS; count++) {
            stringTimes[count] = fileTimes1[count] = fileTimes2[count] = 0;
            BufferedWriter writer;

            // First part: in-memory
            long numIter = totalMemory / addString.length();
            String concatString = "";
            long startTime = System.currentTimeMillis();
            for (int i = 0; i < numIter; i++) {
                concatString += addString;
            }
            long endTime = System.currentTimeMillis();
            double stringTime = (endTime - startTime) / 1000.0;
            stringTimes[count] = stringTime;

            // Single write of the complete in-memory content.
            try {
                writer = new BufferedWriter(new FileWriter("test.txt"));
                startTime = System.currentTimeMillis();
                writer.write(concatString);
                writer.flush();
                writer.close();
                endTime = System.currentTimeMillis();
            } catch (IOException e) {
                e.printStackTrace();
            }
            double fileTime = (endTime - startTime) / 1000.0;
            fileTimes1[count] = fileTime;

            // Second part: disk-only
            try {
                writer = new BufferedWriter(new FileWriter("test.txt"));
                startTime = System.currentTimeMillis();
                for (int i = 0; i < numIter; i++) {
                    writer.write(addString);
                }
                writer.flush();
                writer.close();
                endTime = System.currentTimeMillis();
                fileTime = (endTime - startTime) / 1000.0;
                fileTimes2[count] = fileTime;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Average the measurements over all iterations.
        double stringMean = 0;
        for (int i = 0; i < stringTimes.length; i++) {
            stringMean += stringTimes[i];
        }
        stringMean /= stringTimes.length;

        double fileMean1 = 0;
        for (int i = 0; i < fileTimes1.length; i++) {
            fileMean1 += fileTimes1[i];
        }
        fileMean1 /= fileTimes1.length;

        double fileMean2 = 0;
        for (int i = 0; i < fileTimes2.length; i++) {
            fileMean2 += fileTimes2[i];
        }
        fileMean2 /= fileTimes2.length;

        System.err.println("In-memory mean: string time " + stringMean);
        System.err.println("In-memory mean: file time " + fileMean1);
        System.err.println("Disk-only mean: file time " + fileMean2);
    }
}
Appendix 2. Python code

# By Kamran Karimi (kkarimi@ucalgary.ca)
# Short version. Runs each experiment once
import timeit

numAdd = 1  # Additional changes are needed when numAdd = 1000000
totalMemory = 1000000  # bytes
#global concatString # global ensures a slow-down under Linux

# Build the string that is added in each step.
addString = ""
for i in range(0, numAdd):
    addString = addString + "1"

# First part: in-memory
numIter = int(totalMemory / len(addString))
concatString = ""
f = open('test.txt', 'w')
start = timeit.default_timer()
for i in range(0, numIter):
    concatString = addString + concatString  # modified: concatString = concatString + addString
stop = timeit.default_timer()
stringTime = stop - start
# Time the single write separately from the concatenation.
start = timeit.default_timer()
f.write(concatString)
f.flush()
f.close()
stop = timeit.default_timer()
fileTime = stop - start
print("in-memory: String took " + str(stringTime) + ", file took " + str(fileTime))

# Second part: disk-only
numIter = int(totalMemory / len(addString))
f = open('test.txt', 'w')
start = timeit.default_timer()
for i in range(0, numIter):
    f.write(addString)
f.flush()
f.close()
stop = timeit.default_timer()
fileTime = stop - start
print("disk-only: file took " + str(fileTime))
