
ITDSD - 4. Quantitative Analysis of Distributed Software


sun.shuo@aliyun.com

15 Jun 2019 CPOL

Introduction to Distributed System Design - 4. Quantitative Analysis of Distributed Software

Introduction
This is the fifth article on distributed architecture. It explains how to calculate the success rate of a single request and the success rate of repeated requests in a distributed system in order to evaluate the system's stability, how to evaluate performance and redundancy from the software architecture, and how to identify potential performance bottlenecks through system analysis. The aim is to provide data support for the design of distributed systems.

You can click on the links below to find the first four articles:

Distributed Method of Web Service


ITDSD - 1. Splitting in Microservice Architecture
ITDSD - 2. Practice of Splitting in Microservice Architecture
ITDSD - 3. Overview of Distributed Software Engineering

A Network With Shared Data Must Be an Asynchronous Network
When multiple tasks attempt to use the same piece of data, they usually need to queue up and use it in turn. Using the data simultaneously can corrupt it: each task modifies its own part of the data separately, leaving the data as a whole corrupted. The fact that multiple tasks cannot run simultaneously on a single piece of data is called data atomicity. The data then acts as a protocol with a consistency function among the tasks, so it plays an irreplaceable role in their operational relationship. Alternating use of the data ensures that tasks exchange information through the data's address. The asynchronous and synchronous models describe the execution state of tasks while data is in use: for example, while one task is using the data, another task is waiting, and the waiting task is said to execute synchronously.


Figure 1.

In a distributed environment, the client and the server trigger the execution of related tasks through messages. A request sent by a client task triggers the execution of a task on the server. Usually the client blocks its task and waits until the server has executed the request and returned. Blocking ensures that the client's state does not change before the server returns, which allows the server to continue working with the client's current state after it responds. We therefore say the client executes the request synchronously, while the server executes it asynchronously: after processing the request, the server does not block waiting for further instructions from that client, which ensures the server can process other clients' requests in turn. Whether a party waits on the connection or only while processing a request is an important difference between synchrony and asynchrony.

Figure 2.

To prove that a network composed of single-threaded servers with shared data is necessarily asynchronous, first assume the opposite: after creating a link, both client and server work synchronously, while the server is still expected to share data with other clients. That is, after either party has processed a message, it blocks and waits for the other party to send the next message. We call this way of linking the fully synchronized mode. In fully synchronized mode, as long as one client maintains its link, the server cannot process other clients' requests. If the client's link time is unbounded, the time occupied on the server is also unbounded, so the server cannot handle other clients' requests for an unbounded period of time. This is shown in Figure 3.

Figure 3.

A server that cannot share data with other clients for an unbounded period of time contradicts our assumption that data is shared. So we get Theorem 1.

Theorem 1. A Network Composed of Single-Threaded Servers Sharing Data Must Be an Asynchronous Network.
An asynchronous network with shared data is characterized by consistency with a delay time t. Consistency with delay time t means that the information obtained by any receiver is the information as it existed at the sender a time t earlier; the sender does not guarantee that the information has not changed during that interval. The delay time t is a corollary of physical conservation laws: server hardware obeys them, and therefore so does the transmission of any information across that hardware. This gives us Inference 1.

Inference 1. Information Synchronization Between Any Two Points Will Have a Delay Time t.
Inference 1 shows that the data obtained at any point is nominally old. Consider, for example, a game score a user queries on a server: by the time the user sees it, the server may already have updated it to a different value. If we want to guarantee that the data a user sees stays consistent while it is being used, we must lock the server, which turns the link into synchronous mode, and Theorem 1 shows that a network in synchronous mode cannot share data. Inference 1 also shows that the transmission of information is directional. This transmission triggers the execution of tasks, and the resulting information is likewise directional. Following this direction gives us the basis for quantitative analysis of distributed networks, because the hardware and software along the direction of task execution constitute the distributed system.

Probability of Serialization and Concurrency


When a client sends a request to a distributed system, the system executes the request on a series of servers. Mark these servers 1 to n, where n is an arbitrary integer. Assume each server has a probability of failure, covering hardware downtime, software crashes and so on. If we denote the success probability of server i as xi, the probability formula tells us that the probability of successful serial execution across the servers is x1*x2*x3*...*xn; this is Formula 1. For example, suppose a request must pass through a gateway server, a web server, a memory database and a hard disk database before returning to the client. Assuming the success probabilities of these four servers are 99%, 90%, 96% and 96%, the probability of a successful request is 0.99 * 0.9 * 0.96 * 0.96 = 0.82.

Although the probability of failure of a single request may be significant, the concurrency probability formula applies when the user repeats the request. For independent repeated requests, the probability that at least one succeeds is 1 - (1-x1)*(1-x2)*(1-x3)*...*(1-xn); this is Formula 2. The probability that at least one of two repeated requests succeeds at any one time is 1 - (1-0.82)*(1-0.82) = 0.97, so repeating the request raises the probability by about 15 percentage points. The increased success rate of repeated requests offsets the decline in the success rate of serial execution. Although this looks very similar to the mathematical concept of idempotency, Theorem 1 tells us that a distributed system is a dynamic system, so the two requests are independent both in their results and in their processes.
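As a quick check of Formulas 1 and 2, here is a minimal Python sketch that reproduces the numbers above; the helper names and the four-server chain are illustrative values taken from the example in the text.

```python
from functools import reduce

def serial_success(probs):
    """Formula 1: probability that every server in a serial chain succeeds."""
    return reduce(lambda a, b: a * b, probs, 1.0)

def repeated_success(p_single, attempts):
    """Formula 2: probability that at least one of several independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

# Gateway, web server, memory database, hard disk database (values from the text).
chain = [0.99, 0.90, 0.96, 0.96]
p = serial_success(chain)
print(f"single request: {p:.2f}")                       # ~0.82
print(f"two attempts:   {repeated_success(p, 2):.2f}")   # ~0.97
```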

A system may also be a complex combination of serial and concurrent requests. For example, consider a system of three servers s1, s2 and s3, with success probabilities of 90%, 95% and 99% respectively, running a software system with three requests r1, r2 and r3. Request r1 calls servers s1 and s3, written r1(s1, s3); the other two requests are r2(s1, s2, s3) and r3(s2, s3). Their success probabilities are r1: 0.9*0.99 = 0.89, r2: 0.9*0.95*0.99 = 0.85 and r3: 0.95*0.99 = 0.94, and the probability that the system keeps responding to at least one kind of request is 1 - (1-0.89)*(1-0.85)*(1-0.94) = 99.9%. In other words, the probability that all services of the system stop responding is 0.1%. This is shown in Figure 4.

Figure 4.
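The same two formulas can be combined to reproduce the Figure 4 example. The sketch below is illustrative only; the server names and the request-to-server mapping are the ones given in the text.

```python
# Figure 4 example: three servers, three request paths.
servers = {"s1": 0.90, "s2": 0.95, "s3": 0.99}
requests = {"r1": ["s1", "s3"], "r2": ["s1", "s2", "s3"], "r3": ["s2", "s3"]}

success = {}
for name, path in requests.items():
    p = 1.0
    for s in path:
        p *= servers[s]              # Formula 1 along each request path
    success[name] = p

all_fail = 1.0
for p in success.values():
    all_fail *= (1.0 - p)            # probability that every request type fails

print(success)                                     # r1 ~0.89, r2 ~0.85, r3 ~0.94
print(f"system responding: {1.0 - all_fail:.3f}")  # ~0.999
```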

Characteristics of Memory Database and Hard Disk Database


Although repeated requests increase the probability that the system responds, the overall stability of the system does not depend only on the probability of a successful response. If the database becomes unusable, data is lost, and data loss can make the system lose its service capability entirely. Suppose a system has only a memory database and no hard disk database, and the probability that the memory database runs normally is 96%. Once the memory database fails and its data is lost, the system completely loses its ability to run, so the upper limit of the system's overall stability is 96%. The memory database is like the shortest stave of a barrel: it sets the low point of the overall stability of the system.

Because a hard disk database can ensure that data survives for a certain period and is available again after the service restarts, the hard disk database sets the upper limit of system stability. As long as the hard disk is not damaged to the point of being unusable and losing data, restarting the system will not make the system unusable. A hard disk failure rate report [1] from the data backup vendor Backblaze puts the failure rate of its most heavily used drives at 2.22%. By mirroring data across two hard disks (RAID 1; RAID 0 offers no redundancy), the data loss rate falls to about 0.05% (0.0222 squared is roughly 0.0005), and more backups reduce the probability of data loss further. Using a hard disk database therefore greatly improves the stability of the system.

Although the memory database is far less stable than the hard disk database, it runs much faster: random reads can be roughly a thousand times faster, and sequential reads and writes roughly ten times faster. Ordered from most stable to least stable, we have the hard disk database, the memory database and the service container; ordered from fastest to slowest response, we have the service container, the memory database and the hard disk database. Response speed here means the response to a user's request. If the service container uses neither the memory database nor the hard disk database, it processes requests fastest, and because each request is answered faster, it supports the highest amount of concurrency. Of course, stability here is relative. Even the least stable component, the service container, is frequently updated by people during active development, and any update can lead to potential downtime; a properly managed service container can still ensure more than 95% smooth operation. Considering that any piece of hardware has a 1% to 5% chance of failure, the software stability rate is very close to the hardware stability rate. The memory database therefore mainly serves as a temporary data backup, preserving data in use while the service container is being replaced.

Ambiguity
When the same data exists in multiple places in a distributed system, for example when user account information is stored simultaneously in the memory database and the hard disk database, we call this data ambiguity. From Inference 1 we know that this ambiguity carries the consistency of delay time t, and ambiguity in a distributed system can lead to irreconcilable data errors. Suppose a system accepts user data in the service container, returns a response, and then propagates the data in turn to the memory database and the hard disk database. This is shown in Figure 5.

Figure 5.

Assume that during propagation the write to the hard disk database fails, while the operation has already been reported to the user as successful. When the user goes offline, the entry in the memory database is deleted, and when the user comes online again, the data is found to be lost. This is an ambiguity between what was returned to the user and what was actually stored.

Another possibility is that the write to the hard disk database succeeds while the write to the memory database fails. The user's queries then always return wrong data, which leads the user to repeat the operation and further enlarges the impact of the error.

During transmission from the service container to the memory database and the hard disk database, the delay time t exists, so an error in this interval can cause some devices to be written successfully while others fail, producing ambiguity. Because the asynchronous system may be carrying out other asynchronous operations while the write is in progress, deleting the erroneous operations is extremely difficult. This ambiguity error in distributed systems is caused by our attempt to reverse the propagation direction of delay-time-t consistency: the original direction of transmission is from service container to memory database to hard disk database, yet we try to recover service container errors using data from the memory or hard disk database. The delay time t and the availability probabilities of the components inevitably create a risk of data errors. High performance comes with higher risk, and high stability with lower risk; the risk can be reduced but never eliminated completely.

By distributing the system, the risk can be reduced further, for example by splitting it into multiple subsystems that process different requests separately.


Figure 6.

At this point we have a new way to deal with system risk. Formula 2 shows that a distributed system can greatly reduce the probability that the whole system becomes unusable. When we accept higher system risk in exchange for higher performance, distribution greatly reduces the risk to the whole: the number of users affected by a failure drops to 1/n, where n is the number of partitions the users are distributed across. For example, suppose a website is designed as a service container, a memory database and a hard disk database. One day a promotional campaign brings in an unexpected flood of users, and hard disk write I/O is congested during the half-hour peak. The website adopts a partial-performance design and ignores the failed hard disk database writes. Within that half hour, none of the operations of users who were already logged in are affected, except that users cannot log in. If the congestion lasted more than an hour, the memory database would be cleared and the users' operations during that period would be lost. The part affected by the congestion is two parts of a ten-part distributed system, so at most two tenths of all users could potentially lose data.

Computer systems themselves also contain a propagation process based on the consistency of delay time t: data moves from the CPU cache to memory and then to the hard disk. There is likewise a probability of data inconsistency in this process; it is simply very low because the times involved are very short.

Figure 7.


Another common source of ambiguity is the backup server created for a primary server. When the primary server's data has not yet been propagated to the backup server, a crash of the primary server results in data loss, and starting the backup server then causes old data to overwrite new data. How often this happens depends on two factors: the number of synchronizations initiated and the time each synchronization takes. Assume that server downtime will inevitably occur at some point during synchronization; the more synchronizations there are and the longer they take, the greater the impact. On the Internet, the number of synchronizations per unit time can be approximated by the number of visitors per unit time, so the effect of an interrupted synchronization can be expressed as Formula 3.

Number of Users Affected = Number of Users Online Per Unit Time × (Synchronization Time / Unit Time)

For example, if 3,000 users are online per second and the synchronization time is 1 millisecond, the number of users affected by an interruption is 3000 × (1/1000) = 3. Formula 3 shows that the fewer users are online and the shorter the synchronization time, the fewer users are affected by wrong data.
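A minimal sketch of Formula 3 as written above; note that the formula is read as a product so that shorter synchronization windows affect fewer users, matching the conclusion in the text, and the one-second unit time is an assumption.

```python
def users_affected(online_per_unit_time, sync_time, unit_time=1.0):
    """Formula 3: users whose activity falls inside the synchronization
    window when the primary server goes down mid-synchronization."""
    return online_per_unit_time * (sync_time / unit_time)

# 3,000 users online per second, 1 ms synchronization window (values from the text).
print(users_affected(3000, 0.001))  # -> 3.0
```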

When a server is prone to downtime, we try to use multiple backups to eliminate the impact. Counterintuitively, this can make things worse, because backing up between servers takes much longer than backing up between disks: it has to go through the network and several layers of software. The more backups there are, the more time synchronization takes, and Formula 3 shows that the more time is consumed, the more users are affected by erroneous data. Increasing the number of backups can therefore lead to a large increase in ambiguity errors. The suggestion is to use hardware solutions for backup wherever possible, which greatly reduces the damage caused by synchronization time.

Propagation of AP-Tree Structure


The process of transmitting data from the place where it was changed to other hardware is called the AP method or the AP relationship. Keeping the data on only one piece of hardware would cause other hardware to be blocked while that hardware is in use. Propagation means that the data read is not up to date or absolutely consistent, but consistent subject to the delay time t. Data propagated under an AP relationship therefore cannot be used for operations with atomic relationships; that is, data read from an AP network cannot be written back into the AP network, because the delay time t would cause older data to overwrite newer data. This is shown in Figure 8.

Figure 8.

Server backups can produce exactly this situation of old data overwriting new data. When the primary server goes offline and the standby server comes online, new requests are applied on top of the standby server's old data, which can lead to data confusion. For example, the primary server for a product records 0 items in stock while the backup server records 1; when the backup server comes online, users can buy items that do not exist.

The AP relationship gives us access that is not blocked under almost any circumstances, and it can be extended through tree-shaped propagation. This extension brings almost unlimited traffic and hardware capacity, which makes it a very important design method in the distributed domain, used in content distribution, DNS, ZooKeeper and other important distributed tools. Its propagation forms are also varied, such as gossip and routing protocols. Alternatively, we can use blocking to ensure that data is not affected by the delay time t, or a competitive algorithm such as Bitcoin's to eliminate the effect of the delay time t.

In a tree-divergent network such as ZooKeeper, there are two different operations, read and write. Formula 2 can be used directly to calculate the probability that a read succeeds. The probability of a write is divided into stages: the probability of writing to the origin is simply the probability that the origin is available, and when the origin diffuses to the next layer, the probability that diffusion continues is obtained by applying Formula 2 to all nodes in the current layer; that is, if diffusion to any node in the layer succeeds, the diffusion continues. Once a write begins to spread through the AP relationship, the more nodes that have already been reached successfully, the higher the probability of completing the whole diffusion, and the success rate of reads is surprisingly high. This is shown in Figure 9.

Figure 9.
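The layer-by-layer reasoning can be sketched as follows. This is an illustration only, assuming every node is equally available and that diffusion continues as long as at least one node in the current layer receives the write; the tree shape and the 95% availability are made-up example values, not ZooKeeper's actual replication protocol.

```python
def diffusion_success(layer_sizes, node_availability):
    """Probability that a write keeps spreading through every layer of a tree,
    where a layer succeeds if at least one of its nodes receives the write
    (Formula 2 applied per layer)."""
    p_total = node_availability                              # writing to the origin
    for size in layer_sizes:
        p_layer = 1.0 - (1.0 - node_availability) ** size    # Formula 2 for the layer
        p_total *= p_layer
    return p_total

# Example: an origin plus layers of 3 and 9 nodes, each node available 95% of the time.
print(f"{diffusion_success([3, 9], 0.95):.4f}")
```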

Protocol and Data


Theorem 1 shows that distributed networks are built around shared data in asynchronous networks. When an external request arrives at the network, it finds the corresponding data set and triggers tasks that operate on that data set. When a data set has multiple backups in a distributed network, one copy must be selected for writing and then diffused to the other backups. The copy chosen for writing must be unique in the distributed network; otherwise data ambiguity arises. Once the data set is selected, the operations on that data set are fixed to it: the selected point becomes the hardware and software intersection for processing the specified data set, and thus the protocol for operations on that data set. Note that this is not limited to the data itself. The selected protocol must be unique even when there are multiple data backups in the distributed system. In other words, a distributed system is built around the data protocol, and the entity of this protocol is the hardware that stores the data. Data that does not become the protocol is merely backup data or read-only data, and writing to backup or read-only data is invalid.

The data already exists once a computer in the distributed cluster produces a result data set from computation. We may not report successful execution to the client until the result has been stored in the memory database or the hard disk database, but whether or not the report is returned does not affect the stability of the data; even the hard disk can lose data, only with a relatively low probability. Data is generated at a source and then diffused to other hardware, and the source is the point where tasks and data are linked. This protocol is not only a protocol between tasks, but also a protocol between tasks and data; for example, it may be named "user information" with respect to the data and the registration task.

Since it is called a protocol, it is directional: a task is processed until its result is published as a data write, and the written data is then diffused through the AP relationships of the distributed system.


Figure 10.

Reading and writing data are two operations, but from the point of view of a task they are its input and its output. Writing data is therefore an important basis for judging the atomic relationship between tasks: in an asynchronous network any task may execute at any time, so any data written jointly by two tasks constitutes a write conflict, and those two tasks form an atomic relationship.

Atomic relations in asynchronous networks are an interesting concept. For example, my wife and I both like steak, so either of us may cook steak at any time, which means use of the frying pan should be queued up; otherwise I may end up with the frying pan and no beef while my wife has beef and no frying pan, and nobody can make steak. So every time someone wants to make steak, they should check the calendar to see whether the frying pan is free. The fact is that I have not cooked steak in a year, and only my wife uses the frying pan. This example shows that potential conflicts are related to the frequency of occurrences: when the frequency is low, the probability of conflict is low. Now suppose this frying pan is the only one in our community, and any resident can use only this pan to make steak. That situation fits the definition of an atomic relation in an asynchronous network better: I have an atomic relationship with all my neighbors over the frying pan.

Performance Analysis in Distributed System Design


In a small system with only one service container and one hard disk database, a request that goes from the service container to the database must pass through (a) the service software, (b) the service container, (c) the network link, (d) the database software and (e) the hard disk, so its success probability is (a)*(b)*(c)*(d)*(e) = F. If the data is stored directly on the hard disk, the request passes only through (a) the service software, (b) the service container and (e) the hard disk, and its success probability is (a)*(b)*(e) = G. Because 0 < (c)*(d) < 1, F < G: the success rate of a single request in a small system with a service container and a hard disk database is lower than in a system with the service container alone. However, the system can retry indefinitely to improve the success rate of requests, so the overall stability of the system depends on the stability of the hard disk; both F and G use the hard disk, so their overall stability is the same. The situation changes when multiple service containers share a central database. This is shown in Figure 11.


Figure 11.
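A small sketch of the F versus G comparison; the individual probabilities below are made-up illustrative values, since the text does not specify them.

```python
# Hypothetical availabilities for (a) service software, (b) service container,
# (c) network link, (d) database software, (e) hard disk.
a, b, c, d, e = 0.99, 0.98, 0.995, 0.97, 0.978

F = a * b * c * d * e   # request that also traverses the database tier
G = a * b * e           # request served from the local hard disk only
print(f"F = {F:.3f}, G = {G:.3f}, F < G: {F < G}")
```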

I used the community-shared frying pan as an example. If one service container is using the frying pan, another service container requesting it must wait; the parallel system then degenerates into a serial system and obeys Amdahl's law. That is, the parallelized portion of the system has been sped up, while the serialized, blocking portion sits in the database. If serialization degrades the database's processing performance, affecting its data distribution function and its parallel processing ability, the overall processing capacity of the system declines. Assuming serialization reduces each service container's effective throughput by the same factor x, the combined effect over n links in series is x*x*...*x = x^n; that is, serialization in the database can lead to an exponential decline in overall processing power. The more problems arise, the greater the performance demand becomes and the more serious the problem gets. The serialized part has never been well solved; for example, one task that locks Table 1 and then requests Table 2 and another task that locks Table 2 and then requests Table 1 can form a deadlock. This is shown in Figure 12.

Figure 12.
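To make the Amdahl's-law argument concrete, here is a small sketch that computes the overall speedup for a given serial fraction. The 20% serial share is the figure quoted later in the article; the hardware counts are illustrative.

```python
def amdahl_speedup(serial_fraction, n_units):
    """Amdahl's law: overall speedup when only the parallel fraction
    benefits from n_units of hardware."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)

# With 20% of the work serialized, speedup is capped at 5 no matter how much
# hardware the parallel part receives.
for n in (1, 2, 10, 100, 1000):
    print(f"{n:5d} units -> speedup {amdahl_speedup(0.2, n):.2f}")
```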

In Internet software systems, 80% of the data is inactive; it is stored on the hard disk and does not need to be computed, while the remaining 20% is the data that currently needs to be computed. The role of the hard disk database is to manage that 80% of the data and to provide multi-dimensional big data analysis. When data is activated because a user uses it, it needs to be transferred to the service container for processing. If the hard disk array holds 288 TB, the system needs about 56 TB of memory to run. A single server cannot provide that much memory, so the system needs to be distributed across servers.

Centralizing and stripping data sets helps distribute the parallel parts and helps optimize the performance of distributed systems, for example by reducing the computing load on the hard disk database, but it does not help the serialized part. Its essence is the AP of software, that is, the availability and the protocol of the data. As Figure 11 shows, improving data availability only improves distributed performance up to a point; beyond that level, the system is limited by the serialized part, because the serial part of the system is highly concentrated in the database. Even two serializations that do not interact with each other must both be performed in the database, because the database cannot know how the software system uses data and establishes atomic relationships. For a database, serialization is possible between any two tables, and because it must prepare for every possibility, it can only keep all tables together. This centralization means the functions undertaken by the database cannot be decomposed, because the serialized part must run in a single thread to guarantee the atomicity of processing. Amdahl's law confirms the consequence: the parallel part of the software can use as much computing power as is provided to absorb access pressure, so its performance scales linearly with the amount of hardware, while the serial part can only run in a single service container and cannot gain performance from additional hardware. The performance of the serial part is therefore a fixed value in the system. Take the central database of the earlier example: since the serial part of the whole system executes in that database, the ratio of the serial part's computing demand to the computing capacity of its service container is the performance ratio of the whole distributed system. For example, if the CPU utilization of the central database is currently 50%, then, because all serial operations run there, 50% of the computing power of the whole distributed system has been used.

Suppose we have two central databases handling the serialized part, one at 100% CPU usage and the other at 50%. Is our overall usage 75%? Obviously not: a server at 100% is near collapse, with a large backlog of computing demand it cannot handle. Utilization below 100% can be taken to mean the demand is being met; once any server exceeds 100%, it is beyond the normal range of use.

It can be seen that in a distributed system, performance is determined by the serial part, that is, the part with atomic relationships: in the RP method, these are the tasks that write data, or that read and write data. The serial part is not a single indivisible whole; it can be divided according to the different data it writes. Consider two Amdahl systems running on the same set of hardware.

If the two serial parts are placed on the same hardware, overall performance is low; if they are split across two different pieces of hardware, overall performance improves. A complex software system may contain multiple Amdahl systems. Finding these Amdahl systems and placing each on its own hardware can greatly raise the processing ceiling of a distributed system. The RP method is the way to find these hidden Amdahl systems and thereby fundamentally raise the upper limit of a distributed system's processing capacity.

Once we have found the hidden Amdahl systems in a software system, we can estimate the upper limit of the distributed system's carrying capacity by multiplying the capacity of each Amdahl system by the hardware assigned to it. For example, suppose we have two Amdahl systems, payment and purchase, and a single user request in each takes about 0.001 seconds to process. Placing the two Amdahl systems on two separate servers gives each an upper limit of 1,000 requests per second, so the upper limit for concurrent payments and purchases is about 2,000 per second in total.
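A quick back-of-the-envelope check of that capacity estimate; the per-request service time and the one-server-per-Amdahl-system assignment are the assumptions from the example above.

```python
# Each Amdahl system (payment, purchase) is pinned to its own server and
# handles one request at a time in its serial section.
service_time_s = 0.001                    # seconds per request (from the example)
per_system_limit = 1.0 / service_time_s   # requests per second per Amdahl system

amdahl_systems = ["payment", "purchase"]
total_limit = per_system_limit * len(amdahl_systems)
print(f"per system: {per_system_limit:.0f}/s, total: {total_limit:.0f}/s")  # 1000, 2000
```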

Conclusion
The software system is composed of parallel part and serial part. The serial part is composed of several serial parts. According to
Amdahl's theorem, the performance of the parallel part is proportional to the number of hardware. The serial part is proportional to
the hardware performance. Because the serial part usually accounts for only 20% of the software system, the distributed system
usually only solves 80% of the parallel part. Adding up to 80% of the software systems will not result in high-intensity concurrent
access. So only about 4% of distributed systems involve high-intensity access. And only a very small number of top companies pay
attention to these 4% serial issues. So all kinds of problems that serial problems may bring are often overlooked. Even these serial
problems are simply regarded as system bugs. It's not that companies and systems in general don't need distributed systems.
Because problem is only a lower probability. Once problems arise, it will be a fatal blow to the company's technical team.
Technicians should be concerned about any possible system defects.

In the next chapter, I will use an online shopping mall website to illustrate how to create a distributed system and how to balance slicing, caching and the hard disk.

Reference
1. https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2019/


License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author


sun.shuo@aliyun.com
Distributed Technology Professional
China
