[Diagram: the Current Model and an Individual Document are passed to the Classify Algorithm, which outputs a Classified Document.]
Additive Learning
Figure 2 represents the current process by which Kira Contract Analysis decision models are created:
all documents, both previously learned from and newly annotated, are scanned for user examples, and
a machine learning algorithm then creates a decision model that completely replaces the old decision
model.
[Figure 2 diagram. Labels: Old Model; Previously Classified Documents; New Classified Documents; Learning Algorithm; New Model.]
Figure 2: The current process of creating a completely new model each time it is run.
We wish to optimize this process by reusing the information already present within the old model and
simply adding the information contained within the new documents. The process is outlined in
Figure 3, where the old model is used instead of the previously processed documents.
[Figure 3 diagram. Labels: Old Model; New Classified Documents; Learning Algorithm; New Model.]
Figure 3: The old model is reused and extended with newly added documents, without referencing the
raw old documents themselves.
This involves modifying Kira Contract Analysis in two ways: first, old documents are no longer
parsed; second, the mathematical process within the algorithm is modified so that any
previously useful computation is taken into account when creating an updated model.
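The two changes above can be sketched as follows. This is a minimal illustration only: it assumes a model whose state is a set of per-label feature counts, which is a hypothetical layout chosen because counts can be added together. The letter does not say which learning algorithm Kira Contract Analysis uses, and the names `train_from_scratch` and `update_model` are invented for this example.

```python
from collections import Counter, defaultdict


def train_from_scratch(documents):
    """Build a model by scanning every annotated document (the Figure 2 process)."""
    model = defaultdict(Counter)
    for label, features in documents:
        model[label].update(features)
    return dict(model)


def update_model(old_model, new_documents):
    """Fold only the new documents into a copy of the old model (the Figure 3 process)."""
    # Copy the old counts so the old model itself is left unchanged.
    model = defaultdict(Counter, {label: Counter(c) for label, c in old_model.items()})
    for label, features in new_documents:
        model[label].update(features)
    return dict(model)


# Hypothetical annotated documents: (label, list of features).
old_docs = [("lease", ["term", "rent"]), ("governing_law", ["Ontario", "laws"])]
new_docs = [("lease", ["rent", "tenant"])]

old_model = train_from_scratch(old_docs)
additive = update_model(old_model, new_docs)    # never reads old_docs
full = train_from_scratch(old_docs + new_docs)  # rescans everything

print(additive == full)  # True
```

For a count-based model like this one, the additive update gives exactly the same result as a full retrain. A learner whose state is not a simple sum would need a more careful update rule.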
Because disk and memory operations are disproportionately expensive compared to mathematical
operations, avoiding the processing of old documents should dramatically improve the learning speed
of the system. Modifying the mathematical process of the learning algorithm will require some thought
and rigorous testing to ensure that no inadvertent error is introduced into the system. To guard
against this, we will use a commonly accepted method: comparing the output of the new engine to that
of the old one across a battery of tests to ensure that the results are the same under the same
conditions.
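The comparison described above could be organized along these lines. This is a sketch under assumptions: `compare_engines` is a hypothetical helper, and the two stand-in engines below are trivial placeholders, not the actual old and new learning engines.

```python
def compare_engines(old_engine, new_engine, test_cases):
    """Return the test cases on which the two engines disagree.

    Each engine is any callable that maps a test case to a result.
    """
    mismatches = []
    for case in test_cases:
        expected = old_engine(case)
        actual = new_engine(case)
        if expected != actual:
            mismatches.append((case, expected, actual))
    return mismatches


# Placeholder engines for demonstration; real ones would train and classify.
def old_engine(features):
    return sorted(set(features))


def new_engine(features):
    return sorted(dict.fromkeys(features))


battery = [["rent", "term", "rent"], ["Ontario"], []]
mismatches = compare_engines(old_engine, new_engine, battery)
assert not mismatches, f"engines disagree on {len(mismatches)} case(s)"
```

Returning the full list of mismatches, rather than stopping at the first one, makes it easier to see whether a discrepancy is isolated or systematic.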
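[Figure 4 diagram. Labels: Individual Document; Feature Generator; Hashing; Current Model; Classify Algorithm; Classified Document. Human-readable features such as "Ontario", "Date of" and "Lease" pass through Hashing and become numbers (for example, "Ontario" becomes "095463"); the Current Model holds only such numbers ("857324", "432789", "873465", "120943", ..., "095463").]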
Figure 4: Large numbers replace human-readable user examples, making the recovery of the information
nearly impossible.
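The replacement of examples by numbers in Figure 4 can be illustrated with a short sketch. The letter does not name the hashing algorithm, so the use of SHA-256 and the reduction to a six-digit code (chosen to resemble the figure) are assumptions, as is the function name `hash_feature`.

```python
import hashlib


def hash_feature(text: str) -> str:
    """Map a human-readable feature to a six-digit code, as in Figure 4."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    number = int.from_bytes(digest, "big") % 1_000_000
    return f"{number:06d}"


features = ["Ontario", "Date of", "Lease"]
hashed_model = {hash_feature(f) for f in features}

# Only numeric codes remain; the original words are not stored.
print(hashed_model)
```

Note that reducing the hash to six digits means that two different features can occasionally map to the same code.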
A sophisticated adversary with resources could attempt to recover the contents of the training examples
through the use of a dictionary attack. Should the adversary acquire both the model and knowledge
of the hashing algorithm used, they could hash every word in an English dictionary in
order to identify limited parts of the deleted documents. To mitigate this, we add a salt, an
arbitrary string (we use Deloitte here), to the input of the hashing algorithm, perturbing its output
in a predictable way.
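To make the risk concrete, the dictionary attack on an unsalted model can be sketched as below. The hash function (SHA-256 reduced to six digits), the word list and the helper names are all assumptions made for illustration; the letter does not specify them.

```python
import hashlib


def hash_feature(text: str) -> str:
    """Hypothetical unsalted hash: SHA-256 reduced to a six-digit code."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    number = int.from_bytes(digest, "big") % 1_000_000
    return f"{number:06d}"


def dictionary_attack(model_hashes, dictionary):
    """Return the dictionary words whose hash appears in the model."""
    return {word for word in dictionary if hash_feature(word) in model_hashes}


# The model stores only hashes of the (now deleted) training examples.
model_hashes = {hash_feature("Ontario"), hash_feature("Lease")}

dictionary = ["Lease", "Ontario", "Quebec", "Tenant"]
recovered = dictionary_attack(model_hashes, dictionary)
print(recovered)  # includes "Ontario" and "Lease"
```

The attack works only because the adversary can reproduce the exact input-to-hash mapping used to build the model.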
[Figure 5 diagram. Without a salt, "Ontario" is passed to Hashing and becomes "095463". With the salt "Deloitte", the input becomes "OntarioDeloitte", which Hashing turns into "976543".]
Figure 5: Adding a recurring string, a salt, to the input of the hashing function makes it more
difficult to recover the contents of the model.
Figure 5 represents the process in which the Deloitte string is concatenated to every example
before it is passed on to the hashing algorithm. This changes the input to the algorithm in a
predictable fashion known only to the computer code, so its results no longer match the hashes
produced without the salt.
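The concatenation step in Figure 5 can be sketched as follows. As before, the choice of SHA-256 and the six-digit reduction are assumptions, and `salted_hash` is a hypothetical name; only the salt value and the order of concatenation come from the figure.

```python
import hashlib

SALT = "Deloitte"  # the example salt used in the text


def hash_feature(text: str) -> str:
    """Hypothetical hash: SHA-256 reduced to a six-digit code."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    number = int.from_bytes(digest, "big") % 1_000_000
    return f"{number:06d}"


def salted_hash(example: str, salt: str = SALT) -> str:
    """Concatenate the salt to the example, then hash the result."""
    return hash_feature(example + salt)


# "Ontario" is hashed as "OntarioDeloitte", so a table of unsalted
# dictionary hashes no longer lines up with the values in the model.
print(salted_hash("Ontario") == hash_feature("OntarioDeloitte"))  # True
```

The same salt must be applied both when the model is built and when new documents are classified; otherwise the hashed features would not match.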
Hence, an adversary with a copy of the model and knowledge of the hashing algorithm would be unable
to perform a dictionary attack, since they would not know the salt value in use. Given sufficient
resources, a dedicated adversary could eventually determine the salt; however, we believe this to be
beyond the capacity of most large corporations.
In closing, the use of additive learning within Kira Contract Analysis will lead to increased performance
from the point of view of the end user while reducing the storage and computational load of Kira Contract
Analysis. Furthermore, it will enable the deletion of documents from the document repository while
retaining the documents' value in training the system. Using sophisticated hash-generation techniques,
we believe that this is done in a way that prevents the contents of the documents from being recovered.
Yours truly,
Robert Warren