learningFromDocs PDF

Kira Systems
263 Adelaide St. W - 350

Toronto, ON
M5H 1Y2
April 1, 2015
Uday Bagampalli
Deloitte
Toronto, ON
M5J 2T3
Dear Uday,
As a follow-up to our phone conversation, I wanted to expand on the changes that will be made to
the machine learning aspects of Kira Contract Analysis. These changes serve two
objectives: additive learning and learning from deleted documents.
At the heart of Kira Contract Analysis is a machine learning algorithm that locates relevant information
within documents based on a series of user examples. The search/classification component relies
on a decision model that is created offline: the model is built once from a series of
examples and then re-used over and over again.
[Diagram labels: Current Model; Individual Document; Classify Algorithm; Classified Document.]
Figure 1: The primary mechanism of classifying documents.

Additive learning is a process by which the machine learning aspects of Kira Contract Analysis will be
optimized to learn incrementally from newly annotated / classified documents instead of reprocessing
all documents each time.
Learning from deleted documents is an addition that will permit the machine learning algorithm to
remember the annotations/classifications made by the users without requiring the documents to be
available within Kira Contract Analysis.
In the following two sections, I will explain how the system works, what is being modified within the
system, and the benefits and potential issues associated with these modifications.
Additive Learning
Figure 2 represents the current process by which Kira Contract Analysis decision models are created:
all documents, both previously learned from and newly annotated, are scanned for user examples, and
a machine learning algorithm then creates a decision model that completely replaces the old decision
model.
[Diagram labels: Old Model; Previously Classified Documents; New; Learning Algorithm; New Model.]
Figure 2: The current process of creating a completely new model each time it is run.
We wish to optimize this process by re-using the information already present within the old
model and simply adding the information contained within new documents. The process is outlined in
Figure 3, where the old model is used instead of the previously processed documents.
[Diagram labels: Old Model; New; Learning Algorithm; New Model.]
Figure 3: The old model is reused and extended with newly added documents, without referencing the
raw old documents themselves.
This involves modifying Kira Contract Analysis in two ways: first, old documents are no longer
parsed; second, the mathematical process within the algorithm is modified so that any
previously useful computation is taken into account when creating an updated model.
Because disk and memory operations are disproportionately expensive compared to mathematical operations, avoiding the processing of old documents will dramatically improve
the learning speed of the system. Modifying the mathematical process of the learning algorithm will
require some thought and rigorous testing to ensure that no inadvertent error has been introduced into the
system. We will make use of a commonly accepted method to prevent this by comparing the output of
the new engine to the old one in a battery of tests to ensure that the results are the same under the
same conditions.
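The idea can be illustrated with a minimal Python sketch. The letter does not say which learning algorithm Kira uses, so a simple count-based model stands in for it here as an assumption; the function names and sample documents are hypothetical. The sketch shows a model built from all documents, a model built by adding only the new documents to the old model, and the kind of comparison test described above.

```python
from collections import Counter

def count_features(documents):
    """Count (feature, label) pairs over a list of annotated documents."""
    counts = Counter()
    for features, label in documents:
        for feature in features:
            counts[(feature, label)] += 1
    return counts

def train_from_scratch(all_documents):
    """Current process: rescan every document and build a new model."""
    return count_features(all_documents)

def train_additively(old_model, new_documents):
    """Proposed process: reuse the old model and add only the new documents."""
    new_model = Counter(old_model)  # copy, so the old model is left untouched
    new_model.update(count_features(new_documents))  # adds counts together
    return new_model

# Hypothetical annotated documents: (list of features, user label).
old_documents = [(["lease", "term"], "relevant"), (["weather"], "irrelevant")]
new_documents = [(["lease", "ontario"], "relevant")]

old_model = train_from_scratch(old_documents)
batch_model = train_from_scratch(old_documents + new_documents)
additive_model = train_additively(old_model, new_documents)

# Regression check: both engines must agree under the same conditions.
assert additive_model == batch_model
```

In this toy model the additive update never touches `old_documents`, which is where the savings in disk and memory operations would come from. A real learning algorithm may need more care than adding counts, which is why the comparison test matters.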
Learning from deleted documents

A feature that is of interest is the ability to remove documents from Kira Contract Analysis while
retaining the examples used to train the machine learning algorithm.
With the modifications made to the engine to support additive learning, it is possible to remove a
document while keeping the user's training examples in the model. However, although removing the file
from the system means that the file is no longer there, the information contained within the training
examples can still be recovered from the model. This is a potential problem: in certain cases it may
be possible to recover parts of the documents even though the documents themselves have been deleted.
To this end, a modification can be added to both the learning and classification algorithms whereby the
individual training examples are converted to long numbers using what is called a hashing algorithm.
This algorithm functions in one direction only, so that while the string "Ontario" in Figure 4 will give
the number 095463, it is not possible to reverse the process to get the string "Ontario" from the number
095463.
The machine learning/classification algorithm itself does not have a semantic understanding of what a
feature such as "095463" or "Ontario" actually means. Since the substitution of numbers for features
is consistent across each document each time the algorithm is run, the learning/classification can occur
without actually storing the human readable fragment from the original text. Thus, we can make the
claim that not only has the original document been deleted, but that the fragments within the model
itself have no meaning to a human being either.
[Diagram labels: Individual Document; Feature Generator; features "Ontario", "Date of", "Lease", ...; Hashing; hashed features "095463", "573245", "212357", ...; Current Model containing "857324", "432789", "873465", "120943", ..., "095463"; Classify Algorithm; Classified Document.]
Figure 4: Large numbers replace human-readable user examples making the recovery of the information
nearly impossible.
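The substitution of features by numbers can be sketched in Python as follows. The letter does not name the hashing algorithm, so SHA-256 from the standard library is an assumption here, as are the helper name `hash_feature` and the reduction to six digits, which only mimics the short numbers used in the figures.

```python
import hashlib

def hash_feature(feature: str, digits: int = 6) -> str:
    """Map a human-readable feature to a fixed-length string of digits.

    SHA-256 and the six-digit reduction are illustrative assumptions; a
    real system would keep far more of the digest to avoid collisions.
    """
    digest = hashlib.sha256(feature.encode("utf-8")).hexdigest()
    return str(int(digest, 16) % 10**digits).zfill(digits)

# Hypothetical features extracted from a document.
features = ["Ontario", "Date of", "Lease"]
hashed_features = [hash_feature(f) for f in features]

# The same feature always gives the same number, so learning and
# classification can both operate on the numbers alone.
assert hash_feature("Ontario") == hash_feature("Ontario")
```

Only `hashed_features` would need to be stored in the model; the readable strings in `features` can be discarded once hashing is done.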
A sophisticated adversary with resources could attempt to recover the contents of the training examples
through the use of a dictionary attack. Should the adversary acquire both the model and knowledge
of the hashing algorithm used, they could hash every word within an English dictionary in
order to identify limited parts of the deleted documents. To mitigate this, we add a salt to the hashing
algorithm: an arbitrary string (we use "Deloitte" here) that perturbs the output of the hashing algorithm in a
predictable way.
[Diagram: without a salt, "Ontario" is hashed to "095463"; with the salt "Deloitte", the input becomes "OntarioDeloitte" and is hashed to "976543".]
Figure 5: Adding a recurring string, a salt, to the input of the hashing function makes it harder to recover
the contents of the model.
Figure 5 represents the process in which the string "Deloitte" is concatenated to every example
before it is passed on to the hashing algorithm. This essentially changes the input to the algorithm in a
predictable fashion known only to the computer code, so that its results differ from the hashing
results obtained without the salt.
Hence, an adversary with a copy of the model and knowledge of the hashing algorithm would be unable
to perform a dictionary attack, since they would be unaware of the salt value in use. Given sufficient
resources, a dedicated adversary could eventually determine the salt; however, we believe this to be
beyond the capacity of most large corporations.
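A small Python sketch can make the dictionary attack and the effect of the salt concrete. It is an illustration under stated assumptions: SHA-256 stands in for the unnamed hashing algorithm, the function names and the four-word dictionary are hypothetical, and the salt is appended to each example as in Figure 5.

```python
import hashlib

def hash_feature(feature: str, salt: str = "") -> str:
    """Hash a feature with the salt appended (SHA-256 is an assumption)."""
    return hashlib.sha256((feature + salt).encode("utf-8")).hexdigest()

def dictionary_attack(model_hashes, dictionary, guessed_salt=""):
    """Return the dictionary words whose hash appears in the model."""
    return {
        word for word in dictionary
        if hash_feature(word, guessed_salt) in model_hashes
    }

# Hypothetical training examples and attacker dictionary.
examples = ["Ontario", "Lease"]
dictionary = ["Lease", "Ontario", "Quebec", "Tenant"]

unsalted_model = {hash_feature(e) for e in examples}
salted_model = {hash_feature(e, salt="Deloitte") for e in examples}

# Without a salt, hashing the dictionary recovers the examples.
recovered_unsalted = dictionary_attack(unsalted_model, dictionary)

# With a salt the attacker does not know, nothing matches.
recovered_salted = dictionary_attack(salted_model, dictionary)

# An attacker who learns the salt can mount the attack again.
recovered_with_salt = dictionary_attack(salted_model, dictionary, "Deloitte")
```

The last call shows why the salt has to remain known only to the code: its protection holds only for as long as the adversary cannot determine its value.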
In closing, the use of additive learning within Kira Contract Analysis will lead to increased performance
from the point of view of the end user while reducing the storage and computational load of Kira Contract
Analysis. Furthermore, it will enable the deletion of documents from the document repository while
retaining the documents' value in training the system. Using sophisticated hash-generation techniques,
we believe that this is done in a way that prevents the contents of the documents from being recovered.
Yours truly,
Robert Warren
