Capstone Proposal Anderson

Anonymizing Text
Steven Anderson
Directed Group Capstone CST499-30
April 22nd 2018

Table of Contents
Executive Summary
Introduction
Project Objectives
Future Project Improvements
Contributor Objectives
Literature Review
Stakeholders and Community
Approach
Ethical Considerations
Legal Considerations
Project Scope
Timeline
Resources
Milestones
Risks and Dependencies
Final Deliverables
Team Members
Appendix
Usability Testing
References
Executive Summary
A person can be identified online by the way that they write. This capstone project aims to
help users make the text they post online non-identifiable. The final product will be a browser
plugin, and the project will report on how well that plugin works. Current solutions are desktop
applications and similar tools, so a browser plugin should lower the barrier to entry. An easier
solution should let more people adopt the tool and may attract a number of developers to the
project, since current solutions do not seem to be in active development. The plugin will likely
be used alongside the other tools that privacy-conscious people already use. The browser plugin
should make the process completely automated for the user. Results from automated and other
tests will be shown to users so that they are aware of how well the browser plugin works.
Transparency will be important for users and developers alike.
To keep a user’s text from being identified, the browser plugin will change the text after the
user types it. Making the text non-identifiable will involve several steps, including fixing
spelling errors and replacing words with other similar words. Non-identifiable text should help
people maintain their privacy when talking in many different online forums. To encourage
adoption of the browser plugin, the project will be as transparent as possible: it will be open
source, and the results of the research will be published. To further that transparency, every
text transformation method will be documented in plain English, explaining the whole process
that goes into trying to anonymize text.
No population should be negatively affected by the outcome of the capstone. The tool may be
used for negative purposes, but that is not the intention of the capstone project. The capstone
should help people who are concerned about their privacy for a number of different reasons. It
should help those who could be negatively affected simply for speaking their mind, and give them
an easier way to do so with less fear of retaliation. Ways of combating misuse will be taken into
account throughout the lifecycle of the project and after the release of the capstone. One
possible measure is to allow platforms to identify when someone is using the tool. This may be a
problem if platforms do not adopt such identification, and I could see many false positives when
trying to identify output from the browser plugin. If the tool becomes popular enough, platforms
that are negatively affected will be encouraged to take steps to keep people from misusing it.
There will be misuse, and combating it will be a high priority.
The end goal of the project is to allow a user to easily anonymize their text. The project will
be set up so that developers can contribute in the future. The initial release of the plugin will
get the job done, but not very well. Future improvements to the plugin could allow for more
coherent replacements of text and possibly for learning how the user types, to better understand
what to replace. Although current solutions allow users to anonymize their text, there seems to
be a demand for an easier one. By presenting users with an easier solution, the project should
allow a great number of users to adopt the technology because of how low the barrier is compared
to current solutions. The end result should be a working product and a framework that developers
can build on in the future.
Introduction
Many people try to protect their privacy while using the internet. Privacy is something that a
number of people are concerned about, but there are many aspects of it to consider. The way that
a person writes can be used to identify them through a practice called stylometry. Stylometry
looks for patterns in a person’s writing and recognizes them. It can likely be used to identify
people across multiple platforms with a reasonable degree of accuracy, without any other clues
about their identity.
While stylometry can be used to find the author of a piece of text, there are ways to hide who
wrote something. Tools for anonymizing text exist, but they are not very user friendly. The
product that I will be making is a browser plugin that allows a user to easily anonymize their
text in the browser. The browser plugin will take several steps to try to anonymize the user’s
text. The first step is to correct any spelling errors in the text, which removes one way of
identifying a person. The next step is to break down the sentence structure of the text. Once the
text is broken down, the plugin will replace some of the words in each sentence with other words
from a thesaurus that has been built for the project. After these steps have been completed the
text should be anonymized.
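The steps above can be sketched in a few lines. The example below is a minimal illustration written in Python for brevity, not the plugin's implementation (which would run as JavaScript in the browser). The lookup tables `SPELLING_FIXES` and `THESAURUS` and the function names are hypothetical placeholders; a real version would rely on a full spell checker and the thesaurus described in this proposal.

```python
import re

# Hypothetical lookup tables used only for illustration.
SPELLING_FIXES = {"teh": "the", "recieve": "receive", "definately": "definitely"}
THESAURUS = {"big": "large", "quick": "fast", "happy": "glad"}

# Split text into runs of word characters and runs of everything else,
# so that spacing and punctuation are preserved when the text is rejoined.
TOKEN_PATTERN = re.compile(r"[A-Za-z']+|[^A-Za-z']+")


def match_case(original: str, replacement: str) -> str:
    """Copy simple capitalization from the original word onto the replacement."""
    if original.isupper() and len(original) > 1:
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def anonymize(text: str) -> str:
    """Correct spelling, break the text into tokens, then swap in synonyms."""
    output = []
    for token in TOKEN_PATTERN.findall(text):
        key = token.lower()
        if key in SPELLING_FIXES:  # step 1: spelling correction
            token = match_case(token, SPELLING_FIXES[key])
            key = token.lower()
        if key in THESAURUS:  # step 3: thesaurus replacement
            token = match_case(token, THESAURUS[key])
        output.append(token)
    return "".join(output)


print(anonymize("Teh quick dog is happy."))  # The fast dog is glad.
```

Keeping the non-word tokens untouched means the output differs from the input only where a spelling fix or a synonym was applied, which makes the effect of each step easy to inspect.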
Anonymizing text is just one step in protecting one’s privacy online. The end result should help
people who are concerned with privacy. While researching the topic I found no solution that made
anonymizing text as simple as using a browser plugin. The browser plugin should help users keep
different parts of their lives separate as the internet evolves.
Project Objectives
● Remove identifying characteristics of a user’s text
● Build a thesaurus of the most common words for reference
● Correct spelling mistakes in text
● Swap words in text with words from the thesaurus
● Automate the process for the user
● Show how effective the browser plugin is by doing tests
Future Project Improvements
● Look more in depth into stylometry
● Use machine learning algorithms for testing
● Fully automate testing, including test samples
● See how the length of text affects anonymization
● Support other browser platforms; the test case will only support Firefox
● Look into more premium options for testing, i.e. using platforms created by companies
such as IBM to look into the effectiveness of the tool
● Create sentence structure that is emotionally equivalent
● Create sentence structure that is grammatically equivalent
● Create sentence structure that is contextually equivalent
Contributor Objectives
● Contributors will learn how to use many tools to test output
● Contributors will get exposure to different programming languages and styles
● Contributors will learn to simplify abstract problems
Literature Review
Anonymizing text can take a variety of different approaches. Research on the topic shows a
number of ways that people can be identified through stylometry. In one case described by PBS,
an author was identified based on how often a word was repeated (PBS). While the PBS article is
useful, it does not line up exactly with what I am trying to accomplish.
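To make the idea of identifying an author by word repetition concrete, the sketch below compares two texts by their word counts. It is an illustrative assumption of how such a comparison could be done, not the method used in the PBS article; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Return the cosine similarity of two word-count vectors (0.0 to 1.0)."""
    # Counter returns 0 for missing words, so unshared words add nothing.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


known = word_frequencies("I reckon the weather will hold, I reckon it will.")
unknown = word_frequencies("I reckon the train will be late.")
print(cosine_similarity(known, unknown))
```

A higher score suggests that the two texts repeat the same words at similar rates. Replacing words with synonyms, as this project proposes, is one way to push that score down.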
One tool that has given insight into the topic and will be useful during the research is
jstylo (jstylo). Jstylo was developed at the Privacy, Security and Automation Lab at Drexel
University in Philadelphia. Jstylo can be used to attribute works to an author. The other part
of the process, anonymization, has been done in the program anonymouth (anonymouth). Anonymouth
is the implementation of jstylo that does the anonymization. Both tools were written to try to
combat stylometry; the research states that they focus on sentence length, word choices and
syntactic structure (Brennan).
Jstylo and anonymouth seem to be some of the only tools available to anonymize text. There has
been other research into stylometry and into programs that try to combat it, but that research
seems few and far between. Although not specifically part of stylometry, there seem to be machine
learning algorithms that may help to identify the author of a text; this was briefly discussed
in the research on jstylo and anonymouth (Brennan). At this time there is not much research on
anonymization specifically, but there is a large amount of research on stylometry. The research
from jstylo will be invaluable when I start trying to build the browser tool.
Stakeholders and Community
The browser plugin will be built with the privacy of the user in mind. The project will be
completely open source, including any research that is done on the topic. Transparency will be
the goal, including about how effective the tool is when its output is run through different
forms of identification. The likely users are people who are already concerned about their
privacy and who use tools such as virtual private networks and browser plugins such as HTTPS
Everywhere and Privacy Badger.
The project should help those who are concerned about their privacy. As the internet grows, so
does my concern for privacy. I personally believe in building this tool to help others who, like
me, are concerned about their privacy. While there are no formal outside stakeholders, there is
a probable gain in privacy if the tool is created. Nothing will really be lost if the plugin is
not created.
The browser plugin will likely be maintained by me and by other people in the open source
community who are interested in the project. Currently I am the only stakeholder in the project,
with potential stakeholders in the future and a community that may be interested. In the time I
have spent looking, people seem to have been interested in the idea of this project. As stated
previously, jstylo and anonymouth accomplish similar goals but do not have an active community
around them.
The end goal is to make people interested in the project after it is released. Hopefully a
community will form around the plugin once it is released. With more people working on the
project, I hope that it will be able to go further than where it stands when I first release it.
Approach
When anonymizing text it is also important to try to deanonymize it. Before I can begin the
research I will have to lay the groundwork for the project. My first goal is to build the
browser plugin. Once the browser plugin is functional at a basic level, I will move on to
collecting data. I will collect text from various sources online, run it through the browser
plugin, and then test the output against tools such as jstylo (jstylo). With jstylo I will try
to identify the author and see how effective the anonymization is. To collect the data I will
use a variety of Python scripts, manual scraping and downloadable files to get a wide variety of
texts to test on.
When breaking down sentences I will be looking at word choice and the spelling of words. Based
on the research from jstylo and anonymouth, switching different words in a sentence and removing
spelling mistakes should get rid of most of the identifying data in a user’s sentence (Brennan).
I may have to look into other aspects of how a sentence is structured, but that can be a later
addition.
After my initial tests of running outputs through jstylo, I am interested in looking into other
ways of identifying text. The research from anonymouth briefly mentions machine learning as an
avenue for identifying text, and I am interested in exploring it if time permits (Brennan). Once
the plugin is created and the research is done, I will release any tools that I created to
facilitate the research, the research itself, how effective the outcome was, and the browser
plugin itself. Documentation will be included with the research on how everything works,
including how to run the tools, the methods used for scraping data, and how the project works as
a whole.
Ethical Considerations
I cannot foresee any major ethical issues arising during the making of the capstone. As long as
I cite all of my information and comply with all licenses, I do not see any ethical issues during
the building of the capstone. After the capstone is built I could see it being used to plagiarize
material, but I do not believe it will be effective in aiding plagiarism because of how
inconsistent the resulting language would be. After the project is completed, and if time
permits, research will be done to see how it affects plagiarism.
There are not likely to be any underprivileged groups that would be directly negatively impacted
by the capstone. The capstone could potentially be used to negatively impact underprivileged
groups by way of hate speech, but that is not the intention of the project. There is likely no
way to prevent misuse of the tool that is going to be made for the capstone, but sample inputs
and outputs could be published to try to curb it. With those samples, platforms could try to
detect use of the tool; they would most likely not be able to find out who the user is, but they
may be able to stop the usage.
Because the tool is meant to be used by many different users, it is difficult to gauge all of
the ethical concerns with the project. Throughout the development of the project there will be
more investigation into stopping users from misusing the tools made. A potential solution,
besides giving examples, is running identification on outputs from the tool itself to see
whether users can be identified in a similar manner to before their speech was anonymized.
Ultimately it will be up to the different platforms to try to stop users from misusing tools to
anonymize their speech.
Legal Considerations
I will have to be careful of legal issues during development of the capstone. Care will need to
be taken when picking libraries to ensure that the licenses of the libraries and of the final
product do not conflict. In general, open source software is going to be used, so the terms of
each license will need to be met; different licenses have different clauses that must be
fulfilled to use the software.
Another possible concern is the thesaurus that needs to be built. Thesaurus companies may have
to be contacted so that a thesaurus can be built for the capstone. Licensing will need to be
dealt with when building the thesaurus unless an open source or public domain thesaurus can be
used. Besides the issues stated above, there should be no legal problems with the project. As
long as licenses are abided by, and licenses are used that absolve creators of any liability,
there should be no issue with building tools to anonymize text and releasing them for users to
use.
Project Scope
Timeline
Complete by   To complete
Week 1        Thesaurus source
Week 1        Gather open source components
Week 1        Have open source components working individually in the browser
Week 2        Have a prototype of the browser plugin complete
Week 2        Initial testing of the browser plugin
Week 3        Improving the browser plugin with data from testing of the browser plugin
Week 3        Doing additional testing of the browser plugin
Week 4        Look into additional methods for testing output
Week 4        Additional testing
Week 5        Wrap up testing
Week 5        Start getting the browser plugin ready for release
Week 5        Gather data collected to prepare for release
Week 6        Release browser plugin
Week 6        Release data gathered
Resources
● 2 computers, one running Windows 7, one running Windows 10
● Server for redundant backups and running automated tests
● GitHub for version control
● Travis CI for continuous integration
● Thesaurus for common words
● Open source libraries
● Firefox
Milestones
● Assembling components to make browser plugin
● Completing thesaurus
● Checking thesaurus
● Making browser plugin
● Testing browser plugin
● Releasing browser plugin source code
● Releasing information gathered about the project
Risks and Dependencies
● The thesaurus may be challenging to build; it will likely need to be sourced from an online
thesaurus company or found in the public domain.
● JavaScript libraries may have issues running in the browser because they expect a Node.js
environment. There are some ways of overcoming this, but they will need to be investigated
further.
● Special attention will need to be taken to make sure that licenses are not violated. If a
license were violated, code might have to be reimplemented, which could take a great
deal of time if not caught early in the development cycle.
Final Deliverables
The final deliverable will be the browser plugin to anonymize user text. Any research done on
the tool and on how effective it is for the end user will also be released, as will any
modifications made to the components that make up the browser plugin and any tools built for
testing it. Automated testing will have to be done on the browser plugin; these tests will also
be released so that future contributors can run them.
Automated testing will be done with a tool such as Selenium. Results showing how effectively
the browser plugin performs on different kinds of literature will be released. Compilation
instructions and user instructions will also be provided, along with developer information such
as the development environment and dependencies used, for future contributors. The instructions
will cover how to run tests on outputs and how to set up the scraping tools used to test the
browser output. Users will also be directed to where they can download any forks of the projects
that made this project possible. These may include, but are not limited to, jstylo, anonymouth
and components such as the grammar and spell checkers that are to be used in the project.
Research conducted during the project will be very important for showing the pitfalls of the
current state of the tool. Research on how well the plugin works should be redone continuously
to make sure that development is heading in the right direction. Transparency should be
important for a project that deals with one’s privacy, and the project as a whole will try to be
as transparent as possible in the information given to end users and developers alike.
Team Members
Since this is an individual project, I, Steven Anderson, am the only group member. I will be
responsible for all research into the topic of text identification. I will be responsible for
finding all the components to build the browser plugin with, including open source software and
a thesaurus. I will also be responsible for all testing of the browser plugin, for building
tools to validate its output, and for releasing the browser plugin.
The browser plugin will be released on GitHub for people to compile themselves. I would also
like the browser plugin to be available on the Firefox add-on store. Once the browser plugin is
released I will be responsible for publishing information for potential future developers to the
GitHub repository where the code will be hosted. I will be ultimately responsible for all parts
of the capstone project and will see to the best of my ability that they get completed.
Appendix
Usability Testing
A number of tests will have to be done throughout the design lifecycle, starting before the
browser plugin is built. The individual components of the browser plugin will likely need to be
modified to ensure that they work correctly on their own. Individual components, such as the
tree and the spelling check, will be tested. Components will be checked manually to see that
they are working as intended before being merged into the whole project.
Once the project is built, the output will need to be verified manually. Manual verification of
the output of the browser plugin will be done with frameworks such as jstylo. Other tests, such
as using machine learning algorithms, may be done, but more research will need to be completed
to say for certain. Other methods should be used for testing, subject to change as time permits.
Having the product available is the most important part of the project; the plugin can become
more mature after the capstone is completed.
Once initial manual verification is complete, additional verification will be done with similar
tools, modified to allow the process to be automated. Jstylo and other methods will be used for
automatic verification. Existing tools will likely have to be modified, and new tools made, to
allow the process to be automated. Ideally the text being analyzed should come from a variety of
sources, such as public domain books and open forums, so that a wide variety of data can be
checked to see how effective the browser plugin is.
If time permits, it would be useful for the project going forward to set up continuous
integration through a platform such as Travis CI. Some sort of automated testing should be done,
but a continuous integration platform such as Travis may be too complicated to complete within
the 8 weeks that are given for the course. If something such as Travis is integrated, tests that
developers of the project would otherwise set up and run manually should be able to run
automatically.
Ideally, automated testing with jstylo and other tools would run on something such as Travis CI
to make sure there are no regressions in how effective the browser plugin is. Given the nature
of the tool and how it will be tied to the browser, automated testing may not be easy, or
possible at all, unless the browser plugin is split into separate components, such as the code
tying the plugin to the browser and the platform-independent vanilla JavaScript.
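One way a regression check of this kind could work is to store the attribution accuracy measured for each test corpus and fail the automated run when a later measurement is noticeably worse. The sketch below is an assumption about how such a check might look, not a description of an existing test suite; the function name, corpus names and tolerance are hypothetical. Here a higher attribution accuracy means the anonymization has become less effective.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return corpora whose attribution accuracy rose by more than `tolerance`.

    Corpora that have no baseline measurement are skipped.
    """
    return sorted(
        name for name, accuracy in current.items()
        if name in baseline and accuracy > baseline[name] + tolerance
    )


baseline_run = {"forums": 0.30, "books": 0.40}
current_run = {"forums": 0.31, "books": 0.55, "news": 0.90}
print(find_regressions(baseline_run, current_run))  # ['books']
```

A continuous integration job could call a function like this after each change and stop the build when the returned list is not empty.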
Manual verification will also be done by analyzing text by eye. Analyzing text without the use
of any technology may not be the most useful method, but it will provide an additional layer of
testing to see whether the text can be recognized. When looking at many pieces of text from the
same source, it usually becomes obvious who has written something. Looking at the text manually
and trying to identify who wrote it should lend extra credibility to claims about how well the
browser plugin works. Manual verification should also help identify issues with the browser
plugin, such as errors in one’s text that should have been corrected or areas where patterns
are not being erased. Ideally, manual verification should be a blind process: a script gives a
user a number of writings from one source and then provides a number of writings from different
sources, one of which is from the source that the user has already read multiple examples of.
If the user is unable to pick out that writing by looking at the different works, then the
plugin should be working at a base level. Manual testing should be done before and after
automatic testing, but it will be far more time consuming because it needs human interaction
through all steps of the process.
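The blind process described above could be driven by a small script along the following lines. This is a minimal sketch under stated assumptions: `corpus` maps each source to a list of its texts (in practice, the plugin's output for those texts), and the function name, parameters and return format are hypothetical.

```python
import random
from typing import Dict, List


def build_blind_trial(corpus: Dict[str, List[str]], target: str,
                      n_examples: int = 3, n_candidates: int = 4,
                      seed: int = 0) -> dict:
    """Build one blind trial for a human reviewer.

    The reviewer first reads `n_examples` texts by `target`, then sees
    `n_candidates` shuffled texts, exactly one of which is a held-out
    text by `target`. The index of that text is stored under "answer".
    """
    rng = random.Random(seed)
    target_texts = corpus[target][:]
    rng.shuffle(target_texts)
    if len(target_texts) < n_examples + 1:
        raise ValueError("target needs at least n_examples + 1 texts")
    examples = target_texts[:n_examples]
    held_out = target_texts[n_examples]

    # Pool every text written by a source other than the target.
    others = [text for author, texts in corpus.items()
              if author != target for text in texts]
    if len(others) < n_candidates - 1:
        raise ValueError("not enough texts from other sources")

    candidates = rng.sample(others, n_candidates - 1) + [held_out]
    rng.shuffle(candidates)
    return {"examples": examples,
            "candidates": candidates,
            "answer": candidates.index(held_out)}


demo_corpus = {"a": ["a1", "a2", "a3", "a4"], "b": ["b1", "b2"], "c": ["c1", "c2"]}
trial = build_blind_trial(demo_corpus, "a")
print(trial["examples"], trial["candidates"])
```

The reviewer's guess can then be compared with `trial["answer"]`; if reviewers do no better than chance across many trials, that supports the claim that the plugin is working at a base level.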

References
Brennan, M., Afroz, S., & Greenstadt, R. (n.d.). Deceiving Authorship Detection. Retrieved
April 17, 2018, from
events.ccc.de/congress/2011/Fahrplan/attachments/2019_28C3-authorship.pdf
How We Solved It: Stylometric Analysis. (2010, May 3). Retrieved April 17, 2018, from
http://www.pbs.org/opb/historydetectives/blog/how-we-solved-it-stylometric-analysis/
Psal/anonymouth. (2013, October 16). Retrieved April 17, 2018, from
https://github.com/psal/anonymouth
Psal/jstylo. (2016, June 16). Retrieved April 17, 2018, from https://github.com/psal/jstylo
