Anonymizing Text

Steven Anderson

Directed Group Capstone CST499-30

April 22nd 2018


Table of Contents

Executive Summary

Introduction

Project Objectives

Future Project Improvements

Contributor Objectives

Literature Review

Stakeholders and Community

Approach

Ethical Considerations

Legal Considerations

Project Scope
Timeline
Resources
Milestones
Risks and Dependencies

Final Deliverables

Team Members

Appendix
Usability Testing

References

Executive Summary
It is possible to identify a person online by the way that they write. This capstone project
aims to help users make their text non-identifiable online. The final product will be a browser
plugin, and the project will report how well that plugin works. A browser plugin should lower the
barrier to entry compared with current solutions, which are desktop applications and similar
tools. An easier solution should let more people adopt the tool and may attract developers to the
project, as current solutions do not appear to be actively developed. The plugin will likely be
used alongside the other tools that privacy-conscious people already use. The browser plugin
should make the process completely automated for the user. Results from automated and other
tests will be shown to users so that they are aware of how well the browser plugin works.
Transparency will be important for users and developers alike.
To keep a user’s text from being identified, the browser plugin will change the user’s text
after they type it. Making the text non-identifiable will involve several steps, including fixing
spelling errors and replacing words with similar words. By making text non-identifiable, the
project should help people maintain their privacy when talking in many different online forums.
To encourage adoption, the project will be as transparent as possible: it will be open source and
will publish the results of its research. To further transparency, every text transformation
method will be documented in plain English, explaining the whole process that goes into trying
to anonymize text.
There should not be a population that is negatively affected by the outcome of the capstone.
While the plugin may be used for negative purposes, that is not the intention of the project. The
capstone should help people who are concerned about their privacy for a number of different
reasons. It should give those who only want to speak their mind an easier way to do so with less
fear of retaliation. Ways of combating misuse will be taken into account throughout the lifecycle
of the project and after the release of the capstone. One possible way to combat misuse is to
allow platforms to identify when someone is using the tool. This may fall short if platforms do
not adopt such identification, and attempts to identify the browser plugin could produce many
false positives. If the tool becomes popular enough, platforms that are negatively affected will
be encouraged to keep people from misusing it. While there will be misuse, combating it will be a
high priority.
The end goal of the project is to allow a user to easily anonymize their text. The way the
project is set up should allow developers to contribute in the future. The initial release of the
plugin will get the job done, but not very well. Future improvements could allow for more
coherent replacements of text and possibly learn how the user types to better understand what to
replace. Though current solutions allow users to anonymize their text, there seems to be demand
for an easier solution. Presenting users with an easier solution should allow a greater number of
users to adopt the technology because the barrier is so low compared to current solutions. The
end result should be a working product and a framework that developers can build on in the
future.

Introduction

While using the internet, many people try to protect their privacy. Privacy concerns a number
of people, but it has many aspects that one should be concerned about. The way that a person
writes can be used to identify them through a practice called stylometry. Stylometry is the study
of patterns in a person’s writing and the recognition of those patterns. Stylometry can likely be
used to identify people across multiple platforms with a reasonable degree of accuracy without
any other clues about their identity.

While stylometry can be used to find the author of a piece of text, there are ways to hide who
wrote something. Current solutions for anonymizing text exist but are not very user friendly. The
product that I will be making is a browser plugin that allows a user to easily anonymize their
text in the browser. The plugin will take several steps to try to anonymize the user’s text. The
first step is to correct any spelling errors in the user’s text, which removes one way of
identifying a person. The next step is to break down the sentence structure of the user’s text.
Once the text is broken down, the plugin will replace some of the words in the user’s sentences
with other words from a thesaurus that has been built. After these steps have been completed the
text should be anonymized.
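
As a rough illustration of the steps described above, the sketch below corrects spelling against a word list, splits the text into tokens and swaps words that have a thesaurus entry. It is written in Python for brevity, whereas the plugin itself would run as JavaScript in the browser. The word list, the thesaurus entries and the function names are hypothetical stand-ins, and the simple tokenizer is only a placeholder for a real breakdown of sentence structure.

```python
import difflib
import re

# Hypothetical, tiny stand-ins for a real dictionary and thesaurus.
DICTIONARY = ["the", "weather", "is", "very", "good", "today", "really", "nice"]
THESAURUS = {"good": "fine", "very": "quite", "nice": "pleasant"}


def correct_spelling(word: str) -> str:
    """Return the closest dictionary word, or the word unchanged if none is close."""
    if word in DICTIONARY:
        return word
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=0.8)
    return matches[0] if matches else word


def anonymize(text: str) -> str:
    """Correct spelling, then swap every word that has a thesaurus entry."""
    # Split into word and non-word tokens so punctuation and spacing survive.
    tokens = re.findall(r"[A-Za-z']+|[^A-Za-z']+", text)
    out = []
    for token in tokens:
        if token[0].isalpha() or token[0] == "'":
            # Lowercasing is a simplification; it also discards capitalization habits.
            word = correct_spelling(token.lower())
            out.append(THESAURUS.get(word, word))
        else:
            out.append(token)
    return "".join(out)
```

Calling `anonymize` on a sentence with misspellings returns the same sentence with the spelling normalized and the listed words replaced. A real thesaurus would offer several candidates per word, so choosing among them would need more care than this one-to-one mapping.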
Anonymizing text is just one step in trying to protect one’s privacy online. The end result
will hopefully help people who are concerned with privacy. Upon researching the topic, I found
no solutions that made anonymizing text as simple as using a browser plugin. The browser plugin
should help users keep different parts of their lives separate as the internet evolves.

Project Objectives

● Remove identifying characteristics of a user’s text

● Build a thesaurus of the most common words for reference

● Correct spelling mistakes in text

● Swap words in text with words from the thesaurus

● Automate the process for the user

● Show how effective the browser plugin is through testing
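
One way the thesaurus objective could be approached is to count word frequencies in a sample corpus and keep synonym entries only for the most common words. The sketch below is a minimal illustration in Python, assuming a synonym mapping is already available from an open source or public domain thesaurus; the function names and the `synonyms` structure are hypothetical.

```python
import re
from collections import Counter


def most_common_words(corpus: str, limit: int) -> list[str]:
    """Return the `limit` most frequent words in the corpus, most frequent first."""
    words = re.findall(r"[a-z']+", corpus.lower())
    return [word for word, _ in Counter(words).most_common(limit)]


def build_thesaurus(corpus: str, synonyms: dict[str, list[str]],
                    limit: int) -> dict[str, list[str]]:
    """Keep synonym entries only for the most common words in the corpus."""
    common = most_common_words(corpus, limit)
    return {word: synonyms[word] for word in common if word in synonyms}
```

Restricting the thesaurus this way keeps it small enough to ship inside a browser plugin, at the cost of leaving rarer words untouched.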

Future Project Improvements

● Go more in depth into stylometry

● Use machine learning algorithms for testing

● Fully automate testing, including test samples

● See how the length of text affects anonymization

● Support other browser platforms; the test case will only support Firefox

● Look into more premium options for testing, i.e. using platforms created by companies such as IBM to evaluate the effectiveness of the tool

● Create sentence structure that is emotionally equivalent

● Create sentence structure that is grammatically equivalent

● Create sentence structure that is contextually equivalent

Contributor Objectives

● Contributors will learn how to use many tools to test output

● Contributors will get exposure to different programming languages and styles

● Contributors will learn to simplify abstract problems

Literature Review

The process of anonymizing text can take a variety of different approaches. My research
suggests there are a number of ways for people to be identified through stylometry. One way that
PBS was able to identify an author was based on how often a word was repeated (PBS). While the
PBS article is useful, it does not exactly line up with what I am trying to accomplish.

One tool that has given insight into the topic and will be useful during research is jstylo
(jstylo). Jstylo was developed at the Privacy, Security and Automation Lab at Drexel University
in Philadelphia. Jstylo can be used to attribute works to an author. The other part of the
anonymization process is handled by the program anonymouth (anonymouth), which builds on jstylo
to do the anonymization. Both tools were written to try to combat stylometry; the research states
that they focus on sentence length, word choice and syntactic structure (Brennan).
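
To make the kinds of features mentioned above more concrete, the following Python sketch computes an average sentence length and the relative frequency of a few function words. It is only an illustration: the short `FUNCTION_WORDS` list and the feature names are assumptions made for this example and are not the feature set used by jstylo or anonymouth.

```python
import re
from collections import Counter

# A small, illustrative set of function words; real feature sets are much larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def stylometric_features(text: str) -> dict[str, float]:
    """Compute average sentence length and function-word rates for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    features = {"avg_sentence_length": len(words) / max(len(sentences), 1)}
    for word in FUNCTION_WORDS:
        features[f"rate_{word}"] = counts[word] / total
    return features
```

Two texts by the same author would be expected to produce similar feature values, which is what makes features like these usable for attribution.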

Jstylo and anonymouth seem to be some of the only tools available to anonymize text. There has
been other research into stylometry and into programs that try to combat it, but such research
seems few and far between. Machine learning algorithms, while not strictly part of stylometry,
may also help to identify text; this was briefly discussed in the research on jstylo and
anonymouth (Brennan). At this time there is not much research on anonymizing text specifically,
but there is a large amount of research on stylometry in general. The research from jstylo will
be invaluable when I start building the browser tool.

Stakeholders and Community

The browser plugin will be built with the privacy of the user in mind. The project will be
completely open source, including any research that is done on the topic. Transparency will be
the goal, including how effective the tool is when run through different forms of identification.
The likely users will be people who are already concerned about their privacy and use tools such
as virtual private networks and browser plugins such as HTTPS Everywhere and Privacy Badger.

The project should help those who are concerned about their privacy. As the internet grows so
does my concern for privacy. I personally believe in building this tool to help others who, like
me, are concerned about their privacy. While there are no formal outside stakeholders, users
stand to gain privacy if the tool is created. Nothing will really be lost if the plugin is not
created.

The browser plugin will likely be maintained by myself and other people in the open source
community who are interested in the project. Currently I am the only stakeholder in the project,
with potential stakeholders in the future and a community that may be interested. From what I
have seen, people seem to be interested in the idea of this project. As stated previously, jstylo
and anonymouth accomplish similar goals but do not have an active community around them.

The end goal is to make people interested in the project after it is released. Hopefully a
community will be built around the plugin once it is released. With more people working on the
project, I hope that it will be able to go further than my original work and release.

Approach

When anonymizing text it is also important to try to deanonymize it. Before I can begin
working on the project I will have to lay the groundwork for it. My first goal is to build the
browser plugin. Once the browser plugin is functional at a basic level I will move on to
collecting data. I will collect data from various sources online, run it through the browser
plugin and then test the output against tools such as jstylo (jstylo). With jstylo I will try to
identify the author and see how effective the plugin is. To collect the data I will use a variety
of python scripts, manual scraping and downloadable files to get a wide variety of texts to test
on.
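
The evaluation loop described above can be sketched as measuring how often an attribution method names the correct author, before and after a transformation is applied. The Python sketch below uses a very crude word-frequency comparison as a stand-in for jstylo, which is a separate tool and is not called here; the function names are hypothetical, and `transform` is a placeholder for whatever the plugin does to the text.

```python
import math
import re
from collections import Counter
from typing import Callable


def profile(text: str) -> Counter:
    """Word-frequency profile of a text (a crude stylometric fingerprint)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count profiles."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def attribute(text: str, known: dict[str, str]) -> str:
    """Return the known author whose writing is most similar to `text`."""
    target = profile(text)
    return max(known, key=lambda author: cosine(target, profile(known[author])))


def accuracy(samples: list[tuple[str, str]], known: dict[str, str],
             transform: Callable[[str], str] = lambda t: t) -> float:
    """Fraction of (author, text) samples attributed correctly after `transform`."""
    hits = sum(attribute(transform(text), known) == author for author, text in samples)
    return hits / len(samples)
```

Running `accuracy` once with the default identity transform and once with the anonymizing transform gives a before and after comparison; a lower score after the transform suggests the text has become harder to attribute.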
When breaking down sentences I will be looking at word choice and the spelling of words. Based
on the research from jstylo and anonymouth, switching different words in a sentence and removing
spelling mistakes should get rid of most of the identifying data in a user’s sentence (Brennan).
I may have to look into other aspects of how a sentence is structured, but that can always be a
later addition.

After my initial tests of running outputs through jstylo, I am interested in looking into other
ways of identifying text. The research from anonymouth briefly mentions machine learning as an
avenue for identifying text, and I am interested in exploring it if time permits (Brennan). Once
the plugin is created and the research is done, I will release any tools that I have created to
facilitate the research, the research itself, how effective the outcome was and the browser
plugin itself. Documentation will be included with the research on how everything works,
including running the tools, methods for scraping data and how the project works as a whole.

Ethical Considerations

I cannot foresee any major ethical issues occurring while the capstone is being made. As long
as I cite all of my information and properly comply with all licenses, I do not see any ethical
issues during the building of the capstone. After the capstone is built I could see it being used
to plagiarize material, but I do not believe it will be effective for this because of how
inconsistent the resulting language would be. After the project is completed, and if time
permits, research will be done to see how it affects plagiarism.
There are not likely to be any underprivileged groups that would be directly negatively
impacted by the capstone. The capstone could potentially be used to negatively impact
underprivileged groups by way of hate speech, but that is not the intention of the project. There
is likely no way to prevent misuse of the tool, but sample inputs and outputs could be published
to try to curb it. With sample inputs and outputs, platforms could potentially detect use of the
tool; they would most likely not be able to find out who the user is but may be able to stop the
usage.

Because the tool is meant to be used by many different users, it is difficult to gauge all of
the ethical concerns with the project. Throughout development there will be more investigation
into stopping users from misusing the tools that are made. A potential solution, besides
publishing examples, is to run identification on outputs from the tool itself to see whether
users can be identified in a similar manner to before their speech was anonymized. Ultimately it
will be up to individual platforms to try to stop users from misusing tools that anonymize their
speech.

Legal Considerations

I will have to be careful of legal issues during development of the capstone. Care will need to
be taken when picking libraries to ensure that the licenses of the libraries and the final
product do not conflict. In general, open source software is going to be used, and the terms of
its licenses will need to be met, as different licenses have different clauses that must be
fulfilled to use the software.
Another possible concern is the thesaurus that needs to be built. Thesaurus companies may have
to be contacted so that a thesaurus can be built for the capstone. Licensing will need to be
dealt with when building the thesaurus unless an open source or public domain thesaurus can be
used. Besides the previously stated issues there should be no legal problems with the project. As
long as licenses are abided by, and the licenses used absolve creators of any liability, there
should be no issue with building tools to anonymize text and releasing them for users to use.

Project Scope

Timeline

Complete by    To complete

Week 1         Thesaurus source
Week 1         Gather open source components
Week 1         Have open source components working individually in the browser
Week 2         Have a prototype of the browser plugin complete
Week 2         Initial testing of the browser plugin
Week 3         Improving the browser plugin with data from testing of the browser plugin
Week 3         Doing additional testing of the browser plugin
Week 4         Look into additional methods for testing output
Week 4         Additional testing
Week 5         Wrap up testing
Week 5         Start getting the browser plugin ready for release
Week 5         Gather data collected to prepare for release
Week 6         Release browser plugin
Week 6         Release data gathered

Resources

● 2 computers, one running Windows 7, one running Windows 10

● Server for redundant backups and running automated tests

● GitHub for version control

● Travis CI for continuous integration

● Thesaurus for common words

● Open source libraries

● Firefox

Milestones

● Assembling components to make browser plugin

● Completing thesaurus

● Checking thesaurus

● Making browser plugin

● Testing browser plugin


● Releasing browser plugin source code

● Releasing information gathered about the project

Risks and Dependencies

● The thesaurus may be challenging to build; it will likely need to be sourced from an online thesaurus company or found in the public domain.

● JavaScript libraries may be difficult to run because they expect a Node.js environment; there are some ways of overcoming this, but they will need to be investigated further.

● Special attention will need to be taken to make sure that licenses are not violated. If a license were violated, code may have to be reimplemented, which could take a great deal of time if not caught early in the development cycle.

Final Deliverables

The final deliverable will be the browser plugin to anonymize user text. Any research that is
done on the tool and on how effective it is for the end user will also be released. Any
modifications made to the components that make up the browser plugin will be released, as will
any tools that are built for testing the browser plugin. Automated testing will have to be done
on the browser plugin; the tests will also be released so that future contributions can be
tested.

Automated testing will be done with a tool such as Selenium. Results of how effectively the
browser plugin performs on different kinds of literature will be released. Compilation
instructions and user instructions will also be provided. Information such as the developer
environment and the dependencies used will be provided for future contributors. The instructions
will cover how to run tests on outputs and how to set up the scraping tools used to test the
browser output. Users will also be directed to where they can download any forks of the projects
that made this project possible; these may include, but are not limited to, jstylo, anonymouth
and components such as the grammar and spell checkers that are to be used in the project.

Research conducted during the project will be very important for showing the pitfalls of the
current state of the tool. Research on how well the plugin works should be redone continuously to
make sure that development is heading in the right direction. Transparency should be important
for any project that deals with one’s privacy, and the project as a whole will try to be as
transparent as possible with the information given to end users and developers alike.

Team Members

Since this is an individual project I, Steven Anderson, am the only group member. I will be
responsible for all research into the topic of text identification. I will be responsible for
finding all the components to build the browser plugin with, including open source software and a
thesaurus. I will also be responsible for all testing of the browser plugin, for building tools
to validate its output and for releasing it.

The browser plugin will be released on GitHub for people to compile themselves. I would also
like the browser plugin to be available on the Firefox add-on store. Once the browser plugin is
released I will be responsible for publishing information for potential future developers to the
GitHub repository where the code will be published. I will be ultimately responsible for all
parts of the capstone project and will see to the best of my ability that they get completed.

Appendix

Usability Testing

Before the browser plugin is built, a number of tests will have to be done throughout the
design lifecycle. The individual components of the browser plugin will likely need to be modified
to ensure that they work correctly on their own. This means testing individual components such as
the tree, the spelling check and others. Components will be checked manually to see if they are
working as intended before being merged into the whole project.

Once the project is built there will need to be manual verification of the output. Manual
verification of the output of the browser plugin will be done with frameworks such as jstylo.
Other tests, such as using machine learning algorithms, may be done, but more research will need
to be completed to say for certain. Other methods should be used for testing, subject to change
as time permits. The product being available is the most important part of the project; the
plugin can become more mature after the capstone is completed.

Once initial manual verification is complete, additional verification will be done with similar
tools modified to allow the process to be automated. Jstylo and other methods will be used for
automatic verification. Existing tools will likely have to be modified, and new tools made, to
allow the process to be automated. Ideally the text being analyzed should come from a variety of
sources, such as public domain books and open forums, so that a wide variety of data can be
checked to see how effective the browser plugin is.

If time permits, it would be useful for the project going forward to add continuous integration
through a platform such as Travis CI. Some sort of automated testing should be done, but a
continuous integration platform such as Travis may be too complicated to complete within the 8
weeks given for the course. If something such as Travis is integrated, tests that developers of
the project set up manually should be able to run automatically.

Ideally, automated testing would run jstylo and other tools through something such as Travis CI
to make sure there are no regressions in how effective the browser plugin is. Given the nature of
the tool and how tightly it will be tied to the browser, automated testing may not be easy, or
possible at all, unless the browser plugin is split into separate components, such as the code
tying the plugin to the browser and the platform-independent vanilla JavaScript.
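
The split described above can be illustrated with a small sketch: a pure text function that needs no browser, plus a check that a continuous integration job could run against a stored baseline. The sketch is in Python for brevity, although the plugin core would be JavaScript; the function names, the tolerance value and the idea of storing a baseline accuracy are all assumptions made for this example.

```python
def anonymize_core(text: str, thesaurus: dict[str, str]) -> str:
    """Platform-independent transformation: swap words found in the thesaurus."""
    return " ".join(thesaurus.get(word, word) for word in text.split())


def has_regressed(current_accuracy: float, baseline_accuracy: float,
                  tolerance: float = 0.02) -> bool:
    """Return True if attribution accuracy on anonymized text rose past the baseline.

    Lower accuracy is better here: it means the attribution tool identifies
    fewer authors once their text has passed through the plugin.
    """
    return current_accuracy > baseline_accuracy + tolerance
```

Because `anonymize_core` has no dependency on the browser, a test runner can call it directly, and a build could be failed whenever `has_regressed` returns True.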

Manual verification will also be done by analyzing text by eye. Analyzing text without the use
of any technology may not be the most useful method, but it will provide an additional layer of
testing to see if the text can be recognized. When looking at many pieces of text from the same
source, it usually becomes obvious who has written something. Looking at the text manually and
trying to identify who has written it should lend extra credibility to how well the browser
plugin works. Manual verification should also surface issues with the browser plugin, such as
errors in one’s text that are not corrected or areas where patterns are not being erased. Ideally
manual verification should be a blind process in which a script gives a user a number of writings
from one source and then a number of writings from different sources, one of which is from the
source the user has already read. If the user is unable to identify that writing by looking at
the different works, then the plugin should be working at a base level. Manual testing should be
done before and after automatic testing but will be far more time consuming because it needs
human interaction through all steps of the process.
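
A script for the blind process described above might look like the following Python sketch. It picks reference writings from one source, holds one further writing from that source back, mixes it with one writing from every other source and records which candidate is the correct answer. The function name and the dictionary layout are hypothetical, and the sketch assumes the target source has at least one more text than the number of reference examples requested.

```python
import random


def build_trial(texts_by_source: dict[str, list[str]], target: str,
                examples: int, rng: random.Random) -> dict:
    """Build one blind trial: reference texts from `target` plus shuffled candidates.

    Exactly one candidate comes from `target`; the others come from the
    remaining sources, one each. The answer index is kept for scoring.
    """
    pool = list(texts_by_source[target])
    rng.shuffle(pool)
    reference, held_out = pool[:examples], pool[examples]
    candidates = [(target, held_out)]
    for source, texts in texts_by_source.items():
        if source != target:
            candidates.append((source, rng.choice(texts)))
    rng.shuffle(candidates)
    return {
        "reference": reference,
        "candidates": [text for _, text in candidates],
        "answer": [source for source, _ in candidates].index(target),
    }
```

The reader is shown only the `reference` and `candidates` lists; their guess is compared with `answer` afterwards. If readers do no better than chance on anonymized texts, that supports the claim that the plugin is working at a base level.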


References

Brennan, M., Afroz, S., & Greenstadt, R. (n.d.). Deceiving Authorship Detection. Retrieved April 17, 2018, from events.ccc.de/congress/2011/Fahrplan/attachments/2019_28C3-authorship.pdf

How We Solved It: Stylometric Analysis. (2010, May 3). Retrieved April 17, 2018, from http://www.pbs.org/opb/historydetectives/blog/how-we-solved-it-stylometric-analysis/

Psal/anonymouth. (2013, October 16). Retrieved April 17, 2018, from https://github.com/psal/anonymouth

Psal/jstylo. (2016, June 16). Retrieved April 17, 2018, from https://github.com/psal/jstylo
