
NEWS PRODUCT

BONDEVALUE

Report & Recommendations


BondEValue – News Product
1 Module: News sourcing

1.1 Methods

Relevant news articles can be extracted in several ways, depending on the number and quality
of the sources. One option is to use a fixed set of known URLs; each source must be checked
manually beforehand so that the quality and genuineness of its news are established. Another
option is to retrieve articles using relevant keywords. The two methods are:

a. Google alerts – Using fixed number of URLs


b. IBM Watson Discovery

1.1.a Google alerts – Using fixed number of URLs

Good-quality past articles in the various categories are studied to learn which keywords
matter. Analysing this past data reveals the patterns of each news category and its most-used
keywords. Amazon Comprehend is a natural language processing (NLP) service that uses machine
learning to find insights and relationships in unstructured text.

The service identifies the language of the text; extracts key phrases, places, people, brands, or
events (Entities); understands how positive or negative the text is; analyzes text using tokenization
and parts of speech; and automatically organizes a collection of text files by topic.
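
The keyword study described at the start of this section could begin with a simple frequency count per category. The following is a minimal sketch only: the sample headlines, the stop-word list and the function name `top_keywords` are hypothetical and are not taken from the project data.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; a real study would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "for", "on", "its", "is"}

def top_keywords(articles, n=3):
    """Return the n most frequent non-stop-words across a list of article texts."""
    counts = Counter()
    for text in articles:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Hypothetical past articles for one category.
ma_articles = [
    "Bank to acquire rival in merger deal",
    "Merger talks: bank eyes acquisition",
]

print(top_keywords(ma_articles, n=2))
```

Running the same count for each category gives a first list of candidate keywords to review manually.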

1.2 Progress & Results:


Classifier   Train Dataset      Accuracy (%)   Precision/Recall (%)   Remarks
1            800 - Imbalanced   76.5           83/73                  Imbalanced – Poor Performance
2            700 - Balanced     79             82/68                  Low Recall
3            1100 - Balanced    84             56/45                  Low Precision and Recall
4            9000 - Balanced    93.5           93/93                  Good to Go
The table above shows that classifier 4, trained on 9000 articles, gives the best results.
This classifier can be used as the starting point and improved further as more balanced
training data is collected.
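
For reference, the accuracy, precision and recall figures reported above can be computed from a labelled test set as in the sketch below. The label lists are hypothetical and are not the report's data; the class names are borrowed from the categories discussed in this section.

```python
def accuracy(y_true, y_pred):
    """Fraction of articles whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # articles predicted as `label`
    actual = sum(t == label for t in y_true)      # articles that truly are `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Hypothetical true and predicted classes for six test articles.
y_true = ["MA", "General", "Irrelevant", "MA", "General", "Irrelevant"]
y_pred = ["MA", "General", "General", "MA", "Irrelevant", "Irrelevant"]

print("accuracy:", accuracy(y_true, y_pred))
for label in ("MA", "General", "Irrelevant"):
    print(label, precision_recall(y_true, y_pred, label))
```

Reporting precision and recall per class, rather than a single overall figure, makes it easier to see which categories the classifier confuses.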
Takeaway:
Most predictions with relevance scores below 70% belong to the General/MA or Strategy/Irrelevant
categories, and most misclassifications occur in the General and Irrelevant categories. The
General, Irrelevant and MA/Strategy classes are therefore the hardest to separate, and further
training is required to improve performance on them.

1.3 Alternative: IBM Watson Classifier


IBM Watson Natural Language Classifier applies cognitive computing techniques to return best
matching predefined classes for short text inputs, such as a sentence or phrase. It can classify
phrases that are expressed in natural language into categories. It provides easy-to-build text
classification models for identifying best actions, organizing data, or analyzing data for trends.

1.4 Costing Comparison between AWS Comprehend and IBM Watson Classifier:
AWS Comprehend
                                    120000 Articles/Month   150000 Articles/Month
Avg. Article length (Char)          650                     650
Units Charged                       780000                  975000
Price/Unit ($)                      0.0005                  0.0005
Model Training/hour ($)             3                       3
Custom Model Management/month ($)   0.5                     0.5
Total Cost/Month ($)                396.5                   494

Assumptions/Considerations:
a. Average article length = 650 characters
b. Model training hours per month = 2 hours
c. 1 unit = 100 characters
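
The AWS Comprehend totals above follow directly from these assumptions. The sketch below reproduces the arithmetic; the function name and parameter names are illustrative, and the prices are the ones quoted in the table rather than current AWS pricing.

```python
def comprehend_monthly_cost(articles, avg_chars=650, price_per_unit=0.0005,
                            training_hours=2, training_price_per_hour=3.0,
                            management_fee=0.5):
    """Return (units charged, total monthly cost in $) under the report's assumptions."""
    units = articles * avg_chars / 100           # 1 unit = 100 characters
    inference = units * price_per_unit           # per-unit classification charge
    training = training_hours * training_price_per_hour
    total = inference + training + management_fee
    return units, round(total, 2)

for volume in (120000, 150000):
    units, total = comprehend_monthly_cost(volume)
    print(volume, "articles:", units, "units,", total, "USD/month")
```

For 120000 and 150000 articles per month this gives 780000 and 975000 units and totals of 396.5 and 494, matching the table.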
IBM Watson Classifier
                                120000 Articles/Month   150000 Articles/Month
Avg. Article length (Char)      650                     650
Price/API call ($)              0.0035                  0.0035
No. of classifier free/month    1                       1
Cost of Classifier/month ($)    20                      20
Total Cost/Month ($)            436.5                   541.5

Assumptions/Considerations:
d. Average article length = 650 characters
e. Classifiers utilized = 1
f. 4 training events free monthly
Takeaway: AWS Comprehend offers a more cost-effective solution for classification than the
IBM Watson Natural Language Classifier.
1.5 Integration with API:
Amazon API Gateway console can be used to create and test a simple REST API with the HTTP
integration. All AWS services support dedicated APIs to expose their features. However, the
application protocols or programming interfaces are likely to differ from service to service. An
API Gateway API with the AWS integration has the advantage of providing a consistent
application protocol for clients to access different AWS services.
APIs can be deployed according to the server configuration, i.e., through the AWS command line
interface, Java or Python.
To send batches of up to 25 documents, we can use the Amazon Comprehend batch
operations. Calling a batch operation is identical to calling the single-document APIs for each
document in the request. Batch operations allow more documents to be sent in each request,
which may result in higher throughput and better performance for the applications.
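
As an illustration of the 25-document batch limit, the sketch below splits a list of articles into batches and sends each one to the `batch_detect_entities` operation of the boto3 Comprehend client. The helper names (`chunk`, `detect_entities_in_batches`, `make_client`), the default region and the language code are assumptions made for this example; AWS credentials are needed to call the real service.

```python
def make_client(region_name="us-east-1"):
    """Create a Comprehend client (region is an assumption; needs AWS credentials)."""
    import boto3  # imported lazily so the helpers below work without AWS set up
    return boto3.client("comprehend", region_name=region_name)

def chunk(items, size=25):
    """Yield consecutive slices of at most `size` items (25 = batch limit)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def detect_entities_in_batches(client, texts, language_code="en"):
    """Run entity detection on all texts, 25 per request.

    Returns (results, errors): results[i] is the entity list for texts[i],
    or None if that document failed; errors holds (index, error code) pairs.
    """
    results = [None] * len(texts)
    errors = []
    offset = 0
    for batch in chunk(texts):
        response = client.batch_detect_entities(
            TextList=batch, LanguageCode=language_code
        )
        # Index in the response is the position within this batch.
        for item in response.get("ResultList", []):
            results[offset + item["Index"]] = item["Entities"]
        for err in response.get("ErrorList", []):
            errors.append((offset + err["Index"], err["ErrorCode"]))
        offset += len(batch)
    return results, errors
```

A typical call would be `results, errors = detect_entities_in_batches(make_client(), articles)`, where `articles` is a list of article texts.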

Asynchronous batches have the following limits:

Description                                                                         Limit
Maximum size (UTF-8 characters) for one document, entity and key phrase detection   100 KB
Maximum size (UTF-8 characters) for one document, language detection                1 MB
Maximum size (UTF-8 characters) for one document, sentiment detection               5 KB
Total size of all files in batch                                                    5 GB
Maximum number of files, one document per file                                      1,000,000

For more information on the technical side of the implementation, please refer to the following
page https://docs.aws.amazon.com/comprehend/latest/dg/get-started-api.html

1.6 Recommendations:
a. Implementing AWS Comprehend classifier.
b. Further improvements on classifier model using balanced training datasets.
c. Integrating with existing system using AWS APIs.
2 Module: Entity Identification
2.1 Method Used: AWS Comprehend Entity Recognition
AWS Comprehend entity recognition can be used to identify the entities in an article along
with their relevance or confidence scores. Find below some examples of the entity recognition
results. Type gives the entity type (Organization/Location/Quantity/Date etc.) and the
corresponding Score is the confidence score.
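
To show how the Type and Score fields might be consumed, the sketch below groups the entities of one response by type and drops low-confidence ones. The sample response is hypothetical (only the fields used here are shown), and the 0.8 threshold and the function name `entities_by_type` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical response in the shape returned by Comprehend entity detection.
sample_response = {
    "Entities": [
        {"Text": "Acme Corp", "Type": "ORGANIZATION", "Score": 0.98},
        {"Text": "Singapore", "Type": "LOCATION", "Score": 0.95},
        {"Text": "$500 million", "Type": "QUANTITY", "Score": 0.91},
        {"Text": "next year", "Type": "DATE", "Score": 0.42},
    ]
}

def entities_by_type(response, min_score=0.8):
    """Group entity texts by Type, keeping only those with Score >= min_score."""
    grouped = defaultdict(list)
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)

print(entities_by_type(sample_response))
```

With the sample above, the DATE entity is filtered out because its score is below the threshold; lowering `min_score` keeps it.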

Takeaway: The entity recognition tool gives reasonably good results on news articles. The
details it returns can be captured and used to understand the structure of an article and to
extract more information about it.

2.2 Recommendations:
a. Test the entity feature further to explore its utility.
b. The entity recognition feature can be implemented once the classifier is implemented and
performing as expected.
3 Final Recommendations:
• Classifier: AWS Comprehend should be used, as it offers good performance and is cost-effective
  compared to the IBM Watson classifier.
• The tech team should focus its efforts on the key improvement areas in the classifier to
  further improve performance.
• The implementation can be broken broadly into 2 phases:
  - Phase I – The news sourcing module and the classifier module need to be integrated.
    o The tech team should explore the integration between modules and integration with
      the existing solution architecture.
  - Phase II – The solution is to be integrated with the entity recognition module.
    o Once the Phase I implementation is up and running, entity recognition can be
      integrated.
