A REPORT

ON

EXTRACTION OF CHEMICAL COMPONENTS FROM HERBAL SPECTRA USING
INDEPENDENT COMPONENT ANALYSIS

BY

Md. Neha Tabassum    2016A8PS0444H    Electronics and Instrumentation
Vaishnavi Khariya    2016A8PS0416H    Electronics and Instrumentation

Prepared in partial fulfillment of the
Practice School-I Course

AT

CSIR-Central Electronics Engineering Research Institute, Chennai

A PRACTICE SCHOOL-I STATION OF

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
HYDERABAD CAMPUS

JULY, 2018

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE
PILANI (RAJASTHAN)
Practice School Division

Station: Central Electronics Engineering Research Institute    Centre: Chennai

Duration: 8 Weeks

Date of Start: 22-May-2018

Date of Submission: 9-July-2018

ID No./Name/Discipline of the student: 2016A8PS0444H, Md. Neha Tabassum,
Electronics & Instrumentation

Project Title: Extraction of chemical components from herbal spectra using
Independent Component Analysis

Name and Designation of the expert: Mr. C. Kumaravelu, Senior Principal Scientist

Name of the PS Faculty: Ms. Akshaya Ganesan

Keywords: Near Infrared Spectroscopy (NIRS), Independent Component Analysis (ICA),
Fourth Order Blind Identification (FOBI)

Project Areas: NIR spectra, ICA techniques

Abstract: The need to identify hidden components in an adulterated mixture and to
qualitatively assess various medicines has been growing day by day. With the
advent of machine learning and signal analytics, the task has become faster and
easier. The aim is to automate the process of identifying the hidden individual
components in the NIR spectra of a mixture. We propose an implementation of
Independent Component Analysis in MATLAB. The code is capable of finding the
independent components of a mixture. Subsequently, plots of the independent
components are obtained and spectral similarity techniques are applied to identify
the actual independent components. This can help in building an effective and
efficient on-line instrument for quality assessment in the herbal medicine
industry.

Signature of Student                    Signature of PS Faculty

Date: 9th JULY 2018                     Date:
ACKNOWLEDGEMENTS

As a part of my curriculum, I am doing my PS-1 program at the CENTRAL
ELECTRONICS ENGINEERING RESEARCH INSTITUTE, CHENNAI. I am
thankful to a number of people for their guidance.

I would like to express my appreciation and gratitude to my project
co-ordinator, Mr. C. Kumaravelu (Senior Principal Scientist), for helping
me throughout the project and investing his precious time and inputs in it.

I would like to express my gratitude to my PS-1 instructor, Ms. A. Ganesan,
especially for her encouragement and guidance.

I would also like to thank the current Scientist-In-Charge, Dr. A. Gopal, at CEERI
Chennai, CMC, Tharamani, for providing me this opportunity to work at such an
esteemed institute on a project for my PS-1.

There are many more people, such as the employees, support staff and others, who
ensured an enjoyable time at the offices and laboratories of CEERI, allowing
us the freedom and space to work without distractions. We would like to thank
them all. We regret our inability to mention the names of everyone.

I would also like to acknowledge the help and support shown towards me by
Mr. D. Palaniandi (Student Project Co-ordinator); the warmth he provided
helped us have an enjoyable time in a new city at such a prestigious institute.

Thank you.
TABLE OF CONTENTS

List of Abbreviations
About CEERI Chennai
1 INTRODUCTION
1.1 Near Infrared spectroscopy
1.1.1 Theory
1.1.2 Applications
1.2 Methods in NIR spectral analysis
1.2.1 Multiple Linear Regression (MLR)
1.2.2 Principal Component Regression
1.2.3 Partial Least Squares (PLS)
1.2.4 Fourier and Wavelet Analysis
1.2.5 ICA
1.3 Independent Component Analysis
1.3.1 Theory
1.3.2 Applications
2 SCOPE/AIM AND OBJECTIVES OF THE PROJECT
3 METHODOLOGY
3.1 Preprocessing the raw data
3.2 Understanding of ICA
3.2.1 Discussion with our scientist
3.2.2 Research Papers
3.2.3 Videos and other reliable sources
3.3 Design of Matlab Code
3.4 Obtaining the graphical data
3.4.1 Plots of the ICs and the estimated mixture data
3.4.2 Mathematical Spectral Similarity Measures
4 RESULTS AND DISCUSSION
4.1 Output Plots
4.1.1 Analysis of data sample1
4.1.2 Analysis of data sample2
5 CONCLUSIONS
6 RECOMMENDATION
7 REFERENCES
8 APPENDIX
8.1 APPENDIX A
8.2 APPENDIX B
8.3 APPENDIX C
8.4 APPENDIX D
8.5 APPENDIX E
LIST OF ABBREVIATIONS

1. CEERI Central Electronics Engineering Research Institute
2. ICA Independent Component Analysis
3. NIRS Near Infrared Spectroscopy
4. R&D Research and Development
5. BSS Blind Source Separation
6. MLR Multiple Linear Regression
7. PLS Partial Least Squares
8. MATLAB Matrix Laboratory
9. SVD Singular Value Decomposition
10. NIT Near Infrared Transmission
11. FOBI Fourth Order Blind Identification
12. ICs Independent Components
CEERI Chennai

CEERI Centre, Chennai, is a pioneering institution in developing indigenous
technologies for the automation of Indian process industries. As the regional centre
of the Central Electronics Engineering Research Institute, Pilani, the Centre had a
humble beginning in 1974 and was transformed into a full-fledged R&D Centre in
1979. The Centre has developed many novel automation systems using state-of-the-art
techniques for the pulp & paper, food, leather, chemical and plastic industries.
Later, the Centre ventured into the development of machine vision systems for
societal needs. The designs adopt the latest advancements in specialized areas such
as image processing, fiber-optic and near infrared instrumentation towards
better quality management and value addition to Indian industrial produce.

The Centre has a well equipped optics laboratory for the development of
electro-optical systems, a DSP and FPGA lab and a rapid prototyping environment
for the development of electronic systems hardware and firmware, and the latest
high resolution CCD cameras and workstations with advanced imaging and computing
software/hardware to develop machine vision technologies for on-line inspection
and grading. The Centre has excellent workshop, CAD and drafting facilities to
cater to the needs of the R&D activities.

The Centre has developed an image analysis based system, "Herbas", for the
authentication of certain herbal plant raw material in collaboration with IICT,
Hyderabad and Arya Vaidya Sala, Kottakkal. The system captures the image of the
cross section of the stem using a trinocular microscope with a CCD camera, and the
software offers several capabilities for the user to authenticate the plant for its
medicinal use. The Centre has also completed the project on "Development of
on-line mango sorting system using soft x-ray imaging" by developing a soft X-ray
imaging based technology for analyzing export variety mangoes for internal
disorders and rejecting the infected ones by using a suitable conveyor
arrangement. The project on "Paper dirt speck analyzer" has also been successfully
completed.

CEERI Centre, Chennai has ambitious plans to develop near-infrared
instrumentation, electro-optical and machine vision system technologies for
quality assessment/assurance, grading and sorting of a variety of
agricultural/horticultural products and processed foods, to help reduce wastage,
optimize energy usage, enhance productivity, add value and compete in global
markets.
1. INTRODUCTION

1.1 NIR spectroscopy

1.1.1 Theory

Near Infrared (NIR) spectroscopy is a type of vibrational spectroscopy that is
based on molecular overtone and combination vibrations. It employs photons with
energy (hν) in the range of 2.65 x 10^-19 to 7.96 x 10^-20 J, which corresponds
to the wavelength range of 750 to 2,500 nm (wavenumbers: 13,300 to 4,000 cm^-1),
as shown in Fig 1. The discovery of near-infrared energy is ascribed to William
Herschel in 1800, but the first industrial applications began in the 1950s.
NIR spectroscopy is the study of the interaction between a sample (e.g. cereals,
seeds, oils, finished products) and infrared light that has been dispersed into
individual wavelengths. One advantage is that NIR can typically penetrate much
farther into a sample than mid-infrared radiation. [1]
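
The correspondence between wavelength, photon energy and wavenumber quoted above can be checked with a few lines of code. The following is a minimal sketch in Python (not the MATLAB used elsewhere in this report); the function names are illustrative and the constants are the standard SI values.

```python
# Physical constants (exact SI values)
PLANCK_J_S = 6.62607015e-34      # Planck constant h, in J*s
LIGHT_SPEED_M_S = 2.99792458e8   # speed of light c, in m/s

def photon_energy_joule(wavelength_nm):
    """Photon energy E = h*c/lambda for a wavelength given in nanometres."""
    return PLANCK_J_S * LIGHT_SPEED_M_S / (wavelength_nm * 1e-9)

def wavenumber_per_cm(wavelength_nm):
    """Wavenumber in cm^-1: 1 cm = 1e7 nm, so nu = 1e7 / lambda_nm."""
    return 1e7 / wavelength_nm

# The two ends of the NIR range discussed in the text
for nm in (750, 2500):
    print(nm, "nm ->", f"{photon_energy_joule(nm):.3e} J,",
          f"{wavenumber_per_cm(nm):.0f} cm^-1")
```

The shorter wavelength carries the higher photon energy, which is why the energy range in the text is quoted from high to low while the wavelength range runs from low to high.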

Fig1. The NIR region extends from 750 to 2500 nm

1.1.2 Applications

Although near-infrared (NIR) spectroscopy is not a particularly sensitive
technique, it can be implemented with little or no sample preparation and thus is
well suited to applications such as process monitoring, materials science, and
medical uses. Some applications include physiological diagnostics and research,
including blood sugar, pulse oximetry, functional neuroimaging, sports medicine,
elite sports training, ergonomics, rehabilitation, neonatal research, brain computer
interface, urology (bladder contraction), and neurology (neurovascular coupling).
There are also applications in other areas such as pharmaceuticals, medical
diagnostics, atmospheric chemistry, and combustion research. [2] NIR spectroscopy
can also be used for measuring quality attributes of fruits and vegetables, as
shown in Fig 2.

Fig2. NIR reflectance spectra of some fruit

NIR spectra such as those in Fig 2 can be obtained using a spectrophotometer. An
NIR spectrophotometer consists of a light source (usually a tungsten halogen light
bulb), a sample presentation accessory, a monochromator, a detector, and optical
components such as lenses, collimators, beam splitters, integrating spheres and
optical fibers.
1.2 Methods used for NIR spectral analysis

1.2.1 Multiple Linear Regression (MLR)

According to Beer's Law,
    C = AB
where C = concentration, A = absorbance, B = matrix of coefficients.
So if C and A are known, B can be solved by multiple linear regression:
    B = (A^T A)^{-1} A^T C
The only requirement is to select the wavelengths that correspond to the
absorbances of the desired constituents.
MLR can be used to accurately build models for complex mixtures when only
some of the constituent concentrations are known. However, it has several
drawbacks. For instance, when too much information in the spectrum is used in
calibration, the model starts to include the spectral noise that is unique to the
training set, and the prediction for unknown samples deteriorates. [3]
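
To make the normal-equation solution B = (A^T A)^{-1} A^T C concrete, here is a minimal sketch in Python for the smallest useful case: one constituent and two selected wavelengths. The function name and the calibration numbers are made up for illustration; they are not data from this project.

```python
def mlr_fit_two_wavelengths(absorbances, concentrations):
    """Solve B = (A^T A)^-1 A^T C for an n x 2 matrix A and an n-vector C."""
    # Entries of the symmetric 2x2 matrix A^T A
    s11 = sum(a[0] * a[0] for a in absorbances)
    s12 = sum(a[0] * a[1] for a in absorbances)
    s22 = sum(a[1] * a[1] for a in absorbances)
    # Entries of the 2-vector A^T C
    t1 = sum(a[0] * c for a, c in zip(absorbances, concentrations))
    t2 = sum(a[1] * c for a, c in zip(absorbances, concentrations))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("A^T A is singular; pick less collinear wavelengths")
    # Closed-form inverse of a 2x2 matrix applied to A^T C
    b1 = (s22 * t1 - s12 * t2) / det
    b2 = (s11 * t2 - s12 * t1) / det
    return b1, b2

# Hypothetical calibration set: absorbances at two wavelengths per sample,
# with concentrations generated exactly as C = 2.0*A1 + 0.5*A2
A = [(0.10, 0.40), (0.20, 0.10), (0.35, 0.30), (0.50, 0.20)]
C = [2.0 * a1 + 0.5 * a2 for a1, a2 in A]
B = mlr_fit_two_wavelengths(A, C)
print(B)
```

Because the example data are noise-free, the fit recovers the generating coefficients; with real spectra the singular-matrix check matters, since neighbouring NIR wavelengths are strongly collinear.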

1.2.2 Principal Component Regression (PCR)

The PCR method combines principal component analysis (PCA) of the spectra with
an MLR to devise a quantitative model for complex samples.
PCA decomposes the original spectra matrix into several principal components,
    A = SF
where A = spectra matrix, S = scaling score matrix, F = principal component matrix.
Here the S matrix is assumed to be linearly correlated with the concentration
matrix C, so the next step is to regress C against the S matrix by the MLR model
    C = SB
which gives
    B = (S^T S)^{-1} S^T C
For unknown samples, A, B, and F are known and S is obtained from
    S = A F^T
and then the concentration matrix C can be obtained. As a modeling method, PCR
is somewhat less accurate than MLR when both are used within the calibration
range and when the model is indeed linear. However, it is often more reliable than
MLR if extrapolation is needed. [3]

1.2.3 Partial Least Squares (PLS) Regression

The purpose of PLS is to build a linear model between the concentration matrix
C and the spectral matrix A,
    C = AB
In contrast to the MLR and PCR methods, PLS produces a weight matrix W for A
such that
    S = AW
i.e., the columns of W are weight vectors for the columns of A, producing the
corresponding score matrix S.
These weights are computed so that each of them maximizes the covariance
between the concentration matrix C and the spectra matrix A. C can then be
decomposed as
    C = SF
where F = loading matrix for C. Once F is computed,
    C = AB    where B = WF
PLS adopts a single-step decomposition and regression technique and is therefore
faster than PCR, although it is more abstract and difficult to explain. However, it
uses the concentration information during the decomposition process, and this
causes the spectra containing higher constituent concentrations to be weighted
more heavily than those with low concentrations. [3]

1.2.4 Fourier and Wavelet Analysis

Fourier analysis transforms the spectral data in the wavelength domain into the
frequency domain, where the original spectra are approximated to a desired degree
of accuracy by sums of periodic sine and cosine functions of increasing frequency.
Wavelet analysis has recently been studied in NIR spectral data processing. It
proves to be more powerful than the Fourier transform in capturing the local
features of a spectrum because it can decompose a signal into components that are
well localized in both the time and frequency domains, while the Fourier transform
can only reflect the frequency information.
However, the methods reviewed above are mainly for calibration and prediction
purposes, and none is designed for the purpose of blind source spectra
separation. [3]
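
The Fourier approximation described in this subsection, a sum of sines and cosines of increasing frequency, can be sketched as follows. This is an illustrative Python example using a naive discrete Fourier transform on a synthetic peak; the function names and the test signal are assumptions made for the sketch.

```python
import cmath
import math

def dft(y):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(y)
    return [sum(y[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def truncated_reconstruction(y, harmonics):
    """Rebuild y from its mean plus the lowest `harmonics` sine/cosine pairs."""
    n = len(y)
    coeffs = dft(y)
    # Keep frequency k together with its mirror n - k so the result stays real
    keep = [k for k in range(n) if min(k, n - k) <= harmonics]
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in keep).real / n
            for t in range(n)]

def rms_error(y, approx):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(y, approx)) / len(y))

# Synthetic absorbance-like peak sampled at 32 points (illustrative only)
peak = [math.exp(-((t - 16) / 3.0) ** 2) for t in range(32)]
for h in (2, 8, 16):
    print(h, "harmonics, RMS error:",
          rms_error(peak, truncated_reconstruction(peak, h)))
```

Adding harmonics can only lower the RMS error, which is the sense in which the spectrum is approximated "to a desired degree of accuracy".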

1.2.5 Independent Component Analysis

In signal processing, independent component analysis (ICA) is a computational
method for separating a multivariate signal into additive subcomponents. This is
done by assuming that the subcomponents are non-Gaussian signals and that they
are statistically independent from each other. ICA finds the independent
components (also called factors, latent variables or sources) by maximizing the
statistical independence of the estimated components. We may choose one of many
ways to define a proxy for independence, and this choice governs the form of the
ICA algorithm. The two broadest definitions of independence for ICA are
1. Minimization of mutual information
2. Maximization of non-Gaussianity
ICA is important to blind signal separation and has many practical applications. It
is closely related to (or even a special case of) the search for a factorial code of the
data, i.e., a new vector-valued representation of each data vector such that it gets
uniquely encoded by the resulting code vector (loss-free coding), but the code
components are statistically independent. [3]

1.3 Independent Component Analysis

1.3.1 Theory

ICA belongs to a class of blind source separation (BSS) methods for separating
data into underlying informational components, where such data can take the form
of images, sounds, telecommunication channels, spectral signals or stock market
prices. It was developed in the 1990s. The term "blind" is intended to imply that
such methods can separate data into source signals even if very little is known
about the nature of those source signals. The goal of ICA is to solve BSS problems
which arise from a linear mixture.

The most basic example of ICA is the Cocktail Party problem, which is
illustrated in Fig 3.

Fig3. Separation of two independent source signals from the mixed signals
recorded by two microphones.

How Independent Component Analysis Works

ICA is based on the simple, generic and physically realistic assumption that if
different signals are from different physical processes (e.g., different people
speaking) then those signals are statistically independent. ICA takes advantage of
the fact that the implication of this assumption can be reversed, leading to a new
assumption which is logically unwarranted but which works in practice, namely: if
statistically independent signals can be extracted from signal mixtures then these
extracted signals must be from different physical processes (e.g., different people
speaking). Accordingly, ICA separates signal mixtures into statistically
independent signals. If the assumption of statistical independence is valid then each
of the signals extracted by independent component analysis will have been
generated by a different physical process, and will therefore be a desired signal.​[4]

1.3.2 Applications

ICA has been applied to problems in fields as diverse as speech processing, brain
imaging (e.g., fMRI and optical imaging), electrical brain signals (e.g., EEG
signals), telecommunications, and stock market prediction. However, because
independent component analysis is an evolving method which is being actively
researched around the world, the limits of what ICA may be good for have yet to be
fully explored. [13]
Some of the major applications of ICA are listed below:

1) Freshness measurement of eggs
2) EEG analysis: the EEG data of the brain is a mixture of hundreds of electrical
   signals, which can be separated into individual signals using ICA
3) Text mining: automatic analysis of large textual databases to find the
   topics of documents and group them accordingly
4) Detecting concentration changes of oxyhaemoglobin and deoxyhaemoglobin in
   the human brain
5) Removing artefacts and noise from recorded biosignals
2. Aim of the project

Our basic interest is to analyse the NIR spectra of any given mixture (as shown in
Fig 4) and to discover the various independent components (as shown in Fig 4.1
and Fig 4.2) as well as their concentrations by using the ICA technique in
MATLAB. This is a blind source problem. Hence we cannot rely on the usual
methods of MLR, PLS, Fourier and Wavelet transforms, which require
knowledge of the references beforehand. ICA, however, is an efficient technique
that can address this problem and is capable of removing noise more
effectively without damaging the pure signals.
After reconstructing various combinations of ICs and comparing them with the
actual mixture, we will be able to determine the real independent components along
with the adulterants, which is the primary purpose of this project.

Fig4. NIR spectra (absorbance vs wavelength) of a mixture (A+B)

With the help of this project we should be able to obtain the following 2 individual
spectra:

Fig 4.1 NIR spectra of component A        Fig 4.2 NIR spectra of component B
3. Methodology

The following sequence of steps was followed in order to find and authenticate
the independent components in any given herbal medicine:

1) Preprocessing the raw data
2) Understanding of ICA
3) Design of a MATLAB code
4) Obtaining the plots and correlation coefficients to draw conclusions

3.1 Preprocessing the raw data

The collected data samples must be baseline corrected, here with the standard
normal variate (SNV) transform, before they are fed to the code. Baseline
correction is done to remove the "background noise" (unnecessary peaks); it also
makes the data graphs easier to work with. SNV is an effective pretreatment method
for baseline correction of diffuse reflection NIR spectra of powder and granular
samples. However, its baseline correction performance depends on the NIR region
used for the SNV calculation. Moreover, the SNV approach effectively removes the
multiplicative interferences of scatter and particle size. [5] Fig 5.1 and Fig 5.2
show the NIR spectra of a sample before and after baseline and SNV correction
respectively.

Fig5.1 NIR spectra of a sample        Fig5.2 NIR spectra of corrected data

We did not implement these corrections ourselves; we used a software package
called 'UNSCRAMBLER', which automatically returns the corrected data once the
unprocessed data is fed to it.
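
For reference, the SNV transform mentioned above can be sketched in a few lines. This is a minimal Python illustration of the standard definition (each spectrum is centred on its own mean and divided by its own standard deviation); it is not the Unscrambler implementation, and the sample values are made up.

```python
import math

def snv(spectrum):
    """Standard normal variate: centre one spectrum and scale it to unit std."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    # Sample standard deviation (n - 1 in the denominator)
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    if std == 0:
        raise ValueError("flat spectrum: SNV is undefined")
    return [(v - mean) / std for v in spectrum]

# Hypothetical absorbances, and the same spectrum with an additive offset
# and a multiplicative scatter factor applied
raw = [0.20, 0.25, 0.40, 0.30, 0.22]
scattered = [1.5 * v + 0.1 for v in raw]

print(snv(raw))
print(snv(scattered))   # matches snv(raw) up to rounding
```

The two printed spectra coincide, which illustrates why SNV removes offset and multiplicative scatter effects between samples of the same material.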

Sometimes, to increase the accuracy of the method, the second order derivative of
the raw data (Fig 6.1) can be used, as it makes the peaks sharper and the
information easier to extract, as shown in Fig 6.2. Differentiation also removes
constant and linear baseline terms, but it amplifies high-frequency noise, so
derivative spectra are normally computed together with smoothing.

Fig6.1 NIR spectra of banana residues [6]
Fig6.2 Second derivative spectra of the banana residues [6]
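
A second derivative spectrum can be approximated by finite differences. The sketch below, in Python, uses the plain three-point central difference without smoothing; the function name and sample values are illustrative, and practical derivative spectra are usually computed with a smoothing filter as well.

```python
def second_derivative(y, step=1.0):
    """Central-difference second derivative; the two end points are dropped."""
    return [(y[i - 1] - 2.0 * y[i] + y[i + 1]) / (step * step)
            for i in range(1, len(y) - 1)]

# A made-up spectrum, and the same spectrum on a sloping linear baseline
spectrum = [0.10, 0.12, 0.30, 0.80, 0.30, 0.12, 0.10]
with_baseline = [v + 0.05 * i + 0.2 for i, v in enumerate(spectrum)]

print(second_derivative(spectrum))
print(second_derivative(with_baseline))   # same values up to rounding
```

The two outputs agree because a constant offset and a linear slope both vanish under the second difference, leaving only the curvature of the peak.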

3.2 Understanding of ICA

Our understanding of ICA, how it is performed, its applications and the various
methods employed was built up in three major steps.

3.2.1 Discussions with our scientist

Initially, a brief description of the ongoing research was given by the scientist on
the day of orientation. Different MATLAB codes had to be written in order to
analyse the NIR spectra. All this was explained with the help of an example on
the spectral analysis of a mixture of water, starch, and protein.
In practical applications the constituent components of mixtures may not
be known, and the ICA approach is supposed to identify these individual
components through blind separation of their spectra. In the example here,
mixtures of known components are used to illustrate and test the proposed method.

Independent Component Analysis is a technique which can be applied to any
mixture data to obtain its independent components. Here, ICA is applied to a set of
data containing 10 samples of a starch, protein and water mixture, as shown in
Fig 7.2. The ICs shown in Fig 7.3 can be obtained using the code given in Appendix
B; they can be compared with the references shown in Fig 7.1.

Fig7.1 The reference NIR spectra of starch, protein and water (the peaks
signify the response of a particular bond at a particular wavelength) [3]

Fig7.2 NIR spectra of the mixture [3]
Fig7.3 NIR spectra of the ICs obtained by ICA [3]

In another example, different sugar samples containing varied amounts of sugar and
water were prepared. The NIR spectra of each sample were examined and
independent component analysis was performed on them. After obtaining the spectra
of the ICs and comparing them with the reference spectra of sugar and water, it was
much easier to assess the C-H, O-H, C-C, C=C and C=O bonds in the sample.

3.2.2 Research Papers

1) A Tutorial on Independent Component Analysis [7]

This paper was a major help in understanding the ICA algorithm. It has a detailed
description of both the mathematical and the physical aspects of ICA.
The paper is divided into subparts:

a) Examples on Linear Mixed Signals - This section explains, with the help of
examples, the various types of linear mixtures which can be obtained during NIR
spectroscopy.
b) Setup - This part deals with the two main equations involved in ICA and also
explains the assumptions that are to be made.
    X = A*S
    S = W*X
where X = mixture, A = mixing matrix, W = unmixing matrix (W = A^{-1}),
S = individual components.
c) A strategy for solving ICA - This explains the entire logic behind ICA, which
includes SVD, covariance, the whitening filter and statistical independence.
With the help of SVD we can find the mixing matrix A if X is known; its
graphical representation is shown in Fig 8.
    A = U Σ V^T
    W = A^{-1} = V Σ^{-1} U^T    (where U^{-1} = U^T because U, V are rotation matrices)

Fig8. Graphical depiction of the SVD of the invertible matrix A

We also find the covariance and the eigenvectors of the given data. Assuming the
sources are uncorrelated with unit variance, <ss^T> = I, so that
    Covariance of X = <xx^T> = U Σ^2 U^T
    Covariance of X = <xx^T> = E D E^T
where the columns of E are the eigenvectors of cov(X) and D is a diagonal matrix
whose diagonal elements are the eigenvalues.

Comparing the above two equations gives U = E and Σ = D^{1/2}, hence
    W = V D^{-1/2} E^T
Now we whiten the data given to us,
    X_w = (D^{-1/2} E^T) X
which gives
    S = V X_w    (the only unknown is V)

We can find V using the concept of statistical independence, i.e., by minimising
the multi-information function (a solution using information theory).

This research paper helped us in dividing ICA into three simple steps, as follows:

i) Subtract off the mean of the data in each dimension
ii) Whiten the data by calculating the eigenvectors of the covariance of the data
iii) Identify the final rotation matrix that optimizes statistical independence
2) A New Approach to Near-Infrared Spectral Data Analysis Using ICA

This research paper has several case studies which make ICA of NIR spectra much
easier to understand.
One of them is about the mixture of starch, water and protein discussed earlier in
section 3.2.1.
Another case study is about a moisture, protein and fat mixture whose mixed
spectra are shown in Fig 9.1. The data was recorded on a Tecator Infratec Food and
Feed Analyzer working in the wavelength range 850-1050 nm by the near-infrared
transmission (NIT) principle. [3] The ICs of the mixture are displayed in Fig 9.2.

Fig9.1 The spectra of 10 samples
Fig9.2 The spectra of the 3 ICs - moisture, protein and fat

This paper also includes a case study on derivative spectroscopy, i.e. performing
ICA on second order derivative data for the same mixture of starch, water and
protein. In this case, the second order derivative references are obtained as
shown in Fig 10.1. Mixture data (Fig 10.2) is taken and ICA is performed on it to
obtain the three independent components in Fig 10.3.

Fig10.1 The second derivative spectra of water, starch and protein
Fig10.2 The second derivative spectra of 10 samples
Fig10.3 The separated ICs from Fig10.2
3) The Anomaly Mixed Spectrum Signals Detection based on ICA and
KNN (Jin Lu, Huang Ming, Jingjing Yang) [8]

This paper deals with separating radio interference signals using time-domain
ICA. It also proposes an anomaly detection method based on ICA and KNN.

The mixing process is the same as in ICA, but FastICA is used for the
unmixing. The detection flowchart for the process is shown in Fig 11.

Fig11. Flowchart describing the whole process [8]

    X = A[S(t) + N(t)]
where A = mixing matrix, S = original signal, X = observed signal.

The unmixing system is based on the FastICA algorithm with an information
criterion. This algorithm consists of two processes, namely albefaction (i.e.
whitening) and orthogonal separation.

Albefaction:
    Y = V*X
where V = linear transformation matrix.

Let C_X be the covariance matrix of X. We can calculate V using the following
formula:
    V = (E_value)^{-0.5} (E_vector)^T
where E_value = eigenvalues of C_X, E_vector = eigenvectors of C_X, and
(E_vector)^T = transpose of E_vector.

Orthogonal Separation:
    Z = W*Y

For calculating W we use the FastICA algorithm with the maximum entropy
criterion. Calculating W is an iterative process, so the iteration needs to be
kept under control. Hence the following steps are carried out.
Denoting each dimension of W by W_p, we first normalise W_p:
    W_p = W_p / ||W_p||
In order to limit the number of iterations of W_p, a parameter Maxcount is set to
control the iterative process.
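
The normalisation W_p = W_p / ||W_p|| and the Maxcount limit fit into the iteration roughly as sketched below. This Python example is an assumption-laden illustration rather than the cited paper's code: it uses the widely used one-unit FastICA fixed-point update with a tanh nonlinearity, works on two-dimensional data that is already (approximately) whitened, and all names and test signals are made up.

```python
import math
import random

def fastica_one_unit(z1, z2, max_count=200, tol=1e-9):
    """One-unit FastICA on whitened 2-D data with the tanh nonlinearity."""
    n = len(z1)
    w = [1.0, 0.5]                          # arbitrary starting vector
    norm = math.hypot(w[0], w[1])
    w = [w[0] / norm, w[1] / norm]          # W_p = W_p / ||W_p||
    for _ in range(max_count):              # Maxcount caps the iterations
        g0 = g1 = dg = 0.0
        for u, v in zip(z1, z2):
            t = math.tanh(w[0] * u + w[1] * v)
            g0 += u * t
            g1 += v * t
            dg += 1.0 - t * t               # derivative of tanh
        # Fixed-point update: w+ = E[z g(w.z)] - E[g'(w.z)] w
        new = [g0 / n - (dg / n) * w[0], g1 / n - (dg / n) * w[1]]
        norm = math.hypot(new[0], new[1])
        new = [new[0] / norm, new[1] / norm]
        # Converged when the direction stops changing (sign flips are allowed)
        converged = abs(abs(new[0] * w[0] + new[1] * w[1]) - 1.0) < tol
        w = new
        if converged:
            break
    return w

def abs_corr(p, q):
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    num = sum((u - mp) * (v - mq) for u, v in zip(p, q))
    den = math.sqrt(sum((u - mp) ** 2 for u in p) * sum((v - mq) ** 2 for v in q))
    return abs(num / den)

# Unit-variance synthetic sources: one uniform, one Laplace-distributed
random.seed(1)
n = 4000
s1 = [random.uniform(-math.sqrt(3.0), math.sqrt(3.0)) for _ in range(n)]
s2 = [random.choice((-1.0, 1.0)) * random.expovariate(math.sqrt(2.0))
      for _ in range(n)]
# An orthogonal (rotation) mixing keeps the data approximately white
c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
z1 = [c * u + s * v for u, v in zip(s1, s2)]
z2 = [-s * u + c * v for u, v in zip(s1, s2)]

w = fastica_one_unit(z1, z2)
y = [w[0] * u + w[1] * v for u, v in zip(z1, z2)]
print(w, abs_corr(y, s1), abs_corr(y, s2))
```

The extracted signal should match one of the two sources closely; which one is found depends on the starting vector. Further components would be obtained by repeating the iteration while keeping each new W_p orthogonal to those already found.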

3.2.3 Videos and other reliable sources

1) For a better understanding of machine learning we watched the online lectures
of Andrew Ng, a professor at Stanford University.
● Machine Learning [9] - This is a collection of 20 lectures which helped us in
understanding other machine learning algorithms, including ICA and PCA.

2) We also watched videos about how to input data for ICA in MATLAB.
● Input Data for ICA [10] - This video helped us in removing errors from our
code and plotting the data.

3) We also read through a few articles on the internet about NIR, BSS, ICA, PCA etc.
3.3 Design of a MATLAB code

After having understood how ICA works, we were able to build upon an existing
MATLAB code. The basic code had to be modified in order to suit the data that we
were working on. The designed code uses the FOBI algorithm (in Appendix B),
which considers not only the second order correlations but also the higher order
statistics.
The ICA code (Appendix B) includes the following steps:
i) normalising (mean-centering) the preprocessed data
ii) finding the eigenvalues and eigenvectors of the mixture samples
iii) finding the mixing and unmixing matrices
iv) checking the data for kurtosis, linearity and non-Gaussianity
Finally, the independent components are found using the formulae for ICA.
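
The idea behind FOBI can be sketched for two mixtures as follows. This Python example is not the Appendix B code: it is a minimal illustration of the textbook FOBI recipe (whiten with the covariance eigenvectors, then rotate with the eigenvectors of the fourth-order matrix E[|z|^2 z z^T]), with made-up sources, mixing coefficients and function names. FOBI requires the sources to have different kurtosis, which is why the two synthetic sources are drawn from different distributions.

```python
import math
import random

def center(row):
    m = sum(row) / len(row)
    return [v - m for v in row]

def sym_eig_2x2(a, b, d):
    """Eigen-decomposition of [[a, b], [b, d]]: returns (c, s, lam1, lam2),
    with eigenvectors (c, s) and (-s, c)."""
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    lam1 = a * c * c + 2.0 * b * c * s + d * s * s
    lam2 = a * s * s - 2.0 * b * c * s + d * c * c
    return c, s, lam1, lam2

def project(x1, x2, c, s, scale1=1.0, scale2=1.0):
    """Project onto (c, s) and (-s, c), with optional per-axis scaling."""
    return ([(c * u + s * v) * scale1 for u, v in zip(x1, x2)],
            [(-s * u + c * v) * scale2 for u, v in zip(x1, x2)])

def fobi_2d(x1, x2):
    x1, x2 = center(x1), center(x2)
    n = len(x1)
    # Second-order step: whiten using the covariance eigen-decomposition
    a = sum(u * u for u in x1) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    d = sum(v * v for v in x2) / n
    c, s, lam1, lam2 = sym_eig_2x2(a, b, d)
    z1, z2 = project(x1, x2, c, s, 1.0 / math.sqrt(lam1), 1.0 / math.sqrt(lam2))
    # Fourth-order step: eigenvectors of E[|z|^2 z z^T] give the final rotation
    r2 = [u * u + v * v for u, v in zip(z1, z2)]
    a4 = sum(r * u * u for r, u in zip(r2, z1)) / n
    b4 = sum(r * u * v for r, u, v in zip(r2, z1, z2)) / n
    d4 = sum(r * v * v for r, v in zip(r2, z2)) / n
    c4, s4, _, _ = sym_eig_2x2(a4, b4, d4)
    return project(z1, z2, c4, s4)

def abs_corr(p, q):
    p, q = center(p), center(q)
    num = sum(u * v for u, v in zip(p, q))
    return abs(num) / math.sqrt(sum(u * u for u in p) * sum(v * v for v in q))

# Synthetic sources with different kurtosis: uniform and Laplace
random.seed(2)
n = 5000
s1 = [random.uniform(-1.0, 1.0) for _ in range(n)]
s2 = [random.choice((-1.0, 1.0)) * random.expovariate(1.0) for _ in range(n)]
x1 = [0.7 * u + 0.4 * v for u, v in zip(s1, s2)]
x2 = [0.2 * u + 0.9 * v for u, v in zip(s1, s2)]
y1, y2 = fobi_2d(x1, x2)
print(abs_corr(y1, s1), abs_corr(y1, s2), abs_corr(y2, s1), abs_corr(y2, s2))
```

Unlike iterative methods, FOBI needs only two eigen-decompositions, which makes it fast; the price is that sources with equal kurtosis cannot be told apart.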

3.4 Obtaining the graphical data and correlation coefficients to
draw conclusions

The number of ICs obtained equals the number of mixture samples, so the next task
is to discern which ICs are actually present in the mixture. Two techniques are
used to do this:

1) Obtaining the plots and comparing the graphical data
2) Mathematical spectral similarity measures

3.4.1 Plots of the ICs and the estimated mixture data

The plots of the ICs and the estimated mixture data are obtained using the MATLAB
code given in Appendix C. The code calls the ICA function and plots the graphs
using the subplot command in MATLAB.
After manual examination of the plots, the true ICs are selected and recombined
in order to obtain the 'estimated mixture'. If the references are not known, all
possible combinations of ICs are tried, giving a set of candidate estimated
mixtures. These estimated mixtures are then plotted along with the
actual mixtures to identify the final ICs.
3.4.2 Mathematical Spectral Similarity Measures

The second method employed is the use of spectral similarity measures. Some of the
common methods are listed below.

i) Spectral Angle Mapper - Spectral angle mapping (SAM) is a common, powerful
spectral recognition technique in which an unknown spectrum is compared to a
reference spectrum. The spectra are treated as vectors whose dimensionality equals
the number of bands, n. [11]
The cosine of the angle between two vectors a and b is given by
    cos(θ) = a·b / (||a|| ||b||)
The result lies between -1 and +1:
    cos(θ) = +1 means parallel vectors pointing in the same direction
           = -1 means parallel vectors pointing in opposite directions
           =  0 means the vectors are perpendicular

ii) Euclidean Distance - The Euclidean distance has an intuitive appeal as it is
commonly used to evaluate the proximity of objects in 2D/3D space.

iii) Dynamic Time Warping - Dynamic Time Warping is an algorithm initially
used for speech recognition. It aims at aligning two sequences of feature vectors by
warping the time axis iteratively until an optimal match (according to a suitable
metric) between the two sequences is found. [12]

iv) Pearson Correlation Coefficient - Correlation coefficients are used in statistics
to measure how strong a relationship is between two variables. Pearson's
correlation (also called Pearson's R) is a correlation coefficient commonly used in
linear regression. When conducting a statistical test between two variables, it is a
good idea to consider the Pearson correlation coefficient value to determine the
strength of the relationship between the two variables. The standard formula for
the correlation coefficient between x and y, with means x̄ and ȳ, is
    r = Σ (x_i - x̄)(y_i - ȳ) / sqrt( Σ (x_i - x̄)^2 · Σ (y_i - ȳ)^2 )
Fig12. Graphical interpretation of the correlation coefficient

What does a negative correlation coefficient mean?

For any two variables P and Q, a negative correlation coefficient means that an
increase in P is associated with a decrease in Q. The magnitude of a negative
correlation coefficient measures the strength of the relationship in the same way
as that of a positive coefficient; only the direction differs. Fig 12 depicts
the graphical representation of the Pearson correlation coefficient.

Of the methods mentioned above, we finally used the Pearson Correlation
Coefficient for measuring the spectral similarities.

We wrote a MATLAB code (given in Appendix D) for finding the correlation
coefficient between any two variables using the standard Pearson formula.
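
For comparison, the four similarity measures listed in this section can be sketched in a few lines each. The Python functions below are illustrative and are not the Appendix D code; the function names and the small test spectra are made up.

```python
import math

def spectral_angle_cosine(a, b):
    """cos(theta) = a.b / (||a|| ||b||) for two spectra treated as vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.sqrt(sum(u * u for u in a)) *
                  math.sqrt(sum(v * v for v in b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def dtw_distance(a, b):
    """Classic dynamic-programming DTW with |a_i - b_j| as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

reference = [0.1, 0.4, 0.9, 0.4, 0.1]
scaled = [2.0 * v for v in reference]        # same shape, twice the amplitude
shifted = [0.1, 0.1, 0.4, 0.9, 0.4, 0.1]     # same peak, one sample later

print(spectral_angle_cosine(reference, scaled))
print(pearson_r(reference, scaled))
print(euclidean_distance(reference, scaled))
print(dtw_distance(reference, shifted))
```

The spectral angle and the Pearson coefficient ignore the amplitude difference between `reference` and `scaled`, whereas the Euclidean distance does not. This scale invariance is relevant for ICA output, since recovered components have arbitrary scale and sign.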

28
4.Results and Discussions

4.1 ​Outputs of the MATLAB code ​( mentioned in Appendix C)

4.1.1 ​Input Data 1


This mixture data is for a drug named “TRIPHALA”.
Triphala is an ayurvedic herbal formulation consisting of three main ingredients
- Amalaki (Emblica officinalis), Bibhitaki (Terminalia belerica) and Haritaki
(Terminalia chebula). The spectral data for these three ingredients is taken as the
reference.[14]
The data given below is fed to the code in Appendix B to perform ICA. Each row
holds the absorption of one mixture at different wavelengths. For
instance, row 1 is mixture 1 and each column in row 1 is the absorption value of
that mixture at one wavelength. The dimensions of the data are 6x2201.

Fig13.1 ​(6x2201)tp_mix

OUTPUT PLOTS-
Giving the above mixture data as input to independent component analysis
and running the MATLAB code produces the following plots.

Fig 13.2 ​Plot of mixture(mix(1)-mix(6)) and estimated mixture data(x(1)-x(6))

Fig 13.3 ​Plot of the references(sref(1)-sref(3)) and IC’s(s(1)-s(6))

Interpretation​-

In Fig 13.2 above, the upper six graphs are the spectra of the mixtures. We notice
that they are almost identical apart from an upward/downward shift, because they are
all mixtures of the same independent components and differ only in the
packing fraction and density of each layer. The absorption ranges from 0 to 1 and
the wavelength from 0 to 2500.
The lower six graphs are the spectra of the estimated mixtures. The wavelength
range is the same for both, but the amplitude varies slightly. In the last 2 plots there is
a large variation in magnitude due to noise interference. The comparison of the
actual and the estimated mixtures is carried out in the remaining part of the
project.

In Fig 13.3, the upper 3 plots are the reference spectra of the independent
components. In this case the references are known beforehand. Applying
ICA to the mixture data yields the IC’s, which are plotted in the lower
part (the last 6 plots) of Fig 13.3.

4.1.2 ​Input Data 2

Finally, we worked on a herbal medicine called “TRIKATU”.

Trikatu is a classic ayurvedic herbal mixture of Black pepper, Long pepper and
Ginger. Two of the main ingredients, black pepper and long pepper, share many
chemical characteristics, hence their NIR spectra look quite similar.[15]

Problem Statement - Five market samples of the same drug, Trikatu, are provided
along with the three references. Determine the most authentic market sample and
the chemical composition of the independent components.

The data consists of 4 market samples - IMP, AAV, ZIG and AAH - among which IMP is
considered to be the market leader. Therefore, code has been written (given in
Appendix E) to find the IC’s of IMP, validate them against the references, reconstruct
the mixture as a second validation step, and finally use the selected IC’s as
references for the remaining market samples. These new references were then used to
assess the quality of the remaining market samples.

1)​IMP​ -

Mixture data of IMP​ -

This is the mixture data that is given as input to perform ICA. Each row
indicates the absorption of a specific mixture at different wavelengths. For
instance, row 1 is mixture 1 and each column in row 1 is the absorption value of
that mixture at that wavelength. The dimensions of the data are 5x2201.

Fig14.1​ (5x2201)imp

Fig14.2 ​Plot of IMP mixture data

Independent Component(s) of IMP​ -

Fig14.3 ​(5x2201)s_imp

Fig14.4 ​Plot of IC’s of IMP

Correlation Coefficients with three references​ -

Fig14.5 ​(5x3)r_imp (Rows-IC,Columns-References)

Based upon the values of the Correlation Coefficient, we have selected IC2, IC3 and
IC4 as the three Independent Components (IC’s).

​Fig14.6 ​Plot of the three IC’s with corresponding references

After reconstruction using IC2,IC3,IC4 we get the following data -

Fig14.7 ​(5x2201)x_imp

Fig14.8 ​Plot of estimated mixture data of IMP

Correlation Coefficient of x_imp with imp​ -

Fig14.9 ​(5x5) r1_imp(Rows- estimated mixture,Columns-original mixture)

Discussion: After obtaining five IC’s by performing ICA on the IMP sample, we
computed the correlation coefficients between the IC’s and the references (Fig
14.5). From these values, the three IC’s that most closely match the
references (CC nearing 1) were selected. In order to validate these three IC’s, we
reconstructed the mixture from them and compared it with the actual IMP mixture,
which resulted in another matrix of correlation coefficients, shown in Fig 14.9. As
these correlation coefficients are nearly equal to 1, we have taken IC2, IC3 and IC4
as the references for the rest of the market samples. The new references
for the other market samples are therefore IC2 (R1), IC4 (R2) and IC3 (R3).
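The reconstruction-and-compare step described above can be sketched as follows. This is a minimal illustration under stated assumptions: `S` and `A` are random stand-ins for the IC matrix and the mixing matrix `inv(W)`, the index vector `keep` is hypothetical, and each reconstructed mixture is compared only with the matching original row (the report tabulates all pairs).

```matlab
% Sketch: rebuild mixtures from a chosen subset of ICs (synthetic stand-ins).
rng(1);                            % fixed seed so the run is repeatable
n = 200;                           % number of wavelength points (hypothetical)
S = randn(5, n);                   % stand-in for five ICs, one per row
A = randn(5, 5);                   % stand-in for the mixing matrix inv(W)
X = A * S;                         % mixtures implied by all five ICs

keep = [2 3 4];                    % indices of the retained ICs (hypothetical)
X_hat = A(:, keep) * S(keep, :);   % reconstruction from those ICs only

% Absolute Pearson correlation of each reconstructed row with its original.
r = zeros(5, 1);
for i = 1:5
    cc = corrcoef(X_hat(i, :), X(i, :));
    r(i) = abs(cc(1, 2));
end
disp(r');

% Sanity check: keeping every IC reproduces the mixtures exactly.
X_full = A(:, 1:5) * S(1:5, :);
```

Only the columns of the mixing matrix that belong to the retained IC's are used, which mirrors the `a1(:,j)=a(:,ir)` bookkeeping in Appendix E. Values of `r` close to 1 suggest the discarded IC's contributed little to that mixture.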

Fig14.10 ​(3x2201)s1_imp ( new references)

​Fig14.11​ Plot of the new references

2)​AAV​ -

Mixture data of AAV ​ -

This is the mixture data that is given as input to perform ICA. Each row
indicates the absorption of a specific mixture at different wavelengths. For
instance, row 1 is mixture 1 and each column in row 1 is the absorption value of
that mixture at that wavelength. The dimensions of the data are 5x2201.

Fig15.1 ​(5x2201) test_aav

Fig15.2 ​Plot of AAV mixture data

Independent Component(s) of AAV ​-

Fig15.3 ​(5x2201)s_aav

Fig15.4​ Plot of IC’s of AAV

Correlation Coefficient of above IC’s with new reference​ -

Fig15.5 ​(5x3)r_aav (Rows-IC,Columns-References)

Based upon the values of Correlation Coefficient we have chosen IC4, IC1,IC5 as
the Independent Components of AAV.

Fig15.6 ​Plot of the three IC’s with the references

After reconstruction using IC1,IC4,IC5 we get the following data -

Fig15.7 ​(5x2201) x_aav

Fig15.8 ​Plot of the estimated mixture data of AAV


Correlation Coefficient of x_aav with imp ​-

Fig15.9 ​(5x5)r1_aav(Rows- estimated mixture,Columns-original mixture)

Discussion: From Fig 15.5, we have chosen three IC’s, namely IC4, IC1 and IC5,
which give the highest correlation coefficient values with R1, R2 and R3
respectively. The mixture is reconstructed from these three IC’s to obtain the
estimated mixture (Fig 15.7), which is compared with the IMP mixture (Fig 14.1) to
obtain a matrix of correlation coefficients (Fig 15.9). From Fig 15.9, we can infer
that AAV is related to IMP with an average strength of around 0.76.
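The report does not state how the single "average strength" figure is derived from the 5x5 correlation matrix. One plausible reading, assumed here, is the mean of all its entries. The matrix `R` below is made up purely to show the arithmetic and does not reproduce any value from the figures.

```matlab
% Assumed definition: average strength = mean of all entries of the
% (estimated mixture) x (original mixture) correlation matrix.
% R is an invented example, NOT data from the report.
R = 0.8 * ones(5) + 0.1 * eye(5);   % 0.9 on the diagonal, 0.8 elsewhere
avg_strength = mean(R(:));          % mean over all 25 entries
disp(avg_strength);
```

Under this assumption every pairing of an estimated row with an original row carries equal weight; averaging only the diagonal would be an equally defensible choice and would generally give a different number.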

3) ​ZIG​ -

Mixture data of ZIG ​-

This is the mixture data that is given as input to perform ICA. Each row
indicates the absorption of a specific mixture at different wavelengths. For
instance, row 1 is mixture 1 and each column in row 1 is the absorption value of
that mixture at that wavelength. The dimensions of the data are 5x2201.

Fig16.1​ (5x2201) test_zig

Fig16.2 ​Plot of ZIG mixture data

Independent Component(s) of ZIG​ -

Fig16.3​ (5x2201)s_zig

Fig16.4 ​Plot of IC’s of ZIG

Correlation Coefficient of above IC’s with new reference ​-

Fig16.5​ (5x3)r_zig(Rows-IC,Columns-References)

Based upon the values of Correlation Coefficient we have chosen IC3, IC2,IC4 as
the Independent Components of ZIG.

Fig16.6​ Plot of the three IC’s with the references

After reconstruction using the three chosen IC’s we get the following data -

Fig16.7​ (5x2201) x_zig

Fig16.8​ Plot of the estimated mixture data of ZIG

Correlation Coefficient of x_zig with imp​ -

Fig16.9​ (5x5)r1_zig(Rows- estimated mixture,Columns-original mixture)

Discussion: From Fig 16.5, we have chosen three IC’s, namely IC3, IC1 and IC4,
which give the highest correlation coefficient values with R1, R2 and R3
respectively. The mixture is reconstructed from these three IC’s to obtain the
estimated mixture (Fig 16.7), which is compared with the IMP mixture (Fig 14.1) to
obtain a matrix of correlation coefficients (Fig 16.9). From Fig 16.9, we can infer
that ZIG is related to IMP with an average strength of around 0.91.

4) ​AAH​ -

Mixture data of AAH ​ -

This is the mixture data that is given as input to perform ICA. Each row
indicates the absorption of a specific mixture at different wavelengths. For
instance, row 1 is mixture 1 and each column in row 1 is the absorption value of
that mixture at that wavelength. The dimensions of the data are 5x2201.

Fig17.1 ​(5x2201) test_aah

Fig17.2 ​Plot of AAH mixture data

Independent Component(s) of AAH​-

Fig17.3 ​(5x2201)s_aah

Fig17.4​ Plot of IC’s of AAH

Correlation Coefficient of above IC’s with new reference​ -

Fig17.5 (5x3)r_aah(Rows-IC,Columns-References)

Based upon the values of Correlation Coefficient we have chosen IC4, IC1,IC3 as
the Independent Components of AAH.

Fig17.6 ​Plot of the three IC’s with the references

After reconstruction using IC4,IC1,IC3 we get the following data -

Fig17.7 ​(5x2201) x_aah

Fig17.8 Plot of the estimated mixture data of AAH


Correlation Coefficient of x_aah with imp ​-

Fig17.9 ​(5x5)r1_aah(Rows- estimated mixture,Columns-original mixture)

Discussion: From Fig 17.5, we have chosen three IC’s, namely IC4, IC1 and IC3,
which give the highest correlation coefficient values with R1, R2 and R3
respectively. The mixture is reconstructed from these three IC’s to obtain the
estimated mixture (Fig 17.7), which is compared with the IMP mixture (Fig 14.1) to
obtain a matrix of correlation coefficients (Fig 17.9). From Fig 17.9, we can infer
that AAH is related to IMP with an average strength of around 0.99.

5.CONCLUSION

The objective was to extract and study the independent components in a herbal
sample by examining its NIR spectra. In this project, we have developed code that is
capable of performing ICA on NIR spectra obtained from a
spectrometer. This code is based on the FOBI algorithm, which processes the sample in
order to obtain statistically independent components. The approach can be extended to
other drugs with similar properties. The independent components are validated
by reconstructing the mixture from various combinations of IC’s, using the
Pearson Correlation Coefficient as the similarity measure.
After authenticating the IC’s, the ingredients in each of them are examined
and sent for verification by the chemist.

6.RECOMMENDATIONS

The use of ICA has been of great help in identifying the independent components in a
herbal mixture. However, when the number of independent components is greater
than the number of mixture samples, ICA is able to identify only the dominant IC’s,
which have higher magnitude. The use of other techniques such as PCA might address
this problem.

The Pearson correlation coefficient has been used for comparing the spectra, but it
sometimes gives ambiguous results, such as low or nearly equal correlation
coefficient values.

7.REFERENCES

1. Near Infrared Spectroscopy: fundamentals, practical aspects and analytical
   applications
2. Advancing the Application of NIR Spectroscopy
3. A New Approach to Near-Infrared Spectral Data Analysis Using
   Independent Component Analysis
4. Independent Component Analysis: Algorithms and Applications
5. Baseline Correction of Diffuse Reflection Near-Infrared Spectra
6. Determination of Cellulose Crystallinity of Banana Residues Using Near
   Infrared Spectroscopy and Multivariate Analysis
7. A Tutorial on Independent Component Analysis
8. The Anomaly Mixed Spectrum Signals Detection Based on ICA and KNN,
   Jin Lu, Huang Ming, Jingjing Yang
9. Machine Learning
10. Input Data to ICA
11. Comparison of Principal Component Analysis and Spectral Angle Mapping
    for Identification of Materials in Terahertz Transmission Measurements
12. DTW Algorithm
13. ICA-Applications
14. Triphala
15. Trikatu

8.APPENDIX
8.1 ​APPENDIX A

MODEL FOR ICA

8.2 ​APPENDIX B

MATLAB CODE FOR ICA


function [W, S] = ica(X)
% ICA Perform independent component analysis.
%   [W, S] = ica(X);
%   where X = AS and WA = eye(d)
%   and [d,n] = size(X) (d dims, n samples).
%   Implements the FOBI algorithm.
[d, n] = size(X);
% Subtract off the mean of each dimension.
X = X - repmat(mean(X,2),1,n);
% Calculate the whitening filter.
[E, D] = eig(cov(X'));
% Whiten the data.
X_w = sqrtm(pinv(D))*E'*X;
% Calculate the rotation that aligns with the directions which maximize
% fourth-order correlations.
[V,s,u] = svd((repmat(sum(X_w.*X_w,1),d,1).*X_w)*X_w');
% Compute the inverse of A.
W = V * sqrtm(pinv(D)) * E';
% Recover the original sources.
S = W * X;
end
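The whitening step of the listing above can be checked in isolation. The sketch below repeats only the centring and whitening lines on synthetic data and confirms that the whitened rows are uncorrelated with unit variance. The sources `S_true` and the mixing matrix `A_true` are hypothetical and chosen only for illustration.

```matlab
% Illustrative check of the whitening step used in the FOBI listing.
rng(0);                                      % fixed seed so the run is repeatable
d = 3;
n = 5000;
S_true = [rand(1, n); randn(1, n).^3; sign(randn(1, n))];   % hypothetical sources
A_true = [1 0.5 0.2; 0.3 1 0.4; 0.6 0.1 1];                 % hypothetical mixing matrix
X = A_true * S_true;                         % d mixtures, n samples each

X = X - repmat(mean(X, 2), 1, n);            % centre each row
[E, D] = eig(cov(X'));                       % eigen-decomposition of the covariance
X_w = sqrtm(pinv(D)) * E' * X;               % whitened data

C_w = cov(X_w');                             % covariance after whitening
whitening_error = norm(C_w - eye(d));        % distance from the identity matrix
disp(whitening_error);
```

After whitening, only a rotation remains to be found, which is what the `svd` line of the listing estimates from fourth-order statistics.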

8.3 ​APPENDIX C

MATLAB CODE FOR THE PLOTS

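% Run ICA: rows of mkt_sample are mixtures, columns are wavelengths.
[w_unmixing_matrix,ic_mkt_sample]=ica(mkt_sample);
% The mixing matrix is the inverse of the unmixing matrix.
a_mixing_matrix=inv(w_unmixing_matrix);
% Estimated (reconstructed) mixture data from all the ICs.
est_mkt_sample=a_mixing_matrix*ic_mkt_sample;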

subplot(5,3,1);
plot(est_mkt_sample(1,:),'b');
subplot(5,3,2);
plot(est_mkt_sample(2,:),'g');
subplot(5,3,3);
plot(est_mkt_sample(3,:),'cyan');
subplot(5,3,4);
plot(est_mkt_sample(4,:),'magenta');
subplot(5,3,5);
plot(est_mkt_sample(5,:),'y');
subplot(5,3,6);
plot(est_mkt_sample(6,:),'o');
subplot(5,3,7);
plot(ref_mkt_sample(2,:));
subplot(5,3,8);
plot(ref_mkt_sample(5,:));
subplot(5,3,9);
plot(ref_mkt_sample(8,:));
% hold on;
subplot(5,3,10);
plot(ic_mkt_sample(1,:));
% hold on;
subplot(5,3,11);
plot(ic_mkt_sample(2,:));
subplot(5,3,12);
plot(ic_mkt_sample(3,:));
subplot(5,3,13);
plot(ic_mkt_sample(4,:));
subplot(5,3,14);

plot(ic_mkt_sample(5,:));
subplot(5,3,15);
plot(ic_mkt_sample(6,:));

8.4​ APPENDIX D

MATLAB CODE FOR CALCULATING PEARSON COEFFICIENT

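% Pearson correlation between every IC (rows of ic_mkt_sample) and every
% reference spectrum (rows of ref_mkt_sample).
[row, column] = size(ic_mkt_sample);
ic_mkt_samplem = mean(ic_mkt_sample, 2);     % row means of the ICs
ref_mkt_samplem = mean(ref_mkt_sample, 2);   % row means of the references
for k = 1:size(ref_mkt_sample, 1)
    for j = 1:row
        nr = 0; dr1 = 0; dr2 = 0;            % reset the sums for each pair
        for i = 1:column
            nr = nr + (ic_mkt_sample(j,i)-ic_mkt_samplem(j,1)) * (ref_mkt_sample(k,i)-ref_mkt_samplem(k,1));
            dr1 = dr1 + (ic_mkt_sample(j,i)-ic_mkt_samplem(j,1))^2;
            dr2 = dr2 + (ref_mkt_sample(k,i)-ref_mkt_samplem(k,1))^2;
        end
        % Square root of nr^2 gives the magnitude |r|, so the sign is discarded.
        corr_coeff(j,k) = ((nr^2)/(dr1*dr2))^0.5;
    end
end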

8.5 ​APPENDIX E

COMPLETE MATLAB CODE

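clc;
% Run ICA on the market sample (rows = mixtures, columns = wavelengths).
[w,ic_mkt_sample]=ica(mkt_sample);

% Mixing matrix and reconstruction from all the ICs.
a=inv(w);
est_mkt_sample= a*ic_mkt_sample;

[r,c]=size(ic_mkt_sample);
[r1,c1] =size(ref_mkt_sample);

% Absolute Pearson correlation between every IC and every reference.
corr_coeff = zeros(r,r1);
for i=1:r
    for j = 1:r1
        cc = corrcoef(ic_mkt_sample(i,:),ref_mkt_sample(j,:));
        corr_coeff(i,j) = abs(cc(1,2));
    end
end
pearson_coeff = corr_coeff;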

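% Greedy matching: repeatedly take the largest remaining coefficient, assign
% that IC to that reference, then zero out its row and column.
arr=[];
for j=1:r1
    [v1,row]=max(corr_coeff(:));
    [ir,ic]=ind2sub(size(corr_coeff),row);
    arr(1,ic)=ir;                              % IC index matched to reference ic
    ic1_mkt_sample(j,:)=ic_mkt_sample(ir,:);   % keep the selected IC
    a1(:,j)=a(:,ir);                           % matching column of the mixing matrix
    corr_coeff(ir,:)=0;
    corr_coeff(:,ic)=0;
end
% Reconstruct the mixture from the selected ICs only.
est_mkt_sample= a1*ic1_mkt_sample;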

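[r2, c2] = size(mkt_sample);
[r3, c3] = size(est_mkt_sample);

% Absolute Pearson correlation between each estimated mixture (rows)
% and each original mixture (columns).
corr_coeff_mp = zeros(r3,r2);
for i=1:r3
    for j = 1:r2
        cc_mp = corrcoef(est_mkt_sample(i,:),mkt_sample(j,:));
        corr_coeff_mp(i,j) = abs(cc_mp(1,2));
    end
end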
