A REPORT
ON
BY
AT
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE
PILANI (RAJASTHAN)
Practice School Division
Duration: 8 Weeks
ID No./Name/Discipline of the student: 2016A8PS0444H, Md Neha Tabassum,
Electronics & Instrumentation
Project Title: Extraction of chemical components from herbal spectra using Independent
Component Analysis
Abstract: The need to identify hidden components in adulterated mixtures and to
assess the quality of medicines is growing steadily. Machine learning and signal
analytics have made this task faster and easier. The aim of this project is to
automate the identification of the individual hidden components in the NIR spectra
of a mixture. We propose a MATLAB implementation of Independent Component
Analysis (ICA) that finds the independent components of a mixture. The independent
components are then plotted, and spectral similarity techniques are applied to
identify the actual independent components. This can help in building an effective
and efficient on-line instrument for quality assessment in the herbal medicine
industry.
ACKNOWLEDGEMENTS
I would like to express my appreciation and gratitude to my Project Co-ordinator,
Mr. C. Kumaravelu (Senior Principal Scientist), for guiding me throughout the
project and for investing his valuable time and inputs in it.
I would also like to thank the current Scientist-In-Charge, Dr. A. Gopal, at CEERI
Chennai, CMC, Tharamani, for giving me the opportunity to work at such an
esteemed institute on a project for my PS1.
Many others, including the employees and support staff, ensured that our time at
the offices and laboratories of CEERI was enjoyable and gave us the freedom and
space to work without distractions. We would like to thank them all, and we
regret our inability to mention everyone by name.
I would also like to acknowledge the help and support of Mr. D. Palaniandi
(Student Project Co-ordinator); his warmth helped us enjoy our time in a new city
at such a prestigious institute.
Thank you.
TABLE OF CONTENTS
List of Abbreviations
1 INTRODUCTION
1.1.1 Theory
1.1.2 Applications
1.2.5 ICA
1.3.1 Theory
1.3.2 Applications
2 AIM OF THE PROJECT
3 METHODOLOGY
3.2.1 Discussion with our scientist
5 CONCLUSIONS
6 RECOMMENDATION
7 REFERENCES
8 APPENDIX
8.1 APPENDIX A
8.2 APPENDIX B
8.3 APPENDIX C
8.4 APPENDIX D
8.5 APPENDIX E
LIST OF ABBREVIATIONS
CEERI Chennai
1.INTRODUCTION
1.1.1 Theory
1.1.2 Applications
1.2 Methods used for NIR spectral analysis
The PCR method combines principal component analysis (PCA) of the spectra with
multiple linear regression (MLR) to build a quantitative model for complex samples.
PCA decomposes the original spectra matrix into several principal components:
$A = SF$, where $A$ = spectra matrix,
$S$ = scaling score matrix,
$F$ = principal component matrix.
The score matrix $S$ is assumed to be linearly related to the concentration matrix
$C$, so the next step is to regress $C$ against $S$ with the MLR model
$C = SB$
whose least-squares solution is
$B = (S^{T}S)^{-1}S^{T}C$
For unknown samples, $A$, $B$ and $F$ are known, and $S$ is obtained from
$S = AF^{T}$
after which the concentration matrix $C$ can be computed. As a modeling method, PCR
is somewhat less accurate than MLR when both are used within the calibration
range and when the model is indeed linear. However, it is often more reliable than
MLR if extrapolation is needed.[3]
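The PCR steps above can be sketched in MATLAB as follows. This is only a minimal illustration on made-up data, not code from this project: the pure spectra `P`, the concentrations `C`, the number of components `nPC` and the test sample `cTrue` are all assumptions chosen for the example, and no mean-centering is applied so that the code follows $A = SF$ literally.

```matlab
% Hypothetical synthetic data: 2 pure spectra, 20 calibration mixtures.
wl = linspace(0, 1, 50);                      % normalised wavelength axis
P  = [exp(-((wl - 0.3) / 0.1).^2);            % pure spectrum 1
      exp(-((wl - 0.7) / 0.15).^2)];          % pure spectrum 2
C  = [linspace(0.1, 1, 20)', (sin(1:20)' + 2) / 3];   % concentrations (20 x 2)
A  = C * P;                                   % spectra matrix (20 x 50)

nPC = 2;                                      % number of principal components kept
[~, ~, V] = svd(A, 'econ');
F = V(:, 1:nPC)';                             % principal component matrix (nPC x 50)
S = A * F';                                   % score matrix, S = A*F^T
B = (S' * S) \ (S' * C);                      % MLR step, B = (S^T S)^-1 S^T C

% Predict the concentrations of an unseen mixture.
cTrue = [0.4, 0.6];
aNew  = cTrue * P;                            % its (noise-free) spectrum
cPred = (aNew * F') * B;                      % scores first, then regression
```

Because the synthetic spectra are noise-free and exactly linear, `cPred` reproduces `cTrue`; with measured spectra the choice of `nPC` becomes the main tuning decision.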
The purpose of PLS is to build a linear model between the concentration matrix
$C$ and the spectral matrix $A$:
$C = AB$
In contrast to the MLR and PCR methods, PLS produces a weight
matrix $W$ for $A$ such that
$S = AW$
i.e., the columns of $W$ are weight vectors for the columns of $A$, producing the
corresponding score matrix $S$.
These weights are computed so that each of them maximizes the covariance
between the concentration matrix $C$ and the spectra matrix $A$. $C$ can then be
decomposed as
$C = SF$, where $F$ = loading matrix for $C$.
Once $F$ is computed, it follows that
$C = AB$, where $B = WF$.
PLS adopts a single-step decomposition and regression technique and is therefore
faster than PCR, although it is more abstract and more difficult to explain. However,
it uses the concentration information during the decomposition process, which
causes spectra containing higher constituent concentrations to be weighted
more heavily than those with low concentrations.[3]
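The relation between the weight, loading and regression matrices follows by direct substitution. The short derivation below uses only the symbols already defined for PLS in this section.

```latex
\begin{aligned}
S &= AW, \\
C &= SF = (AW)F = A(WF), \\
\Rightarrow\quad B &= WF .
\end{aligned}
```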
Fourier analysis transforms the spectral data from the wavelength domain into the
frequency domain, where the original spectra are approximated to a desired degree
of accuracy by sums of periodic sine and cosine functions of increasing frequency.
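As a rough illustration of this idea, the sketch below approximates a made-up spectrum by keeping only its lowest-frequency Fourier terms. The spectrum shape, the number of points `n` and the cut-off `nKeep` are assumptions for the example, not values used in this project.

```matlab
% Hypothetical spectrum: two absorption bands sampled at 256 points.
n = 256;
x = (0:n-1) / n;
spectrum = exp(-((x - 0.35) / 0.05).^2) + 0.5 * exp(-((x - 0.65) / 0.08).^2);

coeffs = fft(spectrum);                 % frequency-domain representation
nKeep  = 20;                            % number of low frequencies retained
mask   = zeros(1, n);
mask(1:nKeep + 1) = 1;                  % DC term and positive frequencies
mask(n - nKeep + 1:n) = 1;              % matching negative frequencies
approx = real(ifft(coeffs .* mask));    % sum of the retained sines and cosines

relErr = norm(spectrum - approx) / norm(spectrum);   % relative approximation error
```

Raising `nKeep` lowers `relErr`, which is the sense in which the approximation reaches "a desired degree of accuracy".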
Wavelet analysis has recently been studied in NIR spectra data processing. It
proves to be more powerful than Fourier transform in capturing the local features
of a spectrum because it can decompose a signal into components that are well
localized in both the time and frequency domains, while Fourier transform can
only reflect upon the frequency information.
However, the methods reviewed above are mainly intended for calibration and
prediction, and none is designed for blind separation of source
spectra.[3]
1.2.5 Independent Component Analysis
1.3.1 Theory
ICA belongs to a class of blind source separation (BSS) methods for separating
data into underlying informational components, where such data can take the form
of images, sounds, telecommunication channels, spectral signals or stock market
prices. It was developed in the 1990s. The term “blind” implies that
such methods can separate data into source signals even if very little is known
about the nature of those source signals. The goal of ICA is to solve BSS problems
that arise from a linear mixture.
The most basic example of ICA is the cocktail party problem, which is
illustrated in Fig 3.
Fig3. This figure explains how two independent source signals can be separated
from the mixed signals recorded by two microphones.
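The linear mixing model behind the cocktail party problem can be sketched as follows. The two source signals and the mixing matrix `A` are hypothetical. The sketch inverts a known `A` only to show what "unmixing" means; in a real blind source problem `A` is unknown and ICA has to estimate the unmixing matrix `W` from the recordings `X` alone.

```matlab
t  = linspace(0, 1, 500);
s1 = sin(2 * pi * 5 * t);               % source 1: sine wave
s2 = 2 * mod(3 * t, 1) - 1;             % source 2: sawtooth wave
S  = [s1; s2];                          % sources, one per row

A = [0.8 0.3; 0.4 0.7];                 % hypothetical mixing matrix
X = A * S;                              % signals recorded by the two microphones

W    = inv(A);                          % ideal unmixing matrix (unknown in practice)
Srec = W * X;                           % recovered sources, S = W*X
```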
1.3.2 Applications
ICA has been applied to problems in fields as diverse as speech processing, brain
imaging (e.g., fMRI and optical imaging), electrical brain signals (e.g., EEG signals),
telecommunications, and stock market prediction. However, because independent
component analysis is an evolving method that is being actively researched
around the world, the limits of what ICA may be good for have yet to be fully
explored.[13]
The achievements in the field of ICA have made life much easier. Some of the major
applications are listed below-
2. Aim of the project
Our basic interest is to analyse the NIR spectra of a given mixture (as shown in
Fig 4) and to discover its independent components (as shown in Fig 4.1
and Fig 4.2), as well as their concentrations, by using the ICA technique in
MATLAB. This is a blind source problem. Hence we cannot rely on the usual
methods such as MLR, PLS, and Fourier and wavelet transforms, which require
knowledge of the references beforehand. ICA, however, is an efficient technique
that overcomes this limitation and can remove noise more
effectively without damaging the pure signals.
After reconstructing various combinations of ICs and comparing them with the
actual mixture, we will be able to determine the real independent components along
with the adulterants, which is the primary purpose of this project.
With the help of this project we should be able to obtain the following two individual
spectra-
Fig 4.1 NIR spectra of component A Fig 4.2 NIR spectra of component B
3.Methodology
The following sequence of steps was followed in order to find and authenticate
the independent components in any given herbal medicine -
Fig5.1 NIR spectra of a sample Fig5.2 NIR spectra of corrected data
These corrections involve complex mathematical formulae, but we used a software
package called ‘UNSCRAMBLER’, which automatically returns the
corrected data once the unprocessed data is fed in.
Sometimes, to increase the accuracy of the method, the second-order derivative of
the raw data (Fig 6.1) can be used, as it makes the peaks sharper and the
information easier to extract. Also, the signal-to-noise ratio increases, as shown in
Fig 6.2.
Fig6.1 NIR spectra of banana residues[6] Fig6.2 Second derivative spectra of the
banana residues[6]
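A second-derivative spectrum can be approximated with finite differences, as in the sketch below. The wavelength axis and the synthetic band on a sloping baseline are assumptions made for illustration. Practical software usually combines the derivative with smoothing, which this sketch leaves out.

```matlab
wl  = 1000:2:2500;                                        % hypothetical wavelength axis (nm)
raw = exp(-((wl - 1450) / 60).^2) + 0.002 * (wl - 1000);  % one band on a sloping baseline

h   = wl(2) - wl(1);                    % wavelength step
d2  = diff(raw, 2) / h^2;               % central second difference
wl2 = wl(2:end-1);                      % wavelengths that d2 refers to

[~, idx] = min(d2);                     % an absorption peak becomes a sharp minimum
peakWl   = wl2(idx);                    % estimated band position
```

The linear baseline vanishes in `d2`, so the band position can be read off without a separate baseline correction.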
Initially, the scientist gave a brief description of the ongoing research on
the day of orientation. Different MATLAB codes had to be written in order to
analyse the NIR spectra. All of this was explained with the help of an example
concerning the spectral analysis of a mixture of water, starch, and protein.
In practical applications, the constituent components of mixtures may not
be known, and the ICA approach is supposed to identify these individual
components through blind separation of their spectra. In the example here,
mixtures of known components are used to illustrate and test the proposed method.
Independent Component Analysis is a technique that can be applied to any
mixture data to obtain its independent components. Here, ICA is applied to a set of
data containing 10 samples of a starch, protein and water mixture, as shown in
Fig 7.2. The ICs shown in Fig 7.3 can be obtained using the code given in Appendix
B; they can be compared with the references shown in Fig 7.1.
Fig7.2 NIR spectra of the mixture[3] Fig7.3 NIR spectra of the ICs obtained
on ICA[3]
In another example, different sugar samples containing varied amounts of sugar and
water were prepared. The NIR spectrum of each sample was then examined and
independent component analysis was performed on them. After obtaining the spectra
of the ICs and comparing them with the reference spectra of sugar and water, it was
much easier to assess the C-H, O-H, C-C, C=C and C=O bonds in the sample.
This paper was a major help in understanding the ICA algorithm. It gives a detailed
description of both the mathematical and the physical aspects of ICA.
The paper is divided into subparts-
a) Examples on Linear Mixed Signals- This section explains, with the help of
examples, the various types of linear mixtures that can be obtained during NIR
spectroscopy.
b) Setup- This part deals with the two main equations involved in ICA and also
explains the assumptions that have to be made.
Fig8. This figure is a graphical depiction of the SVD of the invertible matrix A.
We also find the covariance and the eigenvectors of the given data.
We can find V using the concept of statistical independence, i.e., by minimising the
multi-information function (a solution using information theory).
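The covariance and eigenvector computation mentioned above is commonly used to whiten the data before the remaining rotation is sought. The sketch below shows one such whitening step on made-up signals. The data and the name `Vw` for the whitening matrix are assumptions for illustration; the sketch is not taken from the paper or from Appendix B.

```matlab
% Hypothetical data: two correlated signals, one per row.
t = linspace(0, 1, 1000);
X = [0.8 0.3; 0.4 0.7] * [sin(2*pi*5*t); 2*mod(3*t, 1) - 1];

Xc = X - mean(X, 2) * ones(1, size(X, 2));    % remove the mean of each row
Cx = (Xc * Xc') / size(Xc, 2);                % covariance matrix
Cx = (Cx + Cx') / 2;                          % enforce exact symmetry
[E, D] = eig(Cx);                             % eigenvectors E, eigenvalues diag(D)
Vw = diag(1 ./ sqrt(diag(D))) * E';           % whitening matrix D^(-1/2) * E^T
Y  = Vw * Xc;                                 % whitened data

Cy = (Y * Y') / size(Y, 2);                   % covariance of Y, close to the identity
```

After whitening, the rows of `Y` are uncorrelated with unit variance, so only a rotation remains to be found from the independence criterion.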
This research paper helped us divide ICA into three simple steps, as follows-
2) A New Approach to Near-Infrared Spectral Data Analysis Using ICA
This research paper has several case studies which make the ICA of
NIR spectra much easier to understand.
One of them concerns the mixture of starch, water and protein discussed earlier in
section 3.2.1.
The other case study concerns a moisture, protein and fat mixture whose mixed
spectra are shown in Fig 9.1. The data was recorded on a Tecator Infratec Food and
Feed Analyzer working in the wavelength range 850-1050 nm by the near-infrared
transmission (NIT) principle.[3] The ICs of the mixture are displayed in Fig 9.2.
This paper also includes a case study on derivative
spectroscopy, i.e., performing ICA on second-order derivative data for the same
mixture of starch, water and protein. In this case, the second-order derivative
references are obtained as shown in Fig 10.1. Mixture data (Fig 10.2) is taken and
ICA is performed on it to obtain the three independent components in Fig 10.3.
Fig10.1 The second derivative spectra of water, starch and protein
Fig10.2 The second derivative spectra of 10 samples
3) The Anomaly Mixed Spectrum Signals Detection based on ICA and
KNN (Jin Lu, Huang Ming, Jingjing Yang) [8]
This paper deals with separating radio interference signals using time-domain
ICA. It also proposes an anomaly detection method based on ICA and KNN.
The mixing process is the same as in ICA, but FastICA is used for
unmixing. The detection flowchart for the process is shown in Fig 11.
Let Cx be the covariance matrix of X. We can calculate V using the following
formula:
Orthogonal Separation-
Z=W*Y
2) We also watched videos about how to input data for ICA in MATLAB
● Input Data for ICA [10] - This video helped us in removing errors from our
code and plotting the data.
3) We also read through a few articles on the internet about NIR, BSS, ICA, PCA etc.
3.3 Design of a MATLAB code
After having understood how ICA works, we were able to build on an existing
MATLAB code. The basic code had to be modified to suit the data that we
were working on. The designed code uses the FOBI algorithm (in Appendix B),
which considers not only the second-order correlations but also the higher-order
statistics.
The ICA code (Appendix B) includes the following steps-
i) normalising (mean-centering) the preprocessed data
ii) finding the eigenvalues and eigenvectors of the mixture samples
iii) discovering the mixing and unmixing matrices
iv) checking the data for kurtosis, linearity and non-gaussianity
Finally, the independent components are found using the formulae for ICA.
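The steps listed above can be sketched with a textbook form of FOBI (fourth-order blind identification). This is an illustrative sketch under stated assumptions, not the Appendix B code: the two synthetic sources, the mixing matrix and all variable names (`Strue`, `Vw`, `M`, `kurt`) are hypothetical. FOBI also assumes that the sources have different kurtosis values.

```matlab
% Hypothetical sources (one per row) with different kurtosis values.
T = 5000;
k = 0:T-1;
Strue = [sin(2 * pi * 3 * k / T);                 % sine wave
         (2 * mod(k, 100) / 100 - 1).^3];         % cubed sawtooth
X = [0.8 0.3; 0.4 0.7] * Strue;                   % hypothetical mixtures

% i) mean-centre each mixture
Xc = X - mean(X, 2) * ones(1, T);

% ii) eigen-decomposition of the covariance matrix, then whitening
Cx = (Xc * Xc') / T;
Cx = (Cx + Cx') / 2;
[E, D] = eig(Cx);
Vw = diag(1 ./ sqrt(diag(D))) * E';               % whitening matrix
Z  = Vw * Xc;                                     % whitened data

% iii) FOBI step: eigenvectors of the norm-weighted covariance E[|z|^2 z z^T]
normSq = sum(Z.^2, 1);                            % squared norm of each sample
M = (Z .* (ones(size(Z, 1), 1) * normSq)) * Z' / T;
M = (M + M') / 2;
[U, ~] = eig(M);
W = U' * Vw;                                      % unmixing matrix
A = inv(W);                                       % mixing matrix
Sest = W * Xc;                                    % independent components

% iv) kurtosis of each IC (a Gaussian signal would give 3)
kurt = mean(Sest.^4, 2) ./ mean(Sest.^2, 2).^2;
```

The weighting by `normSq` in step iii is what brings fourth-order statistics into the decomposition. The recovered components are defined only up to order, sign and scale.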
The plots of the ICs and the estimated mixture data are obtained using the MATLAB
code given in Appendix C. This code calls the ICA function and plots the graphs
using the subplot command in MATLAB.
After manual examination of the plots, the true ICs are selected and recombined
to obtain the ‘estimated mixture’. If the references are not known, all
possible combinations of ICs are tried, giving various
estimated mixtures. These estimated mixtures are then plotted along with the
actual mixtures to obtain the final ICs.
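Reconstructing an estimated mixture from a chosen subset of ICs amounts to multiplying the matching columns of the mixing matrix by the selected IC rows. The sketch below illustrates this on made-up data. The matrices `A` and `S`, the selection `keep` and the helper `pearson` are hypothetical and are not taken from the appendices.

```matlab
% Hypothetical ICs (one per row) and mixing matrix (4 mixtures x 4 ICs).
k = 0:199;
S = [sin(2*pi*k/200); cos(2*pi*3*k/200); sin(2*pi*7*k/200); cos(2*pi*11*k/200)];
A = [1.0 0.5 0.02 0.01;
     0.6 0.9 0.01 0.03;
     0.7 0.4 0.02 0.02;
     0.3 1.1 0.01 0.01];
X = A * S;                                  % actual mixtures

keep = [1 2];                               % indices of the ICs judged to be genuine
Xest = A(:, keep) * S(keep, :);             % estimated mixture from those ICs only

% Pearson correlation between each actual mixture and its estimate.
pearson = @(a, b) sum((a - mean(a)) .* (b - mean(b))) / ...
          sqrt(sum((a - mean(a)).^2) * sum((b - mean(b)).^2));
cc = zeros(size(X, 1), 1);
for i = 1:size(X, 1)
    cc(i) = pearson(X(i, :), Xest(i, :));
end
```

Values of `cc` close to 1 suggest that the discarded ICs contribute little to the mixtures, which is the check used when deciding which ICs to keep.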
3.4.2 Mathematical Spectral Similarity Measures
Fig12. Graphical Interpretation of Correlation Coefficient
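For reference, the Pearson correlation coefficient between an IC spectrum $x$ and a reference spectrum $y$, each sampled at $n$ wavelengths, is commonly written as shown below. The symbols $x_i$, $y_i$, $\bar{x}$ and $\bar{y}$ (the sample values and their means) are introduced here for illustration; the code in Appendix D computes the magnitude of this quantity.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}},
\qquad -1 \le r \le 1 .
```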
4.Results and Discussions
Fig13.1 (6x2201)tp_mix
OUTPUT PLOTS-
Giving the above mixture data as input to independent component analysis
and running the MATLAB code produces the following plots.
Fig 13.3 Plot of the references (sref(1)-sref(3)) and ICs (s(1)-s(6))
Interpretation-
In Fig 13.2 above, the upper six graphs are the spectra of the mixture. We notice
that they are almost identical apart from an upward or downward shift, because they
are all mixtures of the same independent components, differing only in the
packing fraction and density of each layer. The absorption ranges from 0 to 1 and
the wavelength from 0 to 2500.
The lower six graphs are the spectra of the estimated mixture. The wavelength
range is the same for both, but the amplitude varies slightly. In the last two plots
there is a large variation in magnitude due to noise interference. The comparison of
the actual and the estimated mixtures is to be carried out in the remaining part of the
project.
In Fig 13.3, the top three plots are the reference spectra of the independent
components. In this case the references are already known. Applying
the formulas for ICA yields the ICs, which are plotted in the lower
part (the last six plots) of Fig 13.3.
4.1.2 Input Data 2
Problem Statement- Five market samples of the same drug, trikatu, are provided
along with three references. Determine the most authentic market sample and
the chemical composition of the independent components.
1)IMP -
This is the mixture data given as input to ICA. Each row holds the absorption of
one mixture at successive wavelengths: for instance, row 1 is mixture 1, and each
column in row 1 is the absorption value of that mixture at the corresponding
wavelength. The dimensions of the data are 5x2201.
Fig14.1 (5x2201)imp
Independent Component(s) of IMP -
Fig14.3 (5x2201)s_imp
Based upon the values of the correlation coefficient, we have taken IC2, IC3 and
IC4 as the three independent components (ICs).
After reconstruction using IC2, IC3 and IC4 we get the following data -
Fig14.7 (5x2201)x_imp
Discussion: After obtaining five ICs by performing ICA on the IMP sample, we
computed the correlation coefficients between the ICs and the references (Fig
14.5). From these values, the three ICs that most closely match the
references (CC close to 1) were selected. To validate these three ICs, we
reconstructed the mixture from them and compared it with the actual IMP mixture,
which gave another matrix of correlation coefficients, shown in Fig 14.9. As these
correlation coefficients are nearly equal to 1, we have taken IC2, IC3 and IC4
as the references for the rest of the market samples. The new references
for the other market samples are therefore IC2 (R1), IC4 (R2) and IC3 (R3).
2)AAV -
This is the mixture data given as input to ICA. Each row holds the absorption of
one mixture at successive wavelengths: for instance, row 1 is mixture 1, and each
column in row 1 is the absorption value of that mixture at the corresponding
wavelength. The dimensions of the data are 5x2201.
Independent Component(s) of AAV -
Fig15.3 (5x2201)s_aav
Based upon the values of the correlation coefficient, we have chosen IC4, IC1 and
IC5 as the independent components of AAV.
After reconstruction using IC1, IC4 and IC5 we get the following data -
Discussion: From Fig 15.5, we have chosen three ICs, namely IC4, IC1 and IC5,
which give the highest correlation coefficient values with R1, R2 and R3
respectively. These three ICs are recombined to obtain the estimated mixture (Fig
15.7), which is compared with the IMP mixture (Fig 14.3) to obtain a matrix of
correlation coefficients (Fig 15.9). From Fig 15.9, we can infer that AAV is related
to IMP with an average strength of around 0.76.
3) ZIG -
This is the mixture data given as input to ICA. Each row holds the absorption of
one mixture at successive wavelengths: for instance, row 1 is mixture 1, and each
column in row 1 is the absorption value of that mixture at the corresponding
wavelength. The dimensions of the data are 5x2201.
Independent Component(s) of ZIG -
Fig16.3 (5x2201)s_zig
Fig16.5 (5x3)r_zig(Rows-IC,Columns-References)
Based upon the values of the correlation coefficient, we have chosen IC3, IC2 and
IC4 as the independent components of ZIG.
After reconstruction using IC1, IC4 and IC5 we get the following data -
Discussion: From Fig 16.5, we have chosen three ICs, namely IC3, IC1 and IC4,
which give the highest correlation coefficient values with R1, R2 and R3
respectively. These three ICs are recombined to obtain the estimated mixture (Fig
16.7), which is compared with the IMP mixture (Fig 14.3) to obtain a matrix of
correlation coefficients (Fig 16.9). From Fig 16.9, we can infer that ZIG is related
to IMP with an average strength of around 0.91.
4) AAH -
This is the mixture data given as input to ICA. Each row holds the absorption of
one mixture at successive wavelengths: for instance, row 1 is mixture 1, and each
column in row 1 is the absorption value of that mixture at the corresponding
wavelength. The dimensions of the data are 5x2201.
Independent Component(s) of AAH-
Fig17.3 (5x2201)s_aah
Fig17.5 (5x3)r_aah(Rows-IC,Columns-References)
Based upon the values of the correlation coefficient, we have chosen IC4, IC1 and
IC3 as the independent components of AAH.
After reconstruction using IC4, IC1 and IC3 we get the following data -
Discussion: From Fig 17.5, we have chosen three ICs, namely IC4, IC1 and IC3,
which give the highest correlation coefficient values with R1, R2 and R3
respectively. These three ICs are recombined to obtain the estimated mixture (Fig
17.7), which is compared with the IMP mixture (Fig 14.3) to obtain a matrix of
correlation coefficients (Fig 17.9). From Fig 17.9, we can infer that AAH is related
to IMP with an average strength of around 0.99.
5.CONCLUSION
6.RECOMMENDATIONS
ICA has been of great help in identifying the independent components in a
herbal mixture. However, when the number of independent components is greater
than the number of mixture samples, ICA can identify only the dominant ICs,
i.e., those with higher magnitude. The use of other techniques such as PCA might
address this problem.
The Pearson correlation coefficient has been used to compare the spectra, and it
sometimes gives ambiguous results, such as low or similar correlation coefficient
values.
7.REFERENCES
8.APPENDIX
8.1 APPENDIX A
8.2 APPENDIX B
8.3 APPENDIX C
% Run ICA on the market-sample mixtures; ica() is the function from Appendix B.
[w_unmixing_matrix,ic_mkt_sample]=ica(mkt_sample);
a_mixing_matrix=inv(w_unmixing_matrix);          % mixing matrix A = inv(W)
est_mkt_sample=a_mixing_matrix*ic_mkt_sample;    % estimated mixture = A*S
% Panels 1-6: the six estimated mixtures, one line style per panel.
styles={'b','g','cyan','magenta','y','o'};       % 'o' draws circle markers
for n=1:6
    subplot(5,3,n);
    plot(est_mkt_sample(n,:),styles{n});
end
% Panels 7-9: reference spectra (rows 2, 5 and 8 of ref_mkt_sample).
ref_rows=[2 5 8];
for n=1:3
    subplot(5,3,6+n);
    plot(ref_mkt_sample(ref_rows(n),:));
end
% Panels 10-15: the six independent components.
for n=1:6
    subplot(5,3,9+n);
    plot(ic_mkt_sample(n,:));
end
8.4 APPENDIX D
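% Absolute Pearson correlation between each IC (rows of ic_mkt_sample) and
% each reference (rows of ref_mkt_sample). The row means and the loop over k
% are assumed here; they do not appear in the listing as extracted.
[row,column]=size(ic_mkt_sample);
ic_mkt_samplem=mean(ic_mkt_sample,2);       % mean of each IC
ref_mkt_samplem=mean(ref_mkt_sample,2);     % mean of each reference
for j=1:row
    for k=1:size(ref_mkt_sample,1)
        nr=0;dr1=0;dr2=0;
        for i=1:column
            nr=nr+((ic_mkt_sample(j,i)-ic_mkt_samplem(j,1)).*(ref_mkt_sample(k,i)-ref_mkt_samplem(k,1)));
            dr1=dr1+((ic_mkt_sample(j,i)-ic_mkt_samplem(j,1)).^2);
            dr2=dr2+((ref_mkt_sample(k,i)-ref_mkt_samplem(k,1)).^2);
        end
        corr_coeff(j,k)=(((nr).^2)/(dr1.*dr2))^0.5;
    end
end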
8.5 APPENDIX E
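clc;
% Run ICA; ica() is the function from Appendix B.
[w,ic_mkt_sample]=ica(mkt_sample);
a=inv(w);                                   % mixing matrix
est_mkt_sample= a*ic_mkt_sample;
[r,c]=size(ic_mkt_sample);
[r1,c1] =size(ref_mkt_sample);
% Absolute Pearson correlation between every IC and every reference.
cc = [];
for i=1:r
    for j = 1:r1
        cc = corrcoef(ic_mkt_sample(i,:),ref_mkt_sample(j,:));
        corr_coeff(i,j) = abs(cc(1,2));
    end
end
pearson_coeff = corr_coeff;                 % keep a copy; corr_coeff is zeroed below
% Greedy matching: repeatedly take the largest remaining coefficient and
% assign that IC to that reference.
arr=[];
for j=1:r1
    [v1,row]=max(corr_coeff(:));
    [ir,ic]=ind2sub(size(corr_coeff),row);
    arr(1,ic)=ir;                           % reference ic is matched by IC ir
    ic1_mkt_sample(j,:)=ic_mkt_sample(ir,:);
    a1(:,j)=a(:,ir);
    corr_coeff(ir,:)=0;                     % exclude this IC and this reference
    corr_coeff(:,ic)=0;
    v1=0;
end
% Reconstruct the mixture from the selected ICs only.
est_mkt_sample= a1*ic1_mkt_sample;
% Correlation between each estimated mixture and each actual mixture.
r2=size(est_mkt_sample,1);                  % number of estimated mixtures
r3=size(mkt_sample,1);                      % number of actual mixtures
cc_mp = [];
for i=1:r2
    for j = 1:r3
        cc_mp = corrcoef(est_mkt_sample(i,:),mkt_sample(j,:));
        corr_coeff_mp(i,j) = abs(cc_mp(1,2));
    end
end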