
Eur Food Res Technol (2010) 230:497–511
DOI 10.1007/s00217-009-1185-y

ORIGINAL PAPER

Analysing sensory panel performance in a proficiency test using the PanelCheck software
Oliver Tomic · Giorgio Luciano · Asgeir Nilsen · Grethe Hyldig · Kirsten Lorensen · Tormod Næs

Received: 5 May 2009 / Revised: 28 October 2009 / Accepted: 9 November 2009 / Published online: 2 December 2009
© Springer-Verlag 2009

Abstract This paper discusses statistical methods and a workflow strategy for comparing performance across multiple sensory panels that participated in a proficiency test (also referred to as an inter-laboratory test). Performance comparison and analysis are based on a data set collected from 26 sensory panels carrying out profiling on the same set of candy samples. The candy samples were produced according to an experimental design using design factors such as sugar and acid level. Because of the exceptionally large amount of data and the availability of multiple statistical and graphical tools in the PanelCheck software, a workflow is proposed that guides the user through the data analysis process. This allows practitioners and non-statisticians to get an overview over panel performances in a rapid manner without the need to be familiar with details of the statistical methods. Visualisation of data analysis results plays an important role, as this provides a time-saving and efficient way of screening and investigating sensory panel performances. Most of the statistical methods used in this paper are available in the open source software PanelCheck, which may be downloaded and used for free.

Keywords Proficiency test · Inter-laboratory test · Sensory profiling · Performance visualisation · PanelCheck

Introduction Trained sensory panels are important tools for assessing the quality of food and non-food products. There are, however, a number of problems related to the training, stability, and maintenance of the quality of such panels. A number of methods have been developed that may help to achieve better panel performance [1–5]. These techniques can detect lack of precision (repeatability), disagreement (reproducibility), and the ability or inability to discriminate between samples. This type of information is very useful for improving data quality in future sessions through increased and more targeted training on problematic issues. Larger companies maintaining sensory panels at multiple geographic locations are often subject to additional challenges. For example, thoroughly carried out quality control and product development require that all sensory panels are well calibrated with one another, eliminating potential shifts between the panels and allowing for comparison of their results. When multiple sensory panels are to evaluate the same set of samples, global performance issues (across multiple sensory panels) might add to already existing local performance issues (within one sensory panel). This further complicates comparison of results from each involved panel. Techniques for proficiency tests are available, but most of them were developed for classical chemical inter-laboratory comparisons (see, e.g. [6]) and with less focus on some of the more specific aspects of sensory analysis such as those indicated above. Important contributions to the proficiency test literature are available [7–9]. In these papers, classical ANOVA,

O. Tomic (✉) · G. Luciano · A. Nilsen · T. Næs
Nofima Mat AS, Osloveien 1, 1430 Ås, Norway
e-mail: oliver.tomic@nofima.no

G. Hyldig
DTU Aqua, National Institute of Aquatic Resources, Technical University of Denmark, Søltofts Plads, Build. 221, 2800 Lyngby, Denmark

K. Lorensen
Chew Tech I/S, Vejle, Denmark



Principal Component Analysis (PCA), Multiple Factor Analysis (MFA), and Generalised Procrustes Analysis (GPA) are used for studying intra- and inter-laboratory variation. The main focus of the present paper is to discuss and to illustrate how techniques developed specifically for performance visualisation of a single sensory panel [5] can also be applied for comparing multiple panels. Some of the techniques are related to the methods mentioned above, while others are new in this context. Univariate as well as multivariate statistical methods will be presented and used in this paper. The univariate methods highlight differences for each attribute separately, while the multivariate methods look at differences at a more general level, taking into account also correlations between the attributes. All presented techniques are graphically oriented and should therefore be easy to understand for practitioners and non-statisticians. A major issue is to stress how the techniques can be used to highlight or visualise various types of differences between the assessors and the panels. Furthermore, a workflow suggesting how to progress with the data analysis and how to use the methods available in the PanelCheck software will also be proposed. This allows for rapid and efficient analysis of sensory profiling data, both in the case of one panel and of multiple panels. The software provides an intuitive and easy-to-use graphical user interface that handles all statistical computations in the background and visualises results in different types of plots. This enables the practitioner and non-statistician to concentrate on performance analysis rather than spending time on trying to apply algorithms to data by themselves. The open source PanelCheck software may be downloaded, distributed, and used for free (http://www.panelcheck.com) [10].

Experimental The dataset discussed here is the result of a joint project between Danish, Norwegian, Swedish, and English research institutes and commercial companies. In all, 26 panels were involved in the project (research as well as industry panels), with one of the aims being to investigate the performance of multiple sensory panels with the PanelCheck software. The samples studied were five candies (wine gums) produced according to an experimental design with two design factors, i.e. sugar level and acid content: A1 (high sugar, low acid), A2 (high sugar, high acid), B (medium sugar, low acid), C1 (low sugar, low acid), C2 (low sugar, high acid). All samples were produced at LEAF Denmark. The evaluation of the samples had to be performed within 1 month after production. LEAF Denmark guaranteed that the samples did not change their sensory properties within this

period. The candy samples were tested by each of the 26 participating panels. Each sensory panel received detailed instructions about sample preparation and evaluation. The sensory panel at LEAF performed sensory profiling on the samples and suggested nine sensory attributes, which the remaining 25 sensory panels were to use for profiling. Two samples (A1, C1) were used for training and calibration by all sensory panels. Sample C2 was used as a reference sample for maximum intensity of the attribute acidic flavour. For the remaining attributes, either sample A1 or C1 was used as a reference for low or high intensity. All attributes were evaluated on an intensity scale from 0 (no intensity) to 15 (high intensity). Water was used to clean the palate between samples. The nine attributes used to describe the samples were: transparency, acidic flavour, sweet taste, raspberry flavour, sugar coat (the thickness of the sugar peel visible on the cut wine gum piece), biting strength in the mouth (referred to as biting), hardness, elasticity in the mouth (referred to as elasticity), and sticking to teeth in the mouth (referred to as sticking). Each of the 5 samples was evaluated in 3 replicates, resulting in a total of 15 samples to be tested by each panel. One piece of wine gum weighed 3.5 g. In each serving, the assessors got four to five pieces, of which one was cut in half by the sensory staff, allowing the assessors to score the appearance attributes. For those panels that did not have access to specific software for automatic randomisation of candy samples, a Latin square design was provided as an example for serving order. All 26 sensory evaluations took place in June 2007. Table 1 shows an overview over the 26 panels indicating their number of assessors, the size of the data matrix of each panel, and the size of the data used for the first part of the analysis that included all panels.
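A cyclic Latin square serving order like the one mentioned above can be generated in a few lines. This is a hedged sketch, not the actual design sheet distributed in the study; only the sample codes A1–C2 are taken from the text.

```python
def latin_square(items):
    """Generate a cyclic Latin square: row i is the item list shifted by i.

    Each item appears exactly once in every row and every column, so
    serving order is balanced across assessors (rows) and positions
    (columns)."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

# Serving orders for the five candy samples of the experimental design.
orders = latin_square(["A1", "A2", "B", "C1", "C2"])
for row in orders:
    print(row)
```

For panels with more assessors than samples, the rows can simply be recycled; a Williams design would additionally balance carry-over effects, which this simple cyclic construction does not.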

Methods In the following section, the univariate and multivariate statistical methods used for data analysis will be discussed. The results of these methods are visualised in various plots, helping non-statisticians to visually detect performance issues without having to know all details of the statistical methods. It should be emphasised that the real strength of these methods is revealed only when using them together. Each plot has its own special feature that represents an element of unique information, but their joint information content is what really provides a holistic overview over the performance of the investigated panels. The methods will be presented in an order that complies with the suggested data analysis workflow (see Workflow strategy).


Table 1 Overview over all sensory panels that participated in the proficiency test

Sensory panel | Number of assessors | Number of data rows in raw data (J*M*I) | Number of data rows used in global analysis
P01 | 7 | 105 | 15
P02 | 11 | 165 | 15
P03 | 8 | 120 | 15
P04 | 11 | 165 | 15
P05 | 10 | 150 | 15
P06 | 15 | 225 | 15
P07 | 9 | 135 | 15
P08 | 3 | 45 | 15
P09 | 11 | 165 | 15
P10 | 8 | 120 | 15
P11 | 7 | 105 | 15
P12 | 7 | 105 | 15
P13 | 5 | 75 | 15
P14 | 7 | 105 | 15
P15 | 8 | 120 | 15
P16 | 8 | 120 | 15
P17 | 9 | 135 | 15
P18 | 8 | 120 | 15
P19 | 6 | 90 | 15
P20 | 7 | 105 | 15
P21 | 6 | 90 | 15
P22 | 10 | 150 | 15
P23 | 11 | 165 | 15
P24 | 7 | 105 | 15
P25 | 10 | 150 | 15
P26 | 4 | 60 | 15
Total | 213 | 3,195 | 390

J, M, I: the number of tested products, replicates and assessors, respectively

The same workflow may be applied for one sensory panel at a time as well as for multiple sensory panels. In this sense, one needs to think in terms of groups and individuals such that the statistical methods may be applied appropriately. When analysing performance of one sensory panel, the panel as a whole represents the group level while the assessors represent the individual level. This changes, however, when applying the same methods on data from multiple panels. Here, the group of 26 panels represents the group level whereas each single panel represents the individual level. In other words, the group of 26 panels will be treated as one large panel with each panel representing one assessor. How this is done in practice will be elaborated later in Data merging. In the description of the statistical methods in Mixed model ANOVA for assessing the importance of attributes to Profile and line plots below (considering performance analysis of only one panel), we let j = 1,…,J denote the number of samples tested, m = 1,…,M the number of replicates, k = 1,…,K the number of attributes and

i = 1,…,I the number of assessors. We let Xi denote the data matrix of assessor i with J*M rows and K columns. That means that Xi of each assessor is of dimension (J*M) × K in any of the 26 data sets. For the candy data set, the dimension of Xi then is (5*3) × 9, with J = 5 samples, M = 3 replicates and K = 9 attributes. Mixed model ANOVA for assessing the importance of attributes As a first step, mixed model (2- or 3-way) ANOVA can be used for assessing the importance of the used sensory attributes in detecting significant differences between the samples. The method is based on modelling samples, assessors and their interactions in two-way ANOVA, or samples, assessors, replicates and their interactions in three-way ANOVA, and then testing for the sample effect by the use of regular F tests. In each case (2-way or 3-way), the assessor and interaction effects are assumed to be random [11]. Only attributes that are significant at a certain level (in the case presented here, a 5% significance


level was chosen) for the product effect are considered for further analysis. In the case of the candy data, two-way ANOVA was used since the replicates of the tested samples were served in random order. If sample replicates were served systematically, say one replicate per session, three-way ANOVA (with main effects for assessor, sample and replicate and interactions between them) should be considered instead. The reason for this is that by testing the replicates in separate sessions, it is likely that additional systematic variance between the replicates will be introduced into the data. The replicate effect in three-way ANOVA then indicates whether a significant systematic session-based variation in the data is present or not. Tucker-1 plots In the next step, the multivariate analysis method Tucker-1 [12, 13] is applied in order to get an overview over assessor and panel performance using multiple attributes. Tucker-1 is essentially PCA on an unfolded data matrix consisting of all individual matrices Xi,av aligned horizontally. Here, Xi,av represents the matrix of one assessor of dimension J × K, where the sample score is based on the average across replicates, hence indicated with av in the index. This means that the dimension of this unfolded matrix is J × (I*K). In the case of our candy data set the dimension would be 5 × (10*9), with J = 5 samples and K = 9 attributes, if the panel consists of I = 10 assessors [as is the case, for example, for panels P05, P22 and P25 (see Table 1)]. I will of course vary according to the number of assessors in the panel, and consequently so will the dimension (I*K). PCA on this unfolded matrix provides two types of plots that are of interest: a common scores plot and a correlation loadings plot. The common scores plot shows how the tested J samples relate to each other, i.e. it visualises similarities and dissimilarities between the samples along the found principal components.
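As a hedged illustration of the mixed-model F test described at the start of this section: with assessors random, the sample effect is tested against the sample × assessor interaction. The data below are synthetic, not the study's, and the sketch covers one attribute of the two-way case only.

```python
import numpy as np
from scipy.stats import f as f_dist

def mixed_anova_sample_F(X):
    """Two-way mixed-model ANOVA F test for the sample effect.

    X has shape (J samples, I assessors, M replicates) for ONE attribute.
    With assessor and interaction effects random, the sample effect is
    tested as F = MS_sample / MS_interaction."""
    J, I, M = X.shape
    grand = X.mean()
    sample_means = X.mean(axis=(1, 2))     # one mean per sample
    assessor_means = X.mean(axis=(0, 2))   # one mean per assessor
    cell_means = X.mean(axis=2)            # sample x assessor cell means
    ss_sample = I * M * ((sample_means - grand) ** 2).sum()
    ss_inter = M * ((cell_means
                     - sample_means[:, None]
                     - assessor_means[None, :] + grand) ** 2).sum()
    df_sample, df_inter = J - 1, (J - 1) * (I - 1)
    F = (ss_sample / df_sample) / (ss_inter / df_inter)
    return F, f_dist.sf(F, df_sample, df_inter)

# Synthetic scores: 5 samples, 10 assessors, 3 replicates, with a clear
# sample effect plus assessor level offsets and noise.
rng = np.random.default_rng(1)
effect = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X = (effect[:, None, None]
     + rng.normal(0, 1, (1, 10, 1))      # assessor offsets
     + rng.normal(0, 0.5, (5, 10, 3)))   # residual noise
F, p = mixed_anova_sample_F(X)
print(F, p)
```

In a real screening, this test would be run once per attribute and the p values compared against the chosen significance level.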
This plot gives no direct information on assessor or panel performance, but it is a valuable visualisation tool that helps the user to roughly and quickly investigate whether the panel could distinguish between the samples or not by taking the explained variances into account. If the explained variance in the first few (usually two) PCs is relatively high, large systematic variation is present in the data, which again may indicate that the panel discriminates well between the samples. Note that the explained variance for a Tucker-1 common scores plot generally is somewhat lower for the first few PCs compared to those from PCA on the ordinary consensus average matrix. This is because the Tucker-1 analysis is based on many more variables and therefore more noise is present in the data.
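The unfolding and common scores described above can be sketched as follows; the toy data are synthetic, and PanelCheck's actual implementation may differ in details.

```python
import numpy as np

def tucker1_scores(X_list, n_pc=2):
    """Tucker-1: PCA on the horizontally unfolded matrix.

    X_list holds one J x K replicate-averaged matrix per assessor;
    stacking them side by side gives a J x (I*K) matrix. PCA (via SVD
    of the column-centred matrix) yields common scores for the J
    samples and the explained variance ratio per component."""
    U = np.hstack(X_list)              # J x (I*K)
    Uc = U - U.mean(axis=0)            # centre each column
    u, s, vt = np.linalg.svd(Uc, full_matrices=False)
    scores = u[:, :n_pc] * s[:n_pc]    # common scores
    expl_var = s ** 2 / (s ** 2).sum()
    return scores, expl_var[:n_pc]

# Toy example: 10 assessors scoring J = 5 samples on K = 9 attributes,
# sharing one underlying sample structure plus individual noise.
rng = np.random.default_rng(0)
true_pattern = rng.normal(0, 2, (5, 9))
X_list = [true_pattern + rng.normal(0, 0.3, (5, 9)) for _ in range(10)]
scores, expl = tucker1_scores(X_list)
print(scores.shape, expl)   # scores has shape (5, 2)
```

Because the assessors share the same underlying pattern, the first components capture most of the variation, mirroring the "high explained variance indicates good discrimination" reading given above.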

The correlation loadings plot provides performance information on each assessor and the sensory panel as a whole. The plot contains I*K dots, with each dot representing one assessor-attribute combination (e.g. attribute sweet taste of assessor 5, etc.). By highlighting different dots, either those of one assessor or those of one attribute, one can visualise the performance of individual assessors or the whole panel. The position of the dots within the plot provides information on how well an individual or the panel as a whole performs. The more noise the attribute of a particular assessor contains, the closer the dot will appear to the origin, i.e. the middle of the plot. The more systematic information an attribute of an assessor contains, the closer it will appear to the outer ellipse (100% explained variance for that attribute, see Fig. 4). The inner ellipse represents 50% explained variance and can be considered a rule-of-thumb lower boundary for how much explained variance an attribute should at least have to be considered good enough. It is recommended to consult also higher PCs, since some assessors might have much systematic variance in dimensions other than PC1 and PC2 and thus initially appear noisy. Detailed information on the statistical aspects and interpretations of the Tucker-1 common scores plot and correlation loadings plots is given in [3]. Manhattan plots Manhattan plots in general provide an alternative way to visualise systematic variation in data sets, as described earlier [3]. They can be considered a screening tool for quick identification of assessors that perform very differently from the other assessors. The information visualised by Manhattan plots may be computed with different statistical methods. In this paper, the Manhattan plots visualise information as implemented in the PanelCheck software.
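The correlation loadings described above can be computed directly: each assessor-attribute column of the unfolded matrix is correlated with each common score vector, and the squared correlations over PC1 and PC2 give the explained variance that places a dot relative to the 50% and 100% ellipses. A self-contained sketch on synthetic numbers:

```python
import numpy as np

def correlation_loadings(U, scores):
    """Correlation loadings: Pearson correlation between every column
    of the unfolded matrix U (one column per assessor-attribute
    combination) and every common score vector. Squared and summed over
    the plotted PCs, this is the explained variance that positions a
    dot between the inner (50%) and outer (100%) ellipses."""
    Uc = U - U.mean(axis=0)
    Sc = scores - scores.mean(axis=0)
    num = Uc.T @ Sc
    denom = np.outer(np.linalg.norm(Uc, axis=0),
                     np.linalg.norm(Sc, axis=0))
    return num / denom            # shape (I*K, n_pc)

# Tiny illustration: J = 5 samples, 3 columns; scores taken from the
# SVD of the centred matrix itself.
rng = np.random.default_rng(2)
U = rng.normal(size=(5, 3))
Uc = U - U.mean(axis=0)
u, s, vt = np.linalg.svd(Uc, full_matrices=False)
L = correlation_loadings(U, u[:, :2] * s[:2])
print(np.round(L, 3))
```

With orthogonal score vectors, the squared loadings per column sum to at most 1 over the retained PCs, which is what makes the 50% and 100% ellipses meaningful boundaries.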
This means that PCA is applied on the individual data matrices Xi,av and the explained variance for each attribute is then visualised in the Manhattan plots. For the candy data at hand, I*K explained variances will be given. This means if the panel consists of say I = 10 assessors, the number of explained variances would be 10*9 given K = 9 attributes. Manhattan plots (see example in Fig. 6) visualise, in shades of grey, how much of the variability of each attribute and each assessor can be explained by the principal components (vertical axis). A dark colour tone indicates that only a small proportion of the variance has been explained, while a light colour tone indicates the opposite. Extreme points are black (0% explained variance) and white (100% explained variance). Typically, the colour will be darker for PC1 and then get lighter with each additional PC from top to bottom as the explained variance shown is cumulative over each PC. In other words, the


explained variance at PC3 is the sum of the explained variances of PC1, PC2 and PC3. The lighter a colour tone in a Manhattan plot is for a specific assessor-attribute combination, the more systematic variation is present. The explained variances may be sorted either by assessor or by attribute, depending on the main focus of the investigation. When interested in checking performance between assessors, one may investigate a total of I plots consisting of K columns, where each plot represents one assessor and each column within the plots represents one attribute. Here, one may look for similar colour patterns among the assessors and detect assessors that differ much from the others. If interested in how well an attribute is understood and used by the panel, one may consider a total of K plots consisting of I columns, where each plot represents one attribute and each column represents one assessor. Here, one may investigate whether an attribute achieves high explained variances with only a few PCs or whether many PCs are necessary. Moreover, it can be detected whether some assessors have more systematic variance with fewer PCs than other assessors. In this sense, Manhattan plots may be used as a screening tool for quick detection of assessors that behave very differently or attributes that are not well explained relative to one another. Both plotting variants are implemented in PanelCheck. More detailed information on the statistical aspects and interpretations of Manhattan plots is presented in [3]. Plots based on one-way ANOVA A discussion of the one-way ANOVA model for panel performance can be found in [5, 14]. From the one-way ANOVA model, we obtain three statistical quantities (F, p and MSE values) that are used to generate the so-called F plot, MSE plot and p*MSE plot as available in the PanelCheck software.
These three statistical quantities are acquired by applying one-way ANOVA on each individual data matrix Xi and provide information on the sample discrimination and repeatability of each assessor. The three plots are described in more detail below. F plot F plots are based on F values, which contain information on the discrimination performance of each assessor. A total of I*K F values are computed and may be presented in a bar diagram with each bar representing the attribute of one specific assessor. The bar diagram can be accompanied by horizontal lines indicating different significance levels. Typically, the 1 and 5% levels of significance are used for this purpose. Generally, the higher the F value of an individual assessor, the greater the ability of that assessor

to discriminate between tested samples. If differences between the tested samples are present, one should expect the assessors to obtain high F values, ideally higher than those corresponding to the 1 and 5% levels of significance. MSE plot The MSE values are the mean square errors (random error variance estimates) from the one-way ANOVA model. They can be used as a measure of repeatability for each assessor. A total of I*K MSE values are computed and can be plotted in a bar diagram very similar to the F values in the F plot. If an assessor almost perfectly repeats her/himself, this value should be close to zero. The poorer the repeatability of a certain assessor, the higher his/her MSE will be. The MSE value should, however, always be considered together with the F values in order to get a realistic overview over the assessors' performance. An assessor aiming for low MSE values can achieve this by scoring all samples nearly alike, thus reducing differences between replicates. However, such an assessor will clearly have no discriminative power in the analysis, as the respective F values will also be very low. If differences between the samples are present, an assessor should ideally have high F values and low MSE values. p*MSE plots In a p*MSE plot [15], the assessors' ability to detect differences between samples is plotted against their repeatability, using the p values and MSE values from the one-way ANOVA calculations. A total of I*K pairs of p and MSE values are computed and plotted in a scatter plot. They can be presented together in various ways (for instance all at the same time, only for one attribute at a time or only for one assessor at a time) and with highlighting of the assessors or attributes that one is particularly interested in. In an ideal situation, all assessors should achieve low p values and low MSE values for all attributes [15] if differences between the samples are really present, thus ending up in the lower left corner of the plot.
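The F, p and MSE values behind these three plots can be sketched for one assessor-attribute combination as follows; the two synthetic assessors illustrate the discriminating versus "scoring all samples alike" cases discussed above.

```python
import numpy as np
from scipy.stats import f as f_dist

def oneway_f_p_mse(X):
    """One-way ANOVA for ONE assessor and ONE attribute.

    X has shape (J samples, M replicates). Returns the F value
    (discrimination), the p value, and the MSE (repeatability), i.e.
    the quantities behind the F, MSE and p*MSE plots."""
    J, M = X.shape
    grand = X.mean()
    ss_between = M * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_within = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum()
    df_b, df_w = J - 1, J * (M - 1)
    mse = ss_within / df_w
    F = (ss_between / df_b) / mse
    return F, f_dist.sf(F, df_b, df_w), mse

rng = np.random.default_rng(3)
# A discriminating, repeatable assessor: clear sample differences,
# small replicate error (5 samples, 3 replicates).
good = np.array([2.0, 4.0, 6.0, 8.0, 10.0])[:, None] + rng.normal(0, 0.3, (5, 3))
# A non-discriminating assessor scoring all samples about alike.
flat = 5.0 + rng.normal(0, 0.3, (5, 3))
F1, p1, mse1 = oneway_f_p_mse(good)
F2, p2, mse2 = oneway_f_p_mse(flat)
print(F1, p1, F2, p2)
```

Note that both assessors obtain a similarly low MSE; only the joint reading of F (or p) and MSE separates the discriminating assessor from the flat scorer, exactly as argued above.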
Profile and line plots Profile plots visualise how each assessor ranks and rates the tested samples compared to the other assessors and the panel consensus for a certain attribute (see example in Fig. 11). Each line represents one assessor (sample averages across replicates) whereas the single bold line represents the panel consensus (sample averages across assessors and replicates). The tested samples are ranked along the horizontal axis according to the panel consensus from left to right with increasing intensity for that attribute.
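The consensus-based ordering just described can be sketched as a small data-preparation step; the scores below are invented for illustration.

```python
import numpy as np

def profile_plot_data(scores, sample_names):
    """Prepare profile-plot data for one attribute.

    scores has shape (I assessors, J samples) of replicate-averaged
    values. Samples are ordered by the panel consensus (mean over
    assessors), left to right with increasing intensity. Returns the
    ordered sample names, the consensus line and the per-assessor
    lines in that order."""
    consensus = scores.mean(axis=0)
    order = np.argsort(consensus)
    return ([sample_names[i] for i in order],
            consensus[order],
            scores[:, order])

names = ["A1", "A2", "B", "C1", "C2"]
scores = np.array([[3.0, 7.0, 5.0, 2.0, 8.0],    # assessor 1
                   [2.5, 6.5, 5.5, 3.0, 7.5]])   # assessor 2
ordered, cons, lines = profile_plot_data(scores, names)
print(ordered)   # ['C1', 'A1', 'B', 'A2', 'C2']
```

Plotting each row of `lines` against this ordering, with `cons` as the bold consensus line, reproduces the profile-plot layout; a crossover assessor would show up as a line running against the increasing consensus.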


The vertical axis represents the scores (average across replicates) of the particular assessor for the samples. In case of high agreement between assessors, the assessor lines follow the consensus line closely. With increasing disagreement, the line of each assessor will follow its own course and the plot will appear more cluttered. Each line plot [14] represents one sample, showing its average scores on each attribute in the form of a line connecting each attribute from left to right (see example in Fig. 8). In addition, raw data scores may be superimposed, indicating how individual assessors have scored the particular sample. The vertical line for each attribute displays the scoring range used by all assessors for that given attribute, and each symbol represents one of multiple scores provided by the panel.

Data merging and workflow strategy In this section, we describe how to prepare and merge the sensory profiling data of the 26 panels prior to import into the PanelCheck software. Furthermore, we propose a workflow that suggests how to progress with the data analysis, i.e. which plots to use first and, depending on the information found, which plots to use further on. All methods used here are integrated in the PanelCheck software and thus may be accessed easily. The only exception is the method used in PCA for investigating basic structure in data. This particular analysis, however, can easily be carried out using any multivariate statistics software package that gives access to PCA. This workflow may also be applied to single data sets from one panel. Data merging Before analysing the 26 data sets, some data pre-processing and re-arranging is necessary. There are several possibilities of how data may be merged prior to import into the PanelCheck software. Raw data The most obvious way would be to concatenate all data sets vertically, which practically would result in a single large sensory panel with 213 assessors accumulated over all the 26 panels. The dimension of this matrix would then be 3,195 × 9 (see Table 1, last row). By choosing this approach, individual information on all 213 assessors is preserved and available in the plots. In return, however, interpretation might become cumbersome and challenging, as some of the plots get crowded and unreadable with so many assessors. Given this situation, performance issues on individuals or a particular panel as a whole may be difficult to identify. With fewer panels, though, this may be a valid approach as the number of assessors also will be lower. Sample averages across assessors and replicates for each panel Another possibility is to compute consensus sample averages for each panel across assessors and replicates. By doing so, one will have available 26 new consensus data matrices of dimension J × K. For the candy data at hand, the dimension of these matrices then is 5 × 9, with J = 5 samples and K = 9 attributes. The next step would be to concatenate these consensus matrices vertically, resulting in a merged data matrix of dimension (26*5) × 9, and import it into PanelCheck. In this case, each panel is treated as if it were an individual assessor in a sensory panel consisting of 26 assessors. Unfortunately, with this approach one loses information on the repeatability and performance of individual assessors, since the sample averages were computed across assessors and replicates and information on these two factors is lost. Hence, the plots visualising repeatability performance (the F plot, MSE plot and p*MSE plot) are not available in this case. Sample averages across assessors for each panel A third alternative is to compute sample replicate averages for each panel across the assessors of that particular panel. This will lead to 26 data matrices of dimension (J*M) × K, which for the candy data at hand means (5*3) × 9, with J = 5 samples and M = 3 replicates. This is also indicated in Table 1. The resulting data matrix then is of dimension (26*15) × 9 when concatenating all 26 data matrices vertically and is ready for import into PanelCheck. In this way, each panel is again treated as if it were an individual assessor in a sensory panel consisting of 26 assessors, but this time information on repeatability is available as replicate information is preserved on the panel level.

Data merging approaches used in this study For the first part of the analysis (Global analysis of all 26 panels), where all 26 panels are investigated, the approach described in Sample averages across assessors for each panel was chosen, since it provides a performance overview over all panels and at the same time preserves information on replicates on a panel level. This approach can be seen as a middle way between the two approaches described above in Raw data and Sample averages across assessors and replicates for each panel. It is a valuable approach when a large amount of data with many panels is given, as in this study.


For the second part of the analysis (Local analysis of panels P05, P17 and P25), focus is turned to only three of the 26 panels and the individual assessors that belong to them. These three panels (P05, P17 and P25) were identified as differing somewhat from the other panels on a number of attributes, as visualised with Tucker-1 plots (see Results). Given this situation where only three panels are to be analysed in detail, the data setup as described in Raw data is an appropriate approach. With the amount of raw data greatly reduced (down to three panels from 26), information on individual assessors will be more readable in the plots. Workflow strategy The proposed workflow strategy (Fig. 1) is by no means a hard rule that represents the perfect general approach to analysing and visualising all types of sensory profiling data. It should rather be seen as a guide or path that one may follow when analysing a new data set, and which may be left at any time in the data analysis process. Since each data set may have its own unique characteristics, it may require a unique approach and a different order of methods and plots to be used for analysis. In the proposed workflow, a good starting point could be a (either two-way or three-way) ANOVA to identify significant attributes at the 5% significance level, i.e. P < 0.05. Non-significant attributes close to significance may also be considered, since only a few noisy assessors might be enough to make the attribute switch from significant to non-significant. Attributes which are far from being significant (say P values of 0.1 and above) may be disregarded based on the high likelihood that differences between the tested samples are not present. Preferably, this cut-off limit should be chosen by the panel leader, who has full knowledge about the tested products and knows how well the assessors of his/her sensory panel usually perform. For the next step, one may consult Tucker-1 and Manhattan plots.
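The significance screening step just described might be sketched as below; the attribute names, p values and the 0.1 "disregard" cut-off are illustrative, following the text's suggestion rather than any PanelCheck API.

```python
def screen_attributes(p_values, alpha=0.05, disregard=0.10):
    """Split attributes by the product-effect p value from the ANOVA:
    keep significant ones, flag near-significant ones for a second
    look, and disregard clearly non-significant ones."""
    keep = [a for a, p in p_values.items() if p < alpha]
    borderline = [a for a, p in p_values.items() if alpha <= p < disregard]
    dropped = [a for a, p in p_values.items() if p >= disregard]
    return keep, borderline, dropped

# Hypothetical p values for three attributes.
keep, maybe, out = screen_attributes(
    {"sweet taste": 0.001, "hardness": 0.06, "transparency": 0.4})
print(keep, maybe, out)
```

In practice, the borderline list is what the panel leader would inspect assessor by assessor before deciding whether a few noisy assessors are masking a real product effect.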
Tucker-1 correlation loadings plots as implemented in the PanelCheck software are based on replicate averages, i.e. they do not contain information on repeatability. They do, however, provide some quick diagnostics that may be confirmed with other plots especially suited to visualising that particular kind of problem. Depending on how the assessors are distributed over the plots, one may identify possible disagreement in sample ranking, poor sample discrimination ability, or crossover effects caused by turning the intensity scale upside down. Manhattan plots may be used as a screening tool to identify deviating performances based on the patterns found in the plots. The next plots suggested are those based on one-way ANOVA carried out on the individual data matrices Xi of each individual. Those plots are the p*MSE plot, F plot and MSE plot. If an assessor lies, e.g. close to the centre of the

Tucker-1 correlation loadings plot, the reason for this is often poor discrimination ability of that particular assessor compared to others closer to the outer ellipse. This may be confirmed by the p*MSE plot or F plot. If poor sample discrimination cannot be confirmed by either one-way ANOVA plot, another likely scenario might be ranking disagreement. In this case, the problematic assessor does not agree with the underlying structure found by Tucker-1 in the first two PCs. This particular assessor might discriminate well between the samples, however not in the same way as the panel consensus. Therefore, such an assessor may show systematic variation in PC3 or higher. This may be confirmed by profile plots. If none of the plots mentioned above allows for a conclusion, one might want to consult line plots for visualising the raw data of every sample. Studying details of the raw data might help to reveal issues that are not caught by other plots. With the help of the workflow, one may analyse one attribute at a time and finish the analysis when performance on all attributes has been evaluated.
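As a hedged pandas sketch of the merging strategy chosen in Data merging approaches used in this study (sample averages across assessors for each panel, with replicates preserved): the long-format layout, column names and single attribute below are illustrative assumptions, not the project's actual files.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format raw data: one row per panel, assessor,
# sample and replicate, with one column per attribute (here: "sweet").
rng = np.random.default_rng(4)
rows = []
for panel in ["P01", "P02"]:
    for assessor in range(1, 4):
        for sample in ["A1", "A2", "B", "C1", "C2"]:
            for rep in range(1, 4):
                rows.append({"panel": panel, "assessor": assessor,
                             "sample": sample, "replicate": rep,
                             "sweet": rng.uniform(0, 15)})
raw = pd.DataFrame(rows)

# Average across assessors but keep replicates, so each panel becomes
# one "assessor" of the merged super-panel and repeatability
# information is preserved on the panel level.
merged = (raw.groupby(["panel", "sample", "replicate"], as_index=False)
             .mean(numeric_only=True)
             .drop(columns="assessor"))
print(merged.shape)   # 2 panels * 5 samples * 3 replicates = 30 rows
```

For the study's 26 panels this yields the (26*15) × 9 matrix described above, ready for import into PanelCheck; dropping "replicate" from the groupby keys instead would give the consensus-average variant that loses repeatability information.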

Results In this section, we will first investigate the sensory profiling data of all 26 participating panels (Global analysis of all 26 panels) before we go further into detail by looking into the performance of only a few selected panels (Local analysis of panels P05, P17 and P25) that vary somewhat from most other panels. Global analysis of all 26 panels Two-way ANOVA Following the workflow shown in Fig. 1, a two-way ANOVA (details in Mixed model ANOVA for assessing the importance of attributes) was computed first. The results are shown in Fig. 2. All attributes were significant with P < 0.001, hence all attributes were kept for further analysis. PCA for investigating basic structure in data The purpose of this analysis step is to get a quick and general overview over how the data is structured and to identify panels that may differ greatly in regard to how they perceive differences between the tested samples. This is done by applying PCA on the merged data set as described in Sample averages across assessors and replicates for each panel. The results are reported in Fig. 3a–c, showing the explained variance, scores and loadings, respectively. Figure 3a shows that the first two principal components

123

504 Fig. 1 Proposed workow for the analysis of assessor and panel performance

Eur Food Res Technol (2010) 230:497511

explain 93% of the total variability contained in the dataset. Figure 3b shows how the samples are distributed in the multivariate space. Each sample is represented 78 times (3 replicates 9 26 panels) with a fairly good separation between the samples. The scores plot shows that the rst axis discriminates between samples A2, B and C1 on one side versus A1 and C2 on the other. The samples in the latter group are characterised by high intensity for the attributes sweet taste, sugar coat and to a certain extent acidic avour and raspberry avour (see loadings plot in Fig. 3c). The former group is characterised by high

intensity for attributes sticking, transparency, elasticity, hardness and biting. Along the second axis, there seems to be a split between samples A2 and B (samples on the left side of the scores plot) and A1 and C2 (samples on right side of score the plot). This tendency is strongly related to attribute acidic avour with high intensity for samples A2 and C2 and low intensities for samples A1 and B. This is in accordance with the experimental design described above (Experimental). Attribute sweet taste also seems to contribute to the split along PC2 although not in the same degree as acidic avour. Moreover, there


Fig. 2 Product effect in the two-way ANOVA model based on 26 panels. All attributes are significant with P < 0.001 and are included in further analysis
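The product effect reported in Fig. 2 comes from a two-way mixed model ANOVA in which samples are a fixed effect and assessors (here, panels) a random effect, so the product effect is tested against the sample-by-assessor interaction. Below is a minimal sketch for the balanced case; the function name is our own and this is not PanelCheck's implementation.

```python
import numpy as np
from scipy.stats import f

def product_effect(Y):
    """Balanced two-way mixed model ANOVA for one attribute.
    Y has shape (samples, assessors, replicates).  Samples are a
    fixed effect and assessors random, so the product (sample)
    effect is tested against the sample x assessor interaction
    mean square."""
    I, J, K = Y.shape
    grand = Y.mean()
    mi = Y.mean(axis=(1, 2))                 # sample means
    mj = Y.mean(axis=(0, 2))                 # assessor means
    mij = Y.mean(axis=2)                     # sample x assessor cell means
    ss_sample = J * K * ((mi - grand) ** 2).sum()
    ss_inter = K * ((mij - mi[:, None] - mj[None, :] + grand) ** 2).sum()
    ms_sample = ss_sample / (I - 1)
    ms_inter = ss_inter / ((I - 1) * (J - 1))
    F = ms_sample / ms_inter
    p = f.sf(F, I - 1, (I - 1) * (J - 1))    # upper-tail F probability
    return F, p
```

An attribute is kept for further analysis when its P value falls below the chosen significance level (here P < 0.001 for all nine attributes).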

seems to be no clear coherence with the sugar content in the samples, as one would expect from the experimental design. Nonetheless, the scores plot shows that the panels are in good agreement regarding how the samples differ from each other, except for one evaluation of sample C2 and one evaluation of sample A1. Other than that, there are no anomalies to be detected, which rules out severe differences between any of the panels.

Tucker-1 and Manhattan plots of all panels

For the next step, Tucker-1 correlation loadings plots (Fig. 4) are used to identify attributes with potential performance issues. For the data at hand, nine identical plots are given (for nine attributes), with one attribute highlighted at a time. By screening through the plots, one can see that the overall performance across the 26 panels can be considered very good for most of the attributes. A very large part of the variation in the data is explained using only PC1 and PC2. The total amount of variance explained by PC1 and PC2 is 98.6%, with PC1 and PC2 explaining 92.6 and 6.0%, respectively. From previous experience, we can report that this number is very high compared to other data sets, despite the high number of 234 variables (26 panels × 9 attributes). One important reason for this is that much of the noise was eliminated by averaging sample scores over assessors. The plots show that none of the panels lies within the inner ellipse for any attribute, meaning that all of them have more than 50% systematic variation of the variation explained by PC1 and PC2. For all attributes

Fig. 3 a Explained variances from PCA on the data described in Sample averages across assessors for each panel. The upper (full) and lower (dashed) lines visualise the calibrated and validated explained variance, respectively. b PCA on the data described in Sample averages across assessors for each panel. The scores plot visualises how the 26 panels discriminated between the five tested samples. c PCA on the data described in Sample averages across assessors for each panel. The loadings plot shows how the attributes contributed to the variation in the merged data set


Fig. 4 Nine identical Tucker-1 plots, with each plot highlighting one of the nine attributes used in the profiling. There is some variation between the panels for the attributes acidic flavour, sweet taste and raspberry flavour

except acidic flavour, sweet taste and raspberry flavour, the 26 panels show very good agreement, as they are well clustered at the outer ellipse. For the three attributes mentioned above, there is some disagreement, since the panels are more spread out along the outer ellipse. Acidic flavour and sweet taste are the only attributes contributing to systematic variation in PC2. Furthermore, it is obvious that panel P01 disagrees with the other panels on the attribute sticking, since it is located on the opposite side of the plot from the other panels. From previous experience, it is known that such a situation is caused by turning the scale upside down, i.e. confusing high and low intensity. This assumption is confirmed by the profile plot for the attribute sticking, as shown in Fig. 5: panel P01 seems to have confused high and low intensity for the tested samples. Moreover, one may observe in Fig. 4 that panel P19 shows less systematic variation for the attribute elasticity than the other panels. A profile plot of the attribute elasticity (not shown) reveals that panel P19 ranks the samples identically to the consensus; however, its intensity differences between the samples deviate somewhat from those of the consensus. This is why panel P19 lies in the same direction as the remaining panels in the Tucker-1 correlation loadings plots, but does not align as well with the other panels. After screening through the Tucker-1 plots, one may consult Manhattan plots (Fig. 6) for comparison of the systematic variation for a specific attribute across all

Fig. 5 Profile plot of the attribute sticking. Panel P01 clearly stands out from the other panels because of opposite scoring of high and low intensity for the tested samples


panels. The Manhattan plots confirm what was shown in the Tucker-1 plots. The attributes acidic flavour, sweet taste and raspberry flavour need two or more principal components to reach a high level of explained variance. For the remaining attributes, all panels reach a high percentage of explained variance after only one principal component. The only exception is the attribute elasticity, where one can easily see that panel P19 differs from the other panels. The lone dark bar indicates that panel P19 has less systematic variance for this attribute than the other panels and needs three to four principal components before its explained variance is comparable with that of the other panels. For this attribute, all panels have an explained variance higher than or very close to 99% using only PC1, except panel P19 with only 62% after PC1. After PC3, the cumulative explained variance of 90% for panel P19 is still somewhat lower than those of the other panels. With 4 PCs, panel P19 reaches 100% explained variance.
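The numbers behind one column of a Manhattan plot can be sketched as follows: a PCA is run on one panel's own data, and for each number of components the cumulative explained variance of every attribute is computed from the rank-k reconstruction. This is an illustrative helper under our own naming, not PanelCheck's code.

```python
import numpy as np

def manhattan_values(X, max_pcs=4):
    """Per-attribute cumulative explained variance (in %) from a PCA
    on one panel's data X (rows = samples, columns = attributes).
    Row k-1 of the result holds the values plotted at k PCs in a
    Manhattan plot."""
    Xc = X - X.mean(axis=0)                  # column-centre before PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total = (Xc ** 2).sum(axis=0)            # per-attribute total variation
    rows = []
    for k in range(1, max_pcs + 1):
        Xhat = (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k reconstruction
        resid = ((Xc - Xhat) ** 2).sum(axis=0)
        rows.append(100.0 * (1.0 - resid / total))
    return np.vstack(rows)                   # shape (max_pcs, n_attributes)
```

A dark cell (low percentage after the first PC, as for panel P19 on elasticity) signals an attribute whose systematic variation is spread over several components.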

p*MSE, MSE and F plots based on one-way ANOVA

The p*MSE plots are not presented here, since sample discrimination is highly significant for all attributes across all panels. Of the 234 given P values (26 panels × 9 attributes), the highest was P = 0.037. In Fig. 7a and b, the F and MSE plots are presented, respectively. As can be seen, some panels have much higher F values than others, even though all of them are significant at the 1% level. The horizontal lines indicating the F values at the 1 and 5% significance levels cannot be seen here, since some F values are extremely high; both lines therefore fall onto the horizontal axis, as their corresponding F values are extremely low compared to the highest F values in the plot. When investigating the panels' discrimination ability, one can see, for instance, that panel P11 has relatively low F values compared to those of panel P20. At the same time, the MSE values (Fig. 7b) of panel P11 are relatively high. This indicates that panel P11 is somewhat

Fig. 6 Nine Manhattan plots, one for each attribute, visualising systematic variation from individual PCAs on the data of each panel. Vertical axes represent the number of PCs used and the corresponding cumulative explained variance. Horizontal axes represent the respective sensory panels. Black corresponds to 0% explained variance, whereas white corresponds to 100% explained variance


less precise and has a lower capability of detecting differences. Panel P20, on the other hand, has relatively low MSE values (good repeatability) combined with relatively high F values (good sample discrimination), indicating much better performance than panel P11. Panel P21 is an example where high F values are achieved, but coupled with high MSE values. In other words, this panel discriminates well between the tested samples, but less precisely so than panel P20. In terms of performance, panel P21 may be ranked between panel P20 (good) and panel P11 (not as good). Still, panel P11 shows acceptable performance, since its F values are all significant at the 1% level. Note that the F and MSE plots provide no information on sample ranking differences, so these two plots alone are not sufficient for a complete evaluation of panel performance. It should be mentioned that both plots could also be sorted by attribute to check which of the attributes have the lowest/highest variance and the best ability to distinguish between samples.
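The F and MSE values discussed above come from a one-way ANOVA across samples, computed separately per panel (or assessor) and attribute. A minimal sketch, with our own function name, might look as follows; the F value measures sample discrimination, while the pooled within-sample MSE measures lack of repeatability.

```python
import numpy as np
from scipy.stats import f_oneway

def f_and_mse(groups):
    """One-way ANOVA for one panel and one attribute.  `groups` is a
    list of arrays, one per sample, holding the replicate scores.
    Returns the F value (discrimination), its P value, and the pooled
    within-sample MSE (repeatability)."""
    F, p = f_oneway(*groups)                 # F test across sample means
    n = sum(len(g) for g in groups)
    sse = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    mse = sse / (n - len(groups))            # pooled error variance
    return F, p, mse
```

A panel like P20 would show high F combined with low MSE; a panel like P11 the opposite pattern.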

Line plots

Figure 8 shows line plots of the five tested samples. The plots highlight that for every sample and attribute there is a varying degree of variability across the panels (vertical lines indicating the spread of the scores). This variability could be due to, for instance, local differences in calibration. This is particularly true for attribute 9 (transparency). For attribute 5 (sticking), however, there seems to be a higher degree of agreement among the panels.

Local analysis of panels P05, P17 and P25

After studying all the panel averages across assessors (based on data as described in Sample averages across assessors for each panel), the data of panels P05, P17 and P25 were analysed in more detail. These three panels were picked over others because they differ from each other for the attributes acidic flavour, sweet taste and raspberry flavour, and are spread somewhat in terms of location in the Tucker-1 plots of these three attributes. Since we are now focusing on only three panels and wish to analyse in more detail why they differ somewhat for the attributes mentioned above, we will use their raw data from here on. To do so, their raw data need to be merged as described in Raw data before being imported into PanelCheck. When merging the raw data of panels P05, P17 and P25, the resulting data matrix is of dimension (435 × 9), with panel P05 contributing 150 rows (10 assessors × 5 samples × 3 replicates), panel P17 contributing 135 rows (9 assessors × 5 samples × 3 replicates) and panel P25 contributing 150 rows (10 assessors × 5 samples × 3 replicates). See Table 1 for details on panel sizes. This new data set in practice represents one new large panel consisting of 29 individuals (10 + 9 + 10 assessors). By using the same methods as before, the performance of individuals belonging to each of these three panels can now be visualised.
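The merging step described above amounts to stacking the raw data of the three panels row-wise while keeping assessor identities unique across panels. The sketch below assumes hypothetical column names ('assessor', plus attribute columns); the actual PanelCheck import format may differ.

```python
import pandas as pd

def merge_panel_raw_data(panel_frames):
    """Stack the raw data of several panels into one matrix so that
    their assessors can be analysed as one large panel.
    `panel_frames` maps a panel id to a data frame holding one row
    per assessor/sample/replicate.  Assessor codes are prefixed with
    the panel id (e.g. 'P05-1') to keep them unique across panels."""
    parts = []
    for panel_id, df in panel_frames.items():
        df = df.copy()
        df["assessor"] = panel_id + "-" + df["assessor"].astype(str)
        parts.append(df)
    return pd.concat(parts, ignore_index=True)
```

For panels of 10, 9 and 10 assessors evaluating 5 samples in 3 replicates, this yields the 435-row matrix described in the text.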
Mixed model ANOVA

The mixed model ANOVA again reports that all attributes are significant at the level P < 0.001, meaning that this new panel consisting of 29 individuals discriminated well between the samples (plot not shown). Again, all attributes were kept for further analysis.

Fig. 7 a F plots visualising the panels' ability to discriminate between the tested samples for each attribute. Panel P21 discriminates less between the samples than, for example, panel P20. The horizontal lines indicating F values at significance levels 1 and 5% are not visible, as they are very low and therefore fall onto the horizontal axis. b MSE plot visualising the repeatability of each panel. Panel P21 obviously has a weaker performance regarding repeatability than, for example, panel P06

Tucker-1 plots

Tucker-1 plots based on the raw data from the three selected panels (Fig. 9) confirm what was spotted in the Tucker-1 plots above (Fig. 4) based on the data from all 26 panels. There is substantial disagreement across assessors


Fig. 8 Five line plots, where each plot represents the data of one sample. Vertical axes represent intensity scores. Horizontal axes represent the nine sensory attributes

Fig. 9 Nine identical Tucker-1 plots, with each plot highlighting one of the nine attributes used in the profiling. The plots are based on the raw data of panels P05, P17 and P25

for the attributes acidic flavour, sweet taste and especially raspberry flavour, indicating that further improvement in agreement across assessors is possible. Although the assessors are somewhat scattered over the correlation loadings plots, most of them have high explained variances for the first two PCs. This indicates that the majority discriminates well between the samples, but that there might be disagreement on sample ranking for these attributes. Studying the three correlation loadings plots in detail confirms this by revealing that the assessors of each panel tend to form clusters of their own within the plot. For the remaining six attributes (transparency, sugar coat, biting, hardness, elasticity and sticking) overall agreement is very good. These results were confirmed by the Manhattan plots (not shown).

p*MSE plots

As opposed to the situation above, where all 26 panels showed high significance (p*MSE, MSE and F plots based on one-way ANOVA), in this case the p*MSE plot (Fig. 10) makes a valuable contribution to understanding individual differences. It can be seen that for the attribute raspberry flavour panel P17, and to a certain extent panel P05, is less capable of detecting differences between the samples (larger P values) than panel P25. Moreover, random noise is generally larger for panels P05 and P17. This indicates that the individuals of panel P25, and therefore panel P25 as a group, perform much better than panels P05 and P17.

Fig. 10 p*MSE plot for the attribute raspberry flavour for panels P05, P17 and P25

Profile plots for panels

Profile plots (Fig. 11) show that the disagreement in evaluating the samples is strongest for the attributes acidic flavour, sweet taste and particularly raspberry flavour, as already observed in the Tucker-1 plots in Fig. 9. For the attributes transparency, sugar coat, biting, hardness and elasticity the profiles are very similar for most of the assessors, with very few exceptions. For the attribute sticking, three assessors of panel P17 (individuals P17-1, P17-4 and P17-9) generally rated the samples with the highest intensity (B, C1 and A2) lower than the remaining assessors.
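The data behind a profile plot can be sketched as follows: samples are ordered by their consensus intensity (mean over assessors and replicates), and each assessor's replicate-averaged profile is drawn in that order. The function name and array layout are our own illustrative assumptions.

```python
import numpy as np

def profile_plot_data(scores):
    """Profile plot data for one attribute.  `scores` has shape
    (assessors, samples, replicates).  Samples are ordered by
    increasing consensus intensity; each row of `profiles` is one
    assessor's replicate-averaged line in that sample order."""
    consensus = scores.mean(axis=(0, 2))      # consensus mean per sample
    order = np.argsort(consensus)             # sample order on the x-axis
    profiles = scores.mean(axis=2)[:, order]  # one line per assessor
    return order, consensus[order], profiles
```

An assessor whose line crosses the monotonically increasing consensus line ranks the samples differently from the panel, as seen for the deviating assessors of panel P17.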

Fig. 11 Nine profile plots, one for each attribute, visualising sample intensity and rankings for each assessor in panels P05, P17 and P25. Vertical axes represent sample intensity scores. Horizontal axes represent the five tested samples, sorted by intensity based on the consensus. The circle highlights three deviating assessors belonging to panel P17 (assessors P17-1, P17-4 and P17-9)


Summary and discussion

In this paper, we have shown how to extract critical information on panel performance from a proficiency test. In the example described here, 26 sensory panels tested a set of 5 candy samples, produced according to an experimental design, in 3 replicates using 9 attributes. Since the panels varied in size, with 3 assessors at the least and 15 at the most, the size of the data set from each panel varied accordingly. We demonstrated how to arrange the large amount of data prior to analysis and which methods to use in the analysis process. For this, we proposed a general workflow that may serve as a guide through the data analysis process, but which is not forced upon the user. For the data at hand, performance analysis was first carried out at a global level, based on data from all 26 panels, with each panel treated as if it were an individual assessor. This means that rather than visualising the performance of individuals, it is the performance of each panel as a whole, compared to the other panels, that is visualised. As a result of this process, three of the 26 panels were identified for further analysis at a more detailed local level. This included performance visualisation of the individual assessors from each of these three panels. In both cases, the same methods were applied to gather information on performance: mixed model ANOVA, Tucker-1 plots, Manhattan plots, one-way ANOVA based F plots, MSE plots, p*MSE plots, profile plots and line plots. The reason for using multiple plots is that each plot contains unique information on panel and assessor performance. Their joint information content provides a more complete performance overview of individual assessors and their sensory panel (local level) or of sensory panels compared with each other (global level). Performance information from such an analysis can then be used by panel leaders as feedback to improve overall panel performance and the performance of individual assessors.

Acknowledgments Thanks to Rikke Lazarotti at LEAF Denmark for production of the wine gum samples and for providing access to the sensory profiling data. We would like to thank the Research Council of Norway (project number 168152/110), The Foundation for Research Levy on Agricultural Products (Norway) and The Danish Food Industry Agency for project funding.

References
1. Brockhoff P, Skovgaard I (1994) Modelling individual differences between assessors in a sensory evaluation. Food Qual Prefer 5:215–224
2. Næs T (1990) Handling individual differences between assessors in sensory profiling. Food Qual Prefer 2:187–199
3. Dahl T, Tomic O, Wold JP, Næs T (2008) Some new tools for visualizing multi-way sensory data. Food Qual Prefer 19:103–113
4. Lê S, Pagès J, Husson F (2008) Methodology for the comparison of sensory profiles provided by several panels: application to a cross-cultural study. Food Qual Prefer 19:179–184
5. Tomic O, Nilsen A, Martens M, Næs T (2007) Visualization of sensory profiling data for performance monitoring. LWT-Food Sci Technol 40:262–269
6. Thompson M, Wood R (1993) The international harmonised protocol for the proficiency testing of (chemical) analytical laboratories. Pure Appl Chem 65:2123–2144
7. McEwan JA (1999) Comparison of sensory panels: a ring trial. Food Qual Prefer 10:161–171
8. Hunter EA, McEwan JA (1998) Evaluation of an international ring trial for sensory profiling of hard cheese. Food Qual Prefer 9:343–354
9. Pagès J, Husson F (2001) Inter-laboratory comparison of sensory profiles: methodology and results. Food Qual Prefer 12:297–309
10. PanelCheck software (2006) Nofima Mat AS, Ås, Norway. http://www.panelcheck.com
11. Næs T, Langsrud Ø (1998) Fixed or random assessors in sensory profiling? Food Qual Prefer 9:145–152
12. Tucker LR (1964) The extension of factor analysis to three-dimensional matrices. In: Frederiksen N, Gulliksen H (eds) Contributions to mathematical psychology. Holt, Rinehart & Winston, New York
13. Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31:279–311
14. Næs T, Solheim R (1991) Detection and interpretation of variation within and between assessors in sensory profiling. J Sens Stud 6:159–177
15. Lea P, Rødbotten M, Næs T (1995) Measuring validity in sensory analysis. Food Qual Prefer 6:321–326

