
Structure of a Data Analysis Report

A data analysis report is somewhat different from other types of professional writing that you may have done or seen, or will learn about in the future. It is related to, but not the same as:

- A typical psych/social science paper organized around intro/methods/analysis/results/discussion sections.
- A research article in an academic journal.
- An essay.
- A lab report in a science class.

The overall structure of a data analysis report is simple:

1. Introduction
2. Body
3. Conclusion(s)/Discussion
4. Appendix/Appendices

The data analysis report is written for several different audiences at the same time.

Primary audience: a primary collaborator or client. Reads the Introduction and perhaps the Conclusion to find out what you did and what your conclusions were, and then perhaps fishes/skims through the Body, stopping only for some additional details on the parts that he/she thought were interesting or eye-catching. Organize the paper around an agenda for a conversation you want to have with this person about what you have learned about their data: e.g., from most general to most specific, or from most important to least important, etc. Provide the main evidence from your analysis (tabular, graphical, or otherwise) in the Body to support each point or conclusion you reach, but save

more detailed evidence, and other ancillary material, for the Appendix.

Secondary audience: an executive. Probably only skims the Introduction and perhaps the Conclusion to find out what you did and what your conclusions are. Leave signposts in the Introduction, Body, and Conclusion to make it easy for this person to swoop in, find the headlines of your work and conclusions, and swoop back out.

Secondary audience: a technical supervisor. Reads the Body and then examines the Appendix for quality control: How good a job did you do in (raising and) answering the interesting questions? How efficient were you? Did you reach reasonable conclusions by defensible statistical methods? Make specific cross-references between the Body and specific parts of the Appendix so that this person can easily find supporting and ancillary material related to each main analysis you report in the Body. Add text to the technical material in the Appendix so that this person sees how and why you carried out the more detailed work shown in the Appendix.

The data analysis report has two very important features:

- It is organized in a way that makes it easy for different audiences to skim/fish through it to find the topics and the level of detail that are of interest to them.
- The writing is as invisible/unremarkable as possible, so that the content of the analysis is what the reader remembers, not distracting quirks or tics in the writing. Examples of distractions include: extra sentences; overly formal or flowery prose, or at the other extreme overly casual or overly

brief prose; grammatical and spelling errors; placing the data analysis in too broad or too narrow a context for the questions of interest to your primary audience; focusing on process rather than reporting procedures and outcomes; and getting bogged down in technical details, rather than presenting what is necessary to properly understand your conclusions on substantive questions of interest to the primary audience.

It is less important to worry about the latter two items in the Appendix, which is expected to be more detailed and process-oriented. However, there should be enough text annotating the technical material in the Appendix so that the reader can see how and why you carried out the more detailed work shown there.

The data analysis report isn't quite like a research paper or term paper in a class, nor like a research article in a journal. It is meant, primarily, to start an organized conversation between you and your client/collaborator. In that sense it is a kind of internal communication, sort of like an extended memo. On the other hand it also has an external life, informing a boss or supervisor what you've been doing.

Now let's consider the basic outline of the data analysis report in more detail:

1. Introduction. Good features for the Introduction include:

- Summary of the study and data, as well as any relevant substantive context, background, or framing issues.
- The big questions answered by your data analyses, and summaries of your conclusions about

these questions.
- A brief outline of the remainder of the paper.

The above is a pretty good order in which to present this material as well.

2. Body. The body can be organized in several ways. Here are two that often work well:

Traditional. Divide the body up into several sections at the same level as the Introduction, with names like: Data, Methods, Analysis, Results. This format is very familiar to those who have written psych research papers. It often works well for a data analysis paper as well, though one problem with it is that the Methods section often sounds like a bit of a stretch: in a psych research paper the Methods section describes what you did to get your data, whereas in a data analysis paper you should describe the analyses that you performed. Without the results as well, this can sound pretty sterile, so I often merge these methods pieces into the Analysis section when I write.

Question-oriented. In this format there is a single Body section, usually called Analysis, and then there is a subsection for each question raised in the introduction, usually taken in the same order as in the introduction (general to specific, decreasing order of importance, etc.). Within each subsection, the statistical method, analyses, and conclusions would be described (for

each question). For example:

2. Analysis
   2.1 Success Rate
       Methods
       Analysis
       Conclusions
   2.2 Time to Relapse
       Methods
       Analysis
       Conclusions
   2.3 Effect of Gender
       Methods
       Analysis
       Conclusions
   2.4 Hospital Effects
       Methods
       Analysis
       Conclusions
   Etc.

Other organizational formats are possible too. Whatever the format, it is useful to provide one or two well-chosen tables or graphs per question in the body of the report, for two reasons: first, graphical and tabular displays can convey your points more efficiently than words; and second, your skimming audiences will be more likely to have their eye caught by an interesting graph or table than by running

text. However, too much graphical/tabular material will break up the flow of the text and become distracting, so extras should be moved to the Appendix.

3. Conclusion(s)/Discussion. The conclusion should reprise the questions and conclusions of the introduction, perhaps augmented by some additional observations or details gleaned from the analysis section. New questions, future work, etc., can also be raised here.

4. Appendix/Appendices. One or more appendices are the place to lay out details and ancillary materials. These might include such items as:

- Technical descriptions of (unusual) statistical procedures
- Detailed tables or computer output
- Figures that were not central to the arguments presented in the body of the report
- Computer code used to obtain results

In all cases, and especially in the case of computer code, it is a good idea to add some text sentences as comments or annotations, to make it easier for the uninitiated reader to follow what you are doing. It is often difficult to find the right balance between what to put in the appendix and what to put in the body of the paper. Generally you should put just enough in the body to make the point, and refer the reader to specific sections or page numbers in the appendix for additional graphs, tables, and other details.

2.2.5 Quality Control

What is the purpose of this element? There is potential variability in any sample collection, analysis, or measurement activity, with field variability generally contributing more than laboratory variability. In an environmental monitoring project, total study error can be divided into between-sampling-unit variability (influenced by sampling design error and inherent spatial variability) and

within-sampling-unit variability (due to small-scale within-unit variability, and variability due to sampling, analytical, and data manipulations). This section lists those checks that can be performed to estimate that variability. For a more detailed discussion of sampling unit variability, review EPA's Guidance for Choosing a Sampling Design for Environmental Data Collection (QA/G-5S, 2002a).

Suggested Content for Quality Control:
- List of QC activities needed for sampling, analytical, or measurement techniques, along with their frequency
- Description of control limits for each QC activity and corrective actions when these are exceeded
- Identification of any applicable statistics

What information should be included in this element? QC activities are those technical activities routinely performed, not to eliminate or minimize errors, but to measure or estimate their effect. The actual QC data needs are based on the decision to be made and the data quality specifications for the project. Here you should list all the checks you are going to follow to assess/demonstrate reliability and confidence in your information. For example, contamination occurs when the analyte of interest, or another compound, is introduced through any one of several project activities or sources, such as contaminated equipment, containers, and reagents. Blanks are clean samples used to measure the sources of contamination at different collection and measurement stages.

Bias is systematic error. A variety of QC samples can be used to determine the degree of bias, such as analysis of samples with a known concentration of the contaminant of concern. These are known as standards, matrix spike samples, and matrix-specific QC samples. For example, calibration drift is a nonrandom change in a measurement system over time and is often detectable by periodic remeasurement of calibration check standards or samples.

Imprecision is random error, observed as different results from repeated measurements of the same or identical samples. Replicate samples and split samples are commonly used to gauge the level of precision in the measurement or collection system. For example, a sample split in the field and sent to two different laboratories can be used to detect interlaboratory precision. A sample split in a laboratory and then analyzed separately can indicate analytical precision, while a sample repetitively measured with one instrument can determine instrumental precision.

For each measurement activity, identify those QC checks that will be followed in this project, and indicate at what frequency each will occur. This can include items such as field collocated, duplicate, and matrix spike samples, and laboratory duplicate, matrix spike, and control samples. The QA Project Plan may identify and describe the documentation procedures for QC activities such as:

- One in ten field samples, or one per batch, will be a replicate sample, with a batch being defined as twenty or fewer samples per preparation test method;
- The spike compound will be analyzed at a concentration of five to seven times the suspected concentration level;

- A proficiency test (PT) sample will be evaluated once per quarter.

When you identify the QC activity control limits (described in Section 2.2.4), tell what is to be done when these are exceeded. For example, what will happen when the blank sample comes out positive for the contaminant of concern? Cited methods usually do not provide this information, or it may be insufficient for the needs of your project. State how the effectiveness of control actions will be determined and documented. For example, if the senior taxonomist determines that the junior taxonomist has misidentified x% of macroinvertebrate samples, retraining may be specified until accuracy, i.e., correct identification, has improved, and the retraining is recorded in the project files. For these QC samples, also identify the procedures, formulae, or references for calculating applicable statistics, such as estimates of sample bias and precision.

It is useful to summarize QC activities according to whether they occur in the field or the laboratory. This way, appropriate personnel can quickly identify the QC samples that apply to their activities. Tables D-10 and D-11 in Appendix D are examples of tables that can be used to record sampling and analytical QC activities. Remember that QC activities vary considerably between environmental monitoring programs and between different agencies. They do incur a cost to the project, which should be included during project planning by management and/or decision makers. In other words, the x% of samples to be analyzed as blanks should be considered an inherent part of the analytical process, not an expendable add-on.
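The bias and precision statistics mentioned above have simple textbook forms. The following sketch is illustrative only and is not part of the EPA guidance; the function names and the example concentrations are invented for demonstration. It computes percent recovery from a matrix spike (a common bias estimate) and the relative percent difference (RPD) between duplicate samples (a common precision estimate).

```python
def percent_recovery(spiked_result, unspiked_result, spike_added):
    """Bias estimate: recovery of a known spike, as a percentage."""
    return 100.0 * (spiked_result - unspiked_result) / spike_added

def relative_percent_difference(a, b):
    """Precision estimate: RPD between duplicate measurements."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)

# Hypothetical example: a 10 ug/L spike is added to a sample measuring
# 2.0 ug/L, and the spiked sample then measures 11.5 ug/L.
recovery = percent_recovery(11.5, 2.0, 10.0)   # 95.0 % recovery

# Hypothetical duplicate field samples measuring 4.0 and 5.0 ug/L.
rpd = relative_percent_difference(4.0, 5.0)
```

Control limits from the QA Project Plan (e.g., recovery within 75-125%) would then be compared against these values to trigger the corrective actions described above.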
When Structures Are Used
OPC Toolbox software uses structures to return data from an OPC server, for the following operations:

- Synchronous read operations, executed using the read function.
- Asynchronous read operations, executed using the readasync function.
- Data change events generated by the OPC server for all active, subscribed groups, or through a refresh function call.
- Retrieving logged data in structure format from memory, using the getdata or peekdata functions.

In all cases, the structure of the returned data is the same. This section describes that structure, and how you can use the structure data to understand OPC operations.


Example: Performing a Read Operation on Multiple Items

To demonstrate how to use structure formatted data, the following example reads values from three items on the Matrikon OPC Simulation Server.

Step 1: Create OPC Toolbox Group Objects

This example creates a hierarchy of OPC Toolbox objects for the Matrikon Simulation Server. To run this example on your system, you must have the Matrikon Simulation Server installed. Alternatively, you can replace the values used in the creation of the objects with values for a server you can access.

da = opcda('localhost','Matrikon.OPC.Simulation.1');
connect(da);
grp = addgroup(da,'StructExample');
itm1 = additem(grp,'Random.Real8');
itm2 = additem(grp,'Saw-toothed Waves.UInt2');
itm3 = additem(grp,'Random.Boolean');
Step 2: Read Data

This example reads values first from the device and then from the server cache. The data is returned in structure format.

r1 = read(grp, 'device');
r2 = read(grp);

Step 3: Interpret the Data

The data is returned in structure format. To interpret the data, you must extract the relevant information from the structures. In this example, you compare the Value, Quality, and TimeStamp fields to confirm that they are the same for both read operations.

disp({r1.ItemID;r1.Value;r2.Value})
disp({r1.ItemID;r1.Quality;r2.Quality})
disp({r1.ItemID;r1.TimeStamp;r2.TimeStamp})

Step 4: Read More Data

By reading first from the cache and then from the device, you can compare the returned data to see if any change has occurred. In this case, the data will not be the same.

r3 = read(grp);
r4 = read(grp, 'device');
disp({r3.ItemID;r3.Value;r4.Value})

Step 5: Clean Up

Always remove toolbox objects from memory, and the variables that reference them, when you no longer need them.

disconnect(da)
delete(da)
clear da grp itm1 itm2 itm3


Interpreting Structure Formatted Data

All data returned by the read, opcread, and getdata functions, and included in the data change and read async event structures passed to callback functions, has the same underlying format. The format is best explained by starting with the output from the read function, which provides the basic building block of structure formatted data.

Structure Formatted Data for a Single Item

When you execute the read function with a single daitem object, the following structure is returned.

rSingle = read(itm1)

rSingle =

       ItemID: 'Random.Real8'
        Value: 1.0440e+004
      Quality: 'Good: Non-specific'
    TimeStamp: [2004 3 10 14 46 9.5310]
        Error: ''
All structure formatted data for an item will contain the ItemID, Value, Quality, and TimeStamp fields.

Note: The Error field in this example is specific to the read function, and is used to indicate any error message the server generated for that item.

Structure Formatted Data for Multiple Items

If you execute the read function with a group object containing more than one item, a structure array is returned.

rGroup = read(grp)

rGroup =

3x1 struct array with fields:
    ItemID
    Value
    Quality
    TimeStamp
    Error
In this case, the structure array contains one element for each item that was read. The ItemID field in each element identifies the item associated with that element of the structure array.

Note: When you perform asynchronous read operations, and for data change events, the order of the items in the structure array is determined by the OPC server. The order may not be the same as the order of the items passed to the read function.

Structure Formatted Data for Events

Event structures contain information specifically about the event, as well as the data associated with that event. The following example displays the contents of a read async event.

cleareventlog(da);
tid = readasync(itm1);
% Wait for the read async event to occur
pause(1);
event = get(da, 'EventLog')

event =

    Type: 'ReadAsync'
    Data: [1x1 struct]

The Data field of the event structure contains


ans =

    LocalEventTime: [2004 3 11 10 59 57.6710]
           TransID: 4
         GroupName: 'StructExample'
             Items: [1x1 struct]

The Items field of the Data structure contains


ans =

       ItemID: 'Random.Real8'
        Value: 9.7471e+003
      Quality: 'Good: Non-specific'
    TimeStamp: [2004 3 11 10 59 57.6710]
From the example, you can see that the event structure embeds the structure formatted data in the Items field of the Data structure associated with the event. Additional fields of the Data structure provide information on the event, such as the source of the event, the time the event was received by the toolbox, and the transaction ID of that event.

Structure Formatted Data for a Logging Task

OPC Toolbox software logs data to memory and/or disk using the data change event. When you return structure formatted data for a logging task using the opcread or getdata function, the returned structure array contains the data change event information arranged in a structure array. Each element of the structure array contains a record, or data change event. The structure array has the LocalEventTime and Items fields from the data change event. The Items field is in turn a structure array containing the fields ItemID, Value, Quality, and TimeStamp.

When to Use Structure Formatted Data

For read and read async operations, and for data change events, you must use structure formatted data. However, for a logging task, you have the option of retrieving the data in structure format, or in numeric or cell array format. For a logging task, you should use structure formatted data when you are interested in:

- The "raw" event information returned by the OPC server. The raw information may help in diagnosing the OPC server configuration or the client configuration. For example, if you see a data value that does not change frequently, yet you know that the device should be changing frequently, you can examine the structure formatted data to determine when the OPC server notifies clients of a change in Value, Quality, and/or TimeStamp.

- Timing information rather than time series data. If you need to track when an operator changed the state of a switch, structure formatted data provides you with event-based data rather than time series data.

For other tasks that involve time series data, such as visualization of the data, analysis, modeling, and optimization operations, you should consider using the cell or numeric array output format for getdata and opcread. For more information on array formats, see Understanding Array Formatted Data.

Converting Structure Formatted Data to Array Format

If you retrieve data from memory or disk in structure format, you can convert the resulting structure into array format using the opcstruct2array function. You pass the structure array to the function, and it returns the ItemID, Value, Quality, TimeStamp, and EventTime information contained in that structure array. The opcstruct2array function is particularly useful when you want to visualize or analyze time series data without removing it from memory. Because peekdata only returns structure arrays (due to speed considerations), you can use opcstruct2array to convert the contents of the structure data into separate arrays for visualization and analysis purposes.
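Outside MATLAB, the same record-to-array conversion is easy to sketch. The following Python fragment is only an analogy to what opcstruct2array does, not part of OPC Toolbox; the dictionary field names mirror the structure fields described above, and the sample records are invented.

```python
def struct2arrays(records):
    """Collect the ItemID, Value, Quality, and TimeStamp fields from a
    list of per-item records (dicts) into parallel lists, one entry per
    record, in the spirit of MATLAB's opcstruct2array."""
    item_ids = [r["ItemID"] for r in records]
    values = [r["Value"] for r in records]
    qualities = [r["Quality"] for r in records]
    timestamps = [r["TimeStamp"] for r in records]
    return item_ids, values, qualities, timestamps

# Hypothetical records shaped like the read output shown earlier.
recs = [
    {"ItemID": "Random.Real8", "Value": 10440.0,
     "Quality": "Good: Non-specific",
     "TimeStamp": (2004, 3, 10, 14, 46, 9.531)},
    {"ItemID": "Random.Boolean", "Value": 1,
     "Quality": "Good: Non-specific",
     "TimeStamp": (2004, 3, 10, 14, 46, 9.531)},
]
ids, vals, quals, times = struct2arrays(recs)
```

Once separated, the parallel value and timestamp arrays are directly usable for plotting or time series analysis, which is exactly why the toolbox offers this conversion.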


Beltrano M.C., Perini L.
Ministry of Agriculture and Forestry - Central Office for Crop Ecology
Via del Caravita 7/A, Rome, Italy
phone: +3906695311; fax: +390669531215; cbeltrano@ucea.it, lperini@ucea.it

ABSTRACT

The Central Office for Crop Ecology (UCEA) is the Italian governmental body that acts as the National Agrometeorological Service and Research Centre for Agrometeorology. The activities of UCEA began in the last century (1876), but in the specific field of meteorological monitoring, the national agrometeorological network of automatic weather stations (RAN) was developed only in the last fifteen years. At present the RAN is composed of 31 automatic weather stations, integrated with the National Meteorological Service network and connected with UCEA's elaboration centre. In the near future the number of automatic stations will be increased to obtain better monitoring. Data quality is ensured through computerised checking procedures and programmed maintenance of sensors. The aim of this work is to present the operational experience gained by UCEA in data quality control. The activities are split into two principal levels. The first is articulated in the following steps: detection of outlier measures, detection of measures outside the climatic range, detection of impossible measures, detection of out-of-limit measures. The second is articulated in the following steps: check of time persistence, check of spatial homogeneity, check of consistency among correlated variables. Every value is stored with a validation code which allows the final user to identify data correctness.


The Central Office for Crop Ecology (UCEA), as an operative institute of the Italian Ministry of Agriculture and Forestry, performs its activity as the national agrometeorological Service, dealing with environment and agriculture monitoring. UCEA holds and manages the national agrometeorological network, which at the moment includes 31 automatic weather stations installed in significant agricultural sites. In the near future the number of weather stations will be progressively increased in order to improve the agrometeorological monitoring. Since a meteorological Service is responsible for the data it acquires, it is very important to respect several essential requirements in order to obtain reliability and comparability among meteorological measurements and a good quality of data. For this reason it is generally necessary:

- To respect the rules for weather station installation and meteorological observations: rules are provided by WMO and should be observed to allow the comparability of measures obtained through several stations or Services. The accuracy required of a measure is linked to its applications; the WMO has provided several Reference Tables which specify the requirements of meteorological measurements in the different fields of activity.
- To carry out maintenance and fast repair of device failures: regular maintenance involves the activities arranged to ensure the correct working of the station, and it is scheduled every six months. Each sensor is replaced every year. Additional maintenance (on request) is applied to solve unexpected hardware and software problems.
- To record detailed metadata.
- To check data quality.

The data quality control adopted by UCEA follows its own criteria because there is still no conventional validation process established at national or international level. Data obtained from the agrometeorological stations are validated through an automatic procedure which consists of several checks (range, outliers, consistency, congruence, persistence). Such a control panel makes it possible to rapidly identify wrong data and to organize appropriate actions to reduce each measuring error in real time. The central acquisition system provides a daily validation report which lists all the problems, identified by appropriate codes, recorded during the process of data acquisition and transmission. This process verifies in real time, through automatic algorithms, the correctness of acquired data as well as an ensemble of daily report data (e.g., minimum and maximum values, total values). The procedure also allows checking that the weather stations are working correctly. After this preliminary check, the whole dataset is further validated before being collected in the main agrometeorological database (BDAN). Wrong or suspicious data are generally not corrected at this step, but they are described by a code (flag); flag 0 is associated with correct data. A monthly manual check is also carried out to confirm the previous validation results before final data recording. Generally these last checks are carried out by comparing several graphs in order to identify anomalous trends.


Acquired data have a format that includes the following information: station code, parameter code, date, measure (value), validation index (= NULL).

First level of data control

The first level of data control includes three kinds of checks:

- Range check: measures are compared with a range between extreme values. This check recognizes only wrong values. For example, temperature is compared with the specific monthly variation connected to the latitude and altitude of the weather station.
- Comparison with expected values: measures are compared with expected calculated values. For example, global radiation is compared hourly and daily with astronomical radiation (at the limit of the atmosphere).
- Congruence test: measures of correlated parameters are compared to verify reciprocal coherence (e.g., minimum and maximum temperature vs. hourly temperature; cloudiness vs. rainfall, etc.).

After the first-level control, the initial flag (= NULL) is assigned a value: 0 for correct data, or a nonzero value for suspicious or wrong data, according to a specific codification. In case of wrong data the check is stopped and the data status is B (= Blocked). In case of suspicious data the check continues and the data status is A (= Alarm). The weather station dataloggers are queried hourly through automatic procedures, and a preliminary database is built from the downloaded data. Starting from that dataset, we have the following chronological steps:

- To check impossible values: incorrect data are marked with flag = 11
- To check outlier values (lower than the sensor range): incorrect data are marked with flag = 12
- To check outlier values (higher than the sensor range): incorrect data are marked with flag = 13
- To check wrong observation hour: incorrect data are marked with flag = 14

- To check impossible observation time: incorrect data are marked with flag = 16
- To check impossible observations (e.g., nocturnal sunshine): incorrect data are marked with flag = 17
- To check values lower than the calculated measure: incorrect data are marked with flag = 18
- To check values higher than the calculated measure: incorrect data are marked with flag = 19

- Values lower than the climatic threshold: incorrect data are marked with flag = 34
- Values higher than the climatic threshold: incorrect data are marked with flag = 35

Data which have not passed the physical consistency checks are considered wrong data, and they are associated with the pertinent validation index to distinguish them from other data (block index = B). Data which have not passed the climatological consistency checks are considered suspicious data; they are associated with the pertinent validation index and will be re-tested during the second-level validation (alarm index = A). Data which have passed the above checks are ready for the second-level validation.
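The first-level flagging logic described above can be sketched as a small function. The flag numbers and the B/A status codes follow the text; the sensor and climatic limits used in the example are invented placeholders, and UCEA's actual procedures (which run in its central acquisition system) are not reproduced here.

```python
def first_level_check(value, sensor_min, sensor_max, clim_min, clim_max):
    """Return (flag, status) for one measurement.
    Status 'B' (blocked) marks wrong data, 'A' (alarm) marks suspicious
    data that continues to the second-level checks; flag 0 with status
    'OK' means the value passed the first-level checks."""
    if value is None:
        return 11, "B"     # missing/non-numeric reading, treated here
                           # as an "impossible value" (a simplification)
    if value < sensor_min:
        return 12, "B"     # outlier below the sensor range
    if value > sensor_max:
        return 13, "B"     # outlier above the sensor range
    if value < clim_min:
        return 34, "A"     # below the climatic threshold: suspicious
    if value > clim_max:
        return 35, "A"     # above the climatic threshold: suspicious
    return 0, "OK"

# Hypothetical air-temperature limits (degrees C) for one station/month.
r1 = first_level_check(-60.0, -40.0, 60.0, -15.0, 38.0)  # (12, 'B')
r2 = first_level_check(41.0, -40.0, 60.0, -15.0, 38.0)   # (35, 'A')
```

Blocked values would stop here, while alarmed values (flags 34 and 35) proceed to the second-level validation, as described in the text.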

Second level of data control

The second level of data control is composed of daily checks to verify the congruence between data measured at different times of the day and the daily ensemble data (total, minimum and maximum). The procedures also carry out tests of time and spatial correlation, and they examine all the data that correctly passed the first-level validation together with all the data that did not pass the climatic range check (flags 34 and 35).

Time persistence control: the temporal evolution of the meteorological parameters is verified by comparing time-contiguous measures. The persistence range limit depends on the kind of data; the maximum number of fixed values and of consecutive missing data allowed in the test is shown in the following table:


Parameter                    Condition               Max fixed values   Max consecutive missing
Wet leaf                     between 0 and 60        5                  0
Wet leaf                     = 0                     60                 4
Wet leaf                     = 60                    12                 4
Wind direction (2 m; 10 m)   for wind > 0,5          5                  0
Atmospheric pressure                                 10                 2
Rainfall                     between 0 and 0,2       5                  0
Rainfall                     = 0                     3600               10
Rainfall                     = 0,2                   20                 2
Global radiation             0                       5                  0
Global radiation             = 0                     20                 2
Air temperature              all values              5                  0
Soil temperature             all values              10                 0
Relative humidity            100                     5                  0
Relative humidity            = 100                   12                 4
Wind speed (2 m; 10 m)       0                       5                  0
Wind speed (2 m; 10 m)       = 0                     24                 2

We have the following steps:

- To check fixed values: suspicious or wrong data are marked with flag = 41.

- To check instantaneous values lower than the hourly minimum: suspicious or wrong data are marked with flag = 42.
- To check instantaneous values higher than the hourly maximum: suspicious or wrong data are marked with flag = 43.
- To check anomalous gradients: suspicious or wrong data are marked with flag = 44.

- To check values incongruent with the previous data: suspicious or wrong data are marked with flag = 45.
- To check instantaneous values lower than the daily minimum: suspicious or wrong data are marked with flag = 46.
- To check instantaneous values higher than the daily maximum: suspicious or wrong data are marked with flag = 47.
- To check anomalous persistence under the minimum threshold: suspicious or wrong data are marked with flag = 48.
- To check anomalous persistence over the maximum threshold: suspicious or wrong data are marked with flag = 49.
- To check lack of coherence among associated variables: suspicious or wrong data are marked with flag = 50. The procedures compare different data (e.g., relative humidity, rainfall and wet leaf) simultaneously measured by the same station.
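As a sketch of the fixed-value test (flag 41), the following function counts the longest run of identical consecutive readings and compares it with a per-parameter limit such as those in the persistence table above. The code and its data representation are assumptions made for illustration; UCEA's actual procedures are not published as code.

```python
def check_fixed_values(series, max_fixed):
    """Flag 41 sketch: return True (suspicious or wrong) when the series
    contains a run of identical consecutive values longer than the
    max_fixed limit for that parameter."""
    run = 1
    longest = 1
    for prev, cur in zip(series, series[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest > max_fixed

# Hypothetical hourly relative-humidity readings stuck at 100%:
# the table allows at most 12 fixed values for humidity = 100.
stuck = [100] * 15
flagged = check_fixed_values(stuck, max_fixed=12)      # True: run of 15
ok = check_fixed_values([55, 56, 58, 57, 59], max_fixed=5)  # False
```

A real implementation would also apply the condition column of the table (e.g., testing only readings at 100%) and the separate limit on consecutive missing data.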

The coherence check among associated variables (flag 50) is performed through the following cross-tabulation data sheet:

Humidity    Rainfall                   Wet leaf    Congruence
< 50 %      > 0,4 mm in 6 surveys      60 min.     No
> 95 %      > 0,4 mm in 6 surveys      0 min.      No
< 40 %      = 0 mm in 6 surveys        60 min.     No

The data status for each positive verification will be B (blocked). Only for data with flag = 45 (value incongruent with the previous data) will the data status be A (alarm); those data continue through the control.
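The cross-tabulation above can be expressed as a small congruence function. The thresholds are taken from the table; the data representation (six-survey rainfall totals, wet-leaf minutes per hour) and the function itself are assumptions made for this sketch.

```python
def congruence_check(humidity_pct, rain_mm_6surveys, wet_leaf_min):
    """Flag 50 sketch: return True when the combination of humidity,
    rainfall, and wet-leaf readings matches one of the incongruent rows
    of the cross-tabulation (i.e., the measurements are not coherent)."""
    if humidity_pct < 50 and rain_mm_6surveys > 0.4 and wet_leaf_min >= 60:
        return True   # rain and a fully wet leaf, but dry air
    if humidity_pct > 95 and rain_mm_6surveys > 0.4 and wet_leaf_min == 0:
        return True   # rain and saturated air, but a dry leaf
    if humidity_pct < 40 and rain_mm_6surveys == 0 and wet_leaf_min >= 60:
        return True   # fully wet leaf, but no rain and dry air
    return False

# A fully wet leaf with no rain and 35% humidity is incongruent.
bad = congruence_check(35, 0.0, 60)        # True
# Rain, saturated air, and a wet leaf are mutually coherent.
good = congruence_check(98, 1.2, 60)       # False
```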

Spatial homogeneity control: for parameters with a clear spatial correlation (e.g., temperature), the procedures select data measured at the same time by a group of adjacent stations (at least 5 stations), with the maximum distance among them within a radius of about 100 km and belonging to the same altitude range (the maximum difference of altitude admitted is about 400 metres). Wrong data are marked with flag = 62.
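A minimal sketch of the spatial homogeneity test (flag 62) follows. The neighbour-selection criteria (at least 5 stations, within about 100 km and 400 m of altitude) come from the text, and the neighbour values are assumed to have been pre-selected accordingly; the median comparison and the tolerance are invented placeholders for whatever statistic UCEA actually uses.

```python
import statistics

def spatial_check(value, neighbour_values, tolerance):
    """Flag 62 sketch: compare a station's reading with the median of
    simultaneous readings from neighbouring stations (already filtered
    for distance and altitude). Returns True (wrong data) when the
    deviation exceeds the tolerance; the test needs at least 5
    neighbours to be applicable."""
    if len(neighbour_values) < 5:
        return False  # too few neighbours: test not applicable
    return abs(value - statistics.median(neighbour_values)) > tolerance

# Hypothetical air temperatures (degrees C) from 5 nearby stations.
neighbours = [14.2, 14.8, 15.1, 15.4, 16.0]
anomalous = spatial_check(25.0, neighbours, tolerance=5.0)   # True
normal = spatial_check(14.0, neighbours, tolerance=5.0)      # False
```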

Optional checks (wind direction): the procedure checks the persistence of wind direction within a particular direction sector, and it identifies possible sensor failures or obstacles around the weather station. Suspicious or wrong data are marked with flag = 63. Data which have passed all the quality checks are marked with flag = 0.

Third level of data control

The main goal of the third-level control is to check for anomalous trends or systematic errors not automatically identified. Those controls are performed through visual analysis of graphs of the following data: air temperature at 5 cm, 50 cm and 2 m; barometric pressure; wind speed at 2 m and 10 m; air humidity at 50 cm and 2 m; sunshine and global radiation.

CONCLUSIONS

At the end of data validation, all agrometeorological data, correct (with flag = 0) or wrong/suspicious (with flag ≠ 0), are collected in the database, disseminated through the Internet, and made available at www.ucea.it. The flag associated with each value allows users to discriminate the quality of the data and to choose which data should be included in their elaborations.