
XML

AUDIENCE FOR THIS PRESENTATION


This presentation is intended for new users of XML

INDEX
- Introduction to XML
- XML Parsers

INTRODUCTION TO XML

The objective of this session is to understand:
- The need for XML
- Elements of XML

WHAT IS XML
- XML stands for eXtensible Markup Language.
- XML is a subset of SGML (Standard Generalized Markup Language).
- XML was designed to describe data. It is usually associated with a grammar / contract (a DTD or schema) that documents must conform to.
- XML is a W3C Recommendation.

Please Note: XML was not designed to DO anything. XML was created to structure, store and to send information.

WHY EXTENSIBLE
- XML is a cross-platform, software- and hardware-independent tool for transmitting information.
- XML tags are not predefined. You must define your own tags.
- Examples of languages defined with XML: SVG (Scalable Vector Graphics), RSS (Really Simple Syndication / Rich Site Summary), XHTML, etc.

DIFFERENCE BETWEEN HTML & XML


XML was designed to carry data. XML is not a replacement for HTML. XML and HTML were designed with different goals: XML was designed to describe data and to focus on what data is. HTML was designed to display data and to focus on how data looks. HTML is about displaying information, while XML is about describing information.

XML SYNTAX RULES


An XML document with correct syntax is Well Formed XML, and XML validated against a DTD/Schema is Valid XML. Some of the syntax rules are:
- XML documents use a self-describing and simple syntax
- XML documents can have only one root element
- Every start-tag must have a matching end-tag
- Tags can't overlap (XML elements must be properly nested)
- Element names must obey XML naming conventions
- XML tags are case-sensitive
- XML will keep white space in your text
- XML attribute values must be quoted
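
The rules above can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are made up for illustration. Note that it only checks well-formedness, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples, one per syntax rule.
samples = {
    "properly nested": "<note><to>Ann</to></note>",
    "overlapping tags": "<b><i>text</b></i>",
    "missing end tag": "<note><to>Ann</note>",
    "case mismatch": "<Note></note>",
    "unquoted attribute": "<note id=1/>",
    "two root elements": "<a/><b/>",
}

for label, doc in samples.items():
    print(f"{label}: {is_well_formed(doc)}")
```

Only the first sample is accepted; each of the others breaks exactly one rule from the list.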

XML ELEMENTS
- Names can start with a letter or an underscore, but not with numbers or other punctuation characters.
- After the first character, numbers are allowed, as are the characters - and .
- Names can't contain spaces.
- Names can't contain the : character (exception: namespaces).
- Names can't start with the letters xml, in upper, lower or mixed case.
- There can't be a space after the opening < character.
- Empty elements are elements that have no data.
- There should not be any space between / and > in empty elements.
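
As a rough illustration of the naming rules, here is a small sketch of a name checker. It is a simplification: it only considers ASCII characters, whereas real XML names may also use many non-ASCII letters, and the function name is_valid_name is hypothetical.

```python
import re

# Simplified, ASCII-only approximation of the naming rules:
# first a letter or underscore, then letters, digits, '_', '-' or '.'.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_valid_name(name: str) -> bool:
    """Check an element name against the simplified rules."""
    if name.lower().startswith("xml"):
        return False  # names starting with "xml" are reserved
    return _NAME.fullmatch(name) is not None


for candidate in ["book", "first-name", "chapter2.1", "2nd", "first name", "xmlData", "ns:tag"]:
    print(candidate, is_valid_name(candidate))
```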

XML ELEMENTS CONTINUED


An XML element is everything from (including) the element's start tag to (including) the element's end tag. Elements can have different content types. An element can have element content, mixed content, simple content, or empty content. An element can also have attributes. XML Elements have relationships. Elements are related to each other as parents and children.
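
The content types and the parent/child relationship can be seen by walking a small document. This is a sketch using Python's xml.etree.ElementTree; the order document and the content_type helper are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = """<order>
  <customer id="c1"><name>Ann</name></customer>
  <note>Deliver <b>before</b> noon</note>
  <total>42.50</total>
  <gift/>
</order>"""

root = ET.fromstring(doc)


def content_type(elem):
    """Classify an element as element, mixed, simple or empty content."""
    has_children = len(elem) > 0
    # Text can sit before the first child (elem.text) or after any child (child.tail).
    texts = [elem.text] + [child.tail for child in elem]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "simple"
    return "empty"


# <order> is the parent; iterating over it yields its children in document order.
kinds = {child.tag: content_type(child) for child in root}
print(kinds)
```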

ATTRIBUTES

Attributes are used to provide additional information about elements. Attribute values must always be enclosed in quotes, but either single or double quotes can be used.

Some limitations of attributes compared with child elements:
- Attributes cannot contain multiple values (child elements can)
- Attributes are not easily expandable (for future changes)
- Attributes cannot describe structures (child elements can)
- Attributes are more difficult to manipulate by program code
- Attribute values are not easy to test against a Document Type Definition (DTD), which is used to define the legal elements of an XML document

Please Note: Metadata (data about data) should be stored as attributes, and the data itself should be stored as elements.
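
A short sketch of these points, again with Python's xml.etree.ElementTree. The lang attribute and the author values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are both accepted around attribute values.
book = ET.fromstring("""<book isbn='0475589067' lang="en"><title>XML and Java</title></book>""")
isbn = book.get("isbn")
lang = book.get("lang")

# An attribute name may appear only once on an element, so it cannot hold several values.
try:
    ET.fromstring('<book author="A" author="B"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

# Child elements can simply be repeated.
multi = ET.fromstring("<book><author>A</author><author>B</author></book>")
authors = [a.text for a in multi.findall("author")]

print(isbn, lang, duplicate_rejected, authors)
```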

XML STRUCTURE
The first line in the document - the XML declaration - defines the XML version and the character encoding used in the document. The next line can be either the root element or a DTD/Schema definition. If you have a DTD/schema definition, then the next non-blank line after the definition is considered the root element. Following is an example of an XML document:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <library>
      <book isbn="0475589067">
        <title>XML and Java</title>
        <author>Hiroshi Maruyama</author>
        <pages>386</pages>
        <softcover/>
      </book>
    </library>
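
To show how an application might read such a document, here is a minimal sketch that parses the library example with Python's xml.etree.ElementTree. The variable names and the summary dictionary are illustrative choices, not part of the presentation.

```python
import xml.etree.ElementTree as ET

# Bytes are used so that the parser honours the declared encoding.
LIBRARY_XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<library>
  <book isbn="0475589067">
    <title>XML and Java</title>
    <author>Hiroshi Maruyama</author>
    <pages>386</pages>
    <softcover/>
  </book>
</library>"""

root = ET.fromstring(LIBRARY_XML)  # root is the <library> element
book = root.find("book")

summary = {
    "isbn": book.get("isbn"),                        # attribute value
    "title": book.findtext("title"),                 # simple content
    "author": book.findtext("author"),
    "pages": int(book.findtext("pages")),
    "softcover": book.find("softcover") is not None,  # empty element used as a flag
}
print(summary)
```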

XML DECLARATION

- Starts with <?xml and ends with ?>
- The version attribute is mandatory; encoding and standalone are optional.
- The version, encoding and standalone attributes must be in that order.
- The XML declaration must be at the beginning of the file. That is, the first character in the file should be that <; no line breaks or spaces.

EX: <?xml version="1.0"?>
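
These declaration rules can be checked against a parser. The sketch below feeds a few declarations to Python's xml.etree.ElementTree (which is built on the expat parser); the parses helper and the sample labels are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


checks = {
    "valid": b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a/>',
    "leading space": b' <?xml version="1.0"?><a/>',
    "unquoted version": b'<?xml version=1.0?><a/>',
    "missing version": b'<?xml encoding="UTF-8"?><a/>',
    "wrong order": b'<?xml encoding="UTF-8" version="1.0"?><a/>',
}

for label, data in checks.items():
    print(f"{label}: {parses(data)}")
```

Only the first declaration is accepted.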

COMMENTS

- Starts with <!-- and ends with -->
- The sequence -- is not allowed inside the comment.
- The XML specification states that an XML parser doesn't need to pass comments on to the application, meaning that you should never count on being able to use the information inside a comment from your application.
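
As an illustration, Python's xml.etree.ElementTree is one parser that, with its default settings, does not hand comments to the application. The sample documents below are made up.

```python
import xml.etree.ElementTree as ET

# The comment is skipped by the default parser: only <to> shows up as a child.
root = ET.fromstring("<note><!-- internal remark --><to>Ann</to></note>")
child_tags = [child.tag for child in root]

# A double hyphen inside a comment makes the document ill-formed.
try:
    ET.fromstring("<note><!-- a -- b --></note>")
    double_hyphen_rejected = False
except ET.ParseError:
    double_hyphen_rejected = True

print(child_tags, double_hyphen_rejected)
```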

PCDATA & CDATA

By default, all element data in an XML document is of type PCDATA (parsed character data). PCDATA doesn't allow characters like < and &. To solve this problem we can use entity references such as &lt; and &amp;. The built-in entity references are:
- &amp; &lt; &gt; &apos; &quot;

An element can also have CDATA. CDATA allows all the characters and tells the parser not to parse the text.

CDATA is declared as <![CDATA[ Text to be ignored ]]>

Ex. <note><![CDATA[Simon & Garfunkel]]></note>
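
The two approaches give the application the same text. Here is a small sketch using xml.sax.saxutils.escape and xml.etree.ElementTree from the Python standard library; the sample string is invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Simon & Garfunkel < 3"

# Option 1: escape the special characters so the text is legal PCDATA.
escaped = escape(raw)
pcdata_note = ET.fromstring(f"<note>{escaped}</note>")

# Option 2: wrap the raw text in a CDATA section.
cdata_note = ET.fromstring(f"<note><![CDATA[{raw}]]></note>")

# Unescaped text outside CDATA is rejected.
try:
    ET.fromstring(f"<note>{raw}</note>")
    raw_rejected = False
except ET.ParseError:
    raw_rejected = True

print(escaped)
print(pcdata_note.text == cdata_note.text == raw, raw_rejected)
```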

ASSOCIATING DTD WITH XML


The DOCTYPE declaration associates a DTD with an XML document instance. Each XML document can be associated with one and only one DTD, using a single DOCTYPE declaration. If the DTD is not found, a validating parser will report an error and will be unable to validate the document.

<!DOCTYPE document_element source location1 location2 [ internal subset of DTD ] >

ASSOCIATING SCHEMA WITH XML


Schemas are declared on the element they are associated with. An XML document can be associated with many schemas, but each schema must be associated with its own namespace, bound to a unique namespace prefix.

<prefix:element xmlns:prefix="namespaceURI">

If you want to associate the document with a schema file, add xsi:schemaLocation, whose value pairs the namespace URI with the location of the schema:

<prefix:element xmlns:prefix="namespaceURI" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="namespaceURI http://example.org/note.xsd">

<prefix:element xmlns:prefix="namespaceURI" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="namespaceURI file:///c:/note.xsd">

DEFAULT NAMESPACES

A default namespace works much like a regular namespace, except that you don't have to specify a prefix for all of the elements that use it.

<person xmlns="http://spmer.com/pers">

An element can carry only one default namespace declaration, and it applies to the element and all its descendants/children unless one of them declares its own default. If you wish to add another namespace in the descendants without changing the default, it must be prefixed.

<person xmlns="http://spmer.com/pers">
  <name/>
  <xhtml:p xmlns:xhtml="http://www.w3.org/1999/xhtml">This is XHTML</xhtml:p>
</person>
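
A parser resolves both default and prefixed namespaces to the same kind of qualified name. The sketch below shows how Python's xml.etree.ElementTree reports the example above; the prefix-to-URI dictionary ns is a lookup table chosen for this example only.

```python
import xml.etree.ElementTree as ET

doc = """<person xmlns="http://spmer.com/pers">
  <name/>
  <xhtml:p xmlns:xhtml="http://www.w3.org/1999/xhtml">This is XHTML</xhtml:p>
</person>"""

root = ET.fromstring(doc)

# ElementTree writes qualified names as {namespaceURI}localname.
tags = [root.tag] + [child.tag for child in root]

# Prefixes used in searches are local to this mapping, not to the document.
ns = {"pers": "http://spmer.com/pers", "xhtml": "http://www.w3.org/1999/xhtml"}
paragraph = root.find("xhtml:p", ns).text
has_name = root.find("pers:name", ns) is not None

print(tags, paragraph, has_name)
```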

DTD VS SCHEMA
1. Syntax
   DTD: EBNF + pseudo XML
   Schema: good old XML 1.0
2. DOM support
   DTD: no DOM support
   Schema: can be displayed and manipulated with DOM
3. Content models
   DTD: weak; simple sequence or choice lists, and these can't be used with mixed content
   Schema: can specify exact number of occurrences
4. Data typing
   DTD: weak; strings, name tokens, IDs, and little else
   Schema: strong; string, numeric, date/time etc.
5. Name scope
   DTD: global
   Schema: global and local names

XML PARSERS
The objective of this session is to understand:
- DOM Parsers
- SAX Parsers

XML PARSERS

Parsing basically means breaking down or dissecting a file into functional components. An XML parser is a processor that reads an XML document and determines the structure and properties of the data. It breaks the data up into parts and provides them to other components. XML parsers can be classified into 2 broad categories:
- SAX (Simple API for XML)
- DOM (Document Object Model)

SAX VS DOM
SAX
- SAX parsers process the XML document sequentially.
- SAX is fast due to its sequential reading pattern.
- No random access to the XML document. The element being parsed has no knowledge of past or future elements.
- SAX development is more complex than DOM, because you have to write a SAX handler to interpret the SAX events.
- For quick, processing-intensive work SAX is a better choice. It is also a better choice when no tree manipulation is needed.
- A SAX parser reads through the input XML document and notifies us of any interesting events.

DOM
- DOM parsers typically load the entire document into memory and store it in a tree structure, typically called the DOM tree.
- Slow compared to SAX, as it forms a tree and then traverses it.
- Can traverse randomly and access any node of the XML document.
- Easier to create and traverse.
- Parsing with DOM can become very memory hungry and can dramatically decrease the performance of your application when dealing with large documents.

WHAT IS DOM

DOM is the Document Object Model. Elements become objects, and some elements become properties of those objects. It's a tree-structure-based API which allows the user to access and manipulate the XML document. If we are writing code to deal with an order, this object model would make it easier to process that information, and would probably even include methods to provide the functionality for us.

XML -> XML Parser -> DOM -> Application
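
To make the order example concrete, here is a minimal sketch using xml.dom.minidom, a DOM implementation from the Python standard library. The order document, its id and its item values are invented for the illustration.

```python
from xml.dom import minidom

# The whole document is loaded into an in-memory tree.
dom = minidom.parseString("<order id='17'><item>pen</item><item>ink</item></order>")
order = dom.documentElement

# Random access: jump straight to the second <item>.
items = order.getElementsByTagName("item")
second = items[1].firstChild.data

# Manipulation: append a new <item> node to the tree.
new_item = dom.createElement("item")
new_item.appendChild(dom.createTextNode("paper"))
order.appendChild(new_item)

names = [node.firstChild.data for node in order.getElementsByTagName("item")]
print(order.getAttribute("id"), second, names)
```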

SAX PARSER

- SAX is an industry standard API intended for high performance XML document processing.
- SAX works on event-driven processing.
- SAX is not a W3C Recommendation. Instead, it is public domain software created by David Megginson.
- SAX was developed with Java in mind, but the latest implementations support many other languages.
- SAX is an API, providing interfaces for parsers to implement. Therefore SAX itself cannot be used to parse documents. Parsing requires SAX parsers, which are parsers conforming to the SAX specification; in other words, they implement the interfaces defined by SAX.

WORKING WITH SAX


SAX gives us access to the information in the XML document, not as a tree of nodes, but as a sequence of events. All that a SAX parser does is read through the XML document and fire events depending on the tags encountered. There are callback methods for each event, which are implemented by the user. For example:
- startDocument()
- endDocument()
- startElement()
- endElement()
- characters()
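
The callbacks listed above can be sketched with Python's xml.sax module, whose ContentHandler class uses the same method names. The EventLogger class and the sample document are made up for the example; a real handler would do application work instead of logging.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX event in the order the parser fires it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append(f"startElement:{name}")

    def endElement(self, name):
        self.events.append(f"endElement:{name}")

    def characters(self, content):
        # Ignore whitespace-only text between elements.
        if content.strip():
            self.events.append(f"characters:{content.strip()}")


handler = EventLogger()
xml.sax.parseString(b"<library><book>XML and Java</book></library>", handler)

for event in handler.events:
    print(event)
```

Nothing is kept in memory except what the handler chooses to store, which is why this style suits large documents.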

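    # Upper-case the first name and surname of every customer in customers.xml.
    use strict;
    use warnings;
    use XML::Simple;

    # forcearray => 1 makes every element an array reference, even single ones.
    my $cust_xml = XMLin("./customers.xml", forcearray => 1);

    for my $customer (@{ $cust_xml->{customer} }) {
        foreach (qw(first-name surname)) {
            $customer->{$_}->[0] = uc($customer->{$_}->[0]);
        }
    }

    # Write the modified structure back out as XML.
    print XMLout($cust_xml);
    print "\n";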

QUESTIONS
