
C# and Windows Programming
XML Processing

Contents
Markup
XML
DTDs
XML Parsers
DOM

Markup

When we write text, it is just text
For example:
John Smith
123 Main St.
Toronto
Ontario

We can all read this and understand it
A computer cannot; it needs additional information

Markup
Markup is added to documents in the form of tags
A tag consists of text delimited by angle brackets
The name of the tag identifies the tag and the kind of information it conveys

Markup

Let's add some semantic markup to our address

<address>
<name>John Smith</name>
<street>123 Main St.</street>
<city>Toronto</city>
<province>Ontario</province>
</address>

This identifies the information in the various parts of the address

Markup

You will notice
  Tags occur in pairs
    A start tag
    A matching end tag with a / before the tag name
  The text that the tags are describing is enclosed between the start tag and the end tag
  A single pair of tags is placed around the entire document
  Because every start tag has a matching, properly nested end tag, the document is well-formed
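
The well-formedness rule can be checked by asking a parser to read the text. The following is a minimal sketch, assuming the .NET XmlDocument class covered later in these slides; IsWellFormed is a hypothetical helper name, not part of .NET.

```csharp
using System;
using System.Xml;

// Top-level statements (C# 9 or later).
// Hypothetical helper: returns true when the text parses as XML.
bool IsWellFormed(string xml)
{
    try
    {
        new XmlDocument().LoadXml(xml);
        return true;
    }
    catch (XmlException)
    {
        // LoadXml throws XmlException on a parse error,
        // such as a start tag with no matching end tag.
        return false;
    }
}

bool good = IsWellFormed("<address><name>John Smith</name></address>");
bool bad = IsWellFormed("<address><name>John Smith</address>");

Console.WriteLine(good); // True
Console.WriteLine(bad);  // False
```

The second string fails because the name start tag is never closed before the address end tag.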

XML
XML is one in a long line of markup languages
It is the eXtensible Markup Language
Unlike other markup languages, it lets you define your own tags
Any meaning associated with those tags is imposed by your program

Uses of XML

SOAP
  Simple Object Access Protocol, a type of remote procedure call
Configuration files
Web services
Security information
Electronic document exchange

Defining Documents

If you can define your own tags, how do you know what should be in a document?
Document Type Definition
  This defines the allowable tags and their order
Schema
  Like a DTD, it describes the tags and their order
  It also describes the content which can be placed within the tags

XML Structure

Here is a simple XML document

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE address SYSTEM "address.dtd">
<address>
<name>John Smith</name>
<street>123 Main St.</street>
<city>Toronto</city>
<province>Ontario</province>
</address>

Attributes

A tag can also have attributes which provide additional information about the tag
<city size="large">Toronto</city>
A tag can have zero or more attributes

The XML Declaration


The first line is the optional XML declaration
It consists of
  <?xml
    Identifies this as the XML declaration
  version="1.0"
    The version of XML in the document

The XML Declaration


encoding="ISO-8859-1"
  This is the character set used in the document
  Various character sets can be used, including Unicode (UTF-8), an international character set
standalone="no"
  Determines if the document uses any external entities which are defined in other files
  This will be discussed later in the course
In general, the order of attributes is not important, but it is in the XML declaration

The DOCTYPE Declaration

The optional DOCTYPE declaration follows the XML declaration
<!DOCTYPE address SYSTEM "address.dtd">
This declaration is required only if you want to validate the document against a definition of the tags in the document

The Root Element


This is the <address> element which begins the document
It is the first element in the document
It contains all other elements in the document

Elements

An element consists of a start tag, character data, and an end tag
<name>John Smith</name>
A tag name must start with a letter or underscore
A tag name cannot contain spaces; colons are reserved for namespace prefixes
The end tag must match the start tag exactly, including case

Mixed Content

If an element contains just text, it has simple content
<name>John Smith</name>
If it contains a mix of text and elements, it is said to have mixed content
<sentence>these are nested <adverb>correctly</adverb></sentence>

Attributes

Attributes are name-value pairs which can be added to elements
Attributes allow you to provide additional information without changing the tag itself
The names for attributes follow the same rules as tag names
Every attribute name within the same tag must be unique

Attributes
<employee name="Jones">accountant</employee>
<employee name="Smith">sales</employee>

Note that these both contain a name attribute
That is OK since the attributes are in separate elements
Attribute values are placed in either single or double quotes

Comments

Comments are delimited by special brackets
<!-- a comment -->
Comments can
  Add explanations
  Temporarily disable XML which is not needed for a while

Entities

The less than and greater than signs delimit tags
What if you want to type these symbols in a document and not have them delimit a tag?
Then, enter them as entities
To enter a less than sign
  &lt;
All entities are referenced using
  &
  The entity name
  ;

Entities
Entity    Symbol    Description
&lt;      <         Less than
&gt;      >         Greater than
&amp;     &         Ampersand
&quot;    "         Double quote
&apos;    '         Apostrophe
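
When a document is built through the .NET DOM rather than typed by hand, the entity references are produced for you. A minimal sketch, assuming the XmlDocument class from later in these slides; the element name expr is made up for illustration.

```csharp
using System;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument();
XmlElement expr = doc.CreateElement("expr"); // hypothetical element name
doc.AppendChild(expr);

// Assign raw text; the DOM stores the characters exactly as typed.
expr.InnerText = "a < b && c";

string markup = expr.OuterXml; // serialized form, using entity references
string text = expr.InnerText;  // the original characters again

Console.WriteLine(markup); // <expr>a &lt; b &amp;&amp; c</expr>
Console.WriteLine(text);   // a < b && c
```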

CDATA
Sometimes using entities is not enough since you have many special characters to type
A CDATA section allows you to enter anything without having special characters interpreted
<![CDATA[ any characters here ]]>
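
A short sketch of how a CDATA section looks once parsed, assuming the .NET XmlDocument class; the code element and its content are invented for illustration.

```csharp
using System;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument();
doc.LoadXml("<code><![CDATA[if (a < b && b > c) return;]]></code>");

// The only child of <code> is the CDATA section itself.
XmlNode first = doc.DocumentElement.FirstChild;
bool isCData = first is XmlCDataSection;

// Value returns the characters exactly as written, with no entity processing.
string content = first.Value;

Console.WriteLine(isCData); // True
Console.WriteLine(content); // if (a < b && b > c) return;
```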

Document Type Definitions

The DTD is one way to describe what should be in a valid XML document
There are other ways which we will examine later in the course
A DTD
  Describes each element and the elements which can occur within it
  Describes the attributes for each element
  Describes entities which can be used in the document

Person DTD
<!-- The DTD for person -->
<!DOCTYPE person [
<!ELEMENT person (first, last, gender, employee-id) >
<!ELEMENT first (#PCDATA) >
<!ELEMENT last (#PCDATA) >
<!ELEMENT gender (#PCDATA) >
<!ELEMENT employee-id (#PCDATA) >
]>

Reading the DTD

There is an element person containing the elements
  first
  last
  gender
  employee-id
These elements are described below it
Each of these contains PCDATA, meaning parsed character data
This means that these elements contain only text, not nested tags
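
One way to see the DTD at work is to validate documents against it. The sketch below assumes the .NET XmlReader with DTD validation switched on, and it assumes the DOCTYPE name is person so that it matches the root element. Validate is a hypothetical helper, and the sample values are invented.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Top-level statements (C# 9 or later).
// The person DTD as an internal subset.
const string dtd =
    "<!DOCTYPE person [" +
    "<!ELEMENT person (first, last, gender, employee-id)>" +
    "<!ELEMENT first (#PCDATA)>" +
    "<!ELEMENT last (#PCDATA)>" +
    "<!ELEMENT gender (#PCDATA)>" +
    "<!ELEMENT employee-id (#PCDATA)>" +
    "]>";

// Hypothetical helper: returns the validation messages for one document body.
List<string> Validate(string body)
{
    var errors = new List<string>();
    var settings = new XmlReaderSettings
    {
        DtdProcessing = DtdProcessing.Parse,   // allow the DOCTYPE to be read
        ValidationType = ValidationType.DTD    // validate against it
    };
    settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

    using (XmlReader reader = XmlReader.Create(new StringReader(dtd + body), settings))
    {
        while (reader.Read()) { } // reading the document triggers validation
    }
    return errors;
}

List<string> valid = Validate(
    "<person><first>John</first><last>Smith</last>" +
    "<gender>M</gender><employee-id>42</employee-id></person>");

// gender is missing, so the content model of person is violated.
List<string> invalid = Validate(
    "<person><first>John</first><last>Smith</last>" +
    "<employee-id>42</employee-id></person>");

Console.WriteLine(valid.Count);       // 0
Console.WriteLine(invalid.Count > 0); // True
```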

XML Parsers

There are two types of XML parsers
DOM
  The Document Object Model
  This parses the document into a tree-like structure called a DOM
  The document is parsed all at once
SAX
  Simple API for XML
  This is a sequential parser which executes a callback when each part of the document is recognized
  This is good for very large documents since the entire document does not have to be in memory at once
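
.NET does not ship a SAX parser. Its closest built-in equivalent is XmlReader, a forward-only reader that your code pulls nodes from instead of receiving callbacks; like SAX, it never holds the whole document in memory. A minimal sketch of sequential reading, using a cut-down version of the address document:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Top-level statements (C# 9 or later).
string xml = "<address><name>John Smith</name><city>Toronto</city></address>";
var elementNames = new List<string>();

using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
{
    // Each Read() advances to the next node in document order.
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element)
        {
            elementNames.Add(reader.Name);
        }
    }
}

Console.WriteLine(string.Join(", ", elementNames)); // address, name, city
```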

What is DOM?

DOM is an in-memory data structure
It describes an XML document as a tree structure
The nodes in the tree are described by the interface to them
This means that there can be many implementations that implement the interface

So, how do we make a document into a tree?

<?xml version="1.0"?>
<friend>
<handle degree="close">
Harold
</handle>
</friend>

Document
  Root Element: friend
    Text: whitespace
    Element: handle
      Attribute: degree = close
      Text: Harold
    Text: whitespace
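
The tree for the friend document can be printed by walking it recursively. This is a sketch assuming the .NET XmlDocument class; Walk is a hypothetical helper. Note that XmlDocument discards whitespace-only nodes unless PreserveWhitespace is set, and reports them with the node type Whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument();
doc.PreserveWhitespace = true; // keep whitespace-only nodes in the tree
doc.LoadXml(
    "<?xml version=\"1.0\"?>\n" +
    "<friend>\n<handle degree=\"close\">\nHarold\n</handle>\n</friend>");

var lines = new List<string>();

// Hypothetical helper: records one line per node, indented by depth.
void Walk(XmlNode node, int depth)
{
    string indent = new string(' ', depth * 2);
    lines.Add(indent + node.NodeType + " " + node.Name);

    // Attributes are not child nodes; they hang off the element separately.
    if (node.Attributes != null)
    {
        foreach (XmlAttribute attr in node.Attributes)
        {
            lines.Add(indent + "  Attribute " + attr.Name + " = " + attr.Value);
        }
    }

    foreach (XmlNode child in node.ChildNodes)
    {
        Walk(child, depth + 1);
    }
}

Walk(doc, 0);
foreach (string line in lines)
{
    Console.WriteLine(line);
}
```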

Nodes
All nodes in a DOM implement the Node interface
All other interfaces in the tree extend the Node interface
This means that every node can be treated as a Node, and maybe more

XmlNode
Represents every node in the DOM
Properties
  ParentNode
  Name
  FirstChild
  NextSibling
  PreviousSibling
  Value

XmlNode

Methods
  InsertBefore()
  AppendChild()
  RemoveChild()
  Clone()
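
A brief sketch of these XmlNode properties and methods in use, on a cut-down version of the address document. It inserts a street element, then moves city to the end by cloning and removing it.

```csharp
using System;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument();
doc.LoadXml("<address><name>John Smith</name><city>Toronto</city></address>");

XmlNode root = doc.DocumentElement;
XmlNode name = root.FirstChild;   // <name>
XmlNode city = name.NextSibling;  // <city>

// Build a new element and place it in front of <city>.
XmlNode street = doc.CreateElement("street");
street.InnerText = "123 Main St.";
root.InsertBefore(street, city);

// Clone() copies the node together with its children.
XmlNode copy = city.Clone();
root.RemoveChild(city);
root.AppendChild(copy);

string result = root.OuterXml;
Console.WriteLine(result);
// <address><name>John Smith</name><street>123 Main St.</street><city>Toronto</city></address>
```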

XmlDocument

The node above the root node of the document
Can be used to represent an empty document
Properties
  DocumentElement
Methods
  CreateElement()
  CreateTextNode()
  GetElementsByTagName()
  Load()
  Save()
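
A minimal sketch of building a document from nothing with these XmlDocument methods. It saves to an in-memory StringWriter rather than a file, which is a choice made here to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument(); // starts out as an empty document

// New nodes are created by the document, then attached to the tree.
XmlElement address = doc.CreateElement("address");
doc.AppendChild(address); // becomes DocumentElement

XmlElement city = doc.CreateElement("city");
city.AppendChild(doc.CreateTextNode("Toronto"));
address.AppendChild(city);

// Save() also accepts a file name or a stream.
var writer = new StringWriter();
doc.Save(writer);
string saved = writer.ToString();

int cityCount = doc.GetElementsByTagName("city").Count;

Console.WriteLine(doc.DocumentElement.Name); // address
Console.WriteLine(cityCount);                // 1
Console.WriteLine(saved);
```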

XmlElement

This represents an element
An element can have attributes
Properties
  XmlAttributeCollection Attributes
Methods
  GetElementsByTagName()
  SetAttribute(string name, string value)
  string GetAttribute(string name)
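
A short sketch of reading and writing attributes on an XmlElement, reusing the employee example from earlier. The office and salary attribute names are invented for illustration.

```csharp
using System;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument();
doc.LoadXml("<employee name=\"Jones\">accountant</employee>");
XmlElement employee = doc.DocumentElement;

string before = employee.GetAttribute("name");    // "Jones"

employee.SetAttribute("office", "Toronto");       // adds a new attribute
employee.SetAttribute("name", "Smith");           // replaces the existing value

string after = employee.GetAttribute("name");     // "Smith"
string missing = employee.GetAttribute("salary"); // empty string, not null
int count = employee.Attributes.Count;            // 2

Console.WriteLine(before + " -> " + after); // Jones -> Smith
Console.WriteLine(count);                   // 2
```

Because GetAttribute returns an empty string for a missing attribute, HasAttribute is the way to tell a missing attribute from an empty one.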

XmlAttribute
This is an attribute
Can have either Text nodes or EntityReferences as children
Name property gets the name
Value property gets the value

XmlText
This is the node representing text
The text has no markup
Even whitespace is represented as a text node

CDATASection Interface
This is a CDATA section
It is similar to a text node but the content undergoes no interpretation

Other Node Subinterfaces


Comment
Notation
Entity
EntityReference
ProcessingInstruction

These all correspond to the XML constructs of the same names

Other Node Subinterfaces

DocumentFragment
  Part of a document tree which can be inserted into another tree
DOMImplementation
  Provides capabilities of the implementation
  Has the method for creating a document

Other Node Subinterfaces

DOMException
  Something went wrong
NodeList
  A list of nodes which has an iterator
NamedNodeMap
  A map structure holding a collection of nodes

Common .NET DOM Classes


XmlNode

XmlDocument

XmlElement

XmlText

XmlAttribute


XmlNodeList
A list of nodes
Returned by GetElementsByTagName()
Properties
  Count -- number of nodes in the list
  Indexer -- retrieves a node
Methods
  Item(int n) -- retrieves a node
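
A minimal sketch of working with an XmlNodeList. The staff wrapper element is an assumption added so that the two employee elements from the earlier slide share a single root.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument();
doc.LoadXml(
    "<staff>" +
    "<employee name=\"Jones\">accountant</employee>" +
    "<employee name=\"Smith\">sales</employee>" +
    "</staff>");

XmlNodeList employees = doc.GetElementsByTagName("employee");

int count = employees.Count;         // 2
XmlNode first = employees[0];        // indexer
XmlNode second = employees.Item(1);  // same lookup through Item()

// The list can also be enumerated directly.
var jobs = new List<string>();
foreach (XmlNode node in employees)
{
    jobs.Add(node.InnerText);
}

Console.WriteLine(count);                   // 2
Console.WriteLine(string.Join(", ", jobs)); // accountant, sales
```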

XmlNamedNodeMap

A map of nodes indexed by name
Superclass of XmlAttributeCollection
Returned by the Attributes property
Properties
  Count
Methods
  Item(int n)
  GetNamedItem(string name)
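
A short sketch of looking up attributes by name and by position through the collection returned by Attributes. The country attribute is invented here to give the map a second entry.

```csharp
using System;
using System.Xml;

// Top-level statements (C# 9 or later).
var doc = new XmlDocument();
doc.LoadXml("<city size=\"large\" country=\"Canada\">Toronto</city>");

// XmlAttributeCollection derives from XmlNamedNodeMap.
XmlAttributeCollection attrs = doc.DocumentElement.Attributes;

XmlNode size = attrs.GetNamedItem("size");          // lookup by name
XmlNode firstAttr = attrs.Item(0);                  // lookup by position
XmlNode missing = attrs.GetNamedItem("population"); // null when absent

Console.WriteLine(attrs.Count);     // 2
Console.WriteLine(size.Value);      // large
Console.WriteLine(firstAttr.Name);  // size
Console.WriteLine(missing == null); // True
```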

Examples
* see NodeLister
* see DocBuilder
