Basics of XML

In this article, we explore what XML is, its uses and advantages, its basic syntax, and how XML documents are written. Let's begin!

Table of contents:

  1. What is XML?
  2. Advantages of XML
  3. Versions of XML
  4. Getting started
  5. XML Vocabularies
  6. XML parsers
  7. Validation of XML
  8. XML Namespaces
  9. Applications of XML
  10. Alternatives to XML

What is XML?

XML stands for Extensible Markup Language. It was developed to encode data in a format that both humans and computers can read easily. The syntax of XML is very similar to that of HTML (Hypertext Markup Language). XML is used to share data between different applications and services across the Internet, and many APIs are XML-based because of its simplicity and popularity.

Advantages of XML

  • XML is beginner-friendly: data is stored as plain text with descriptive tag names rather than in a binary format, which makes XML files easy to read and write.
  • XML is extensible: users can create their own tags and attributes to describe their data as their application requires.
  • XML is portable: any application that can process XML can use the data without worrying about platform compatibility.

Versions of XML

Only two versions of XML have been released so far, and a third has been discussed but not developed:

  • XML 1.0: This is the first version of XML and remains the most widely used. It has been revised several times and is now in its fifth edition.
  • XML 1.1: This was introduced in 2004, mainly to update how XML handles newer Unicode characters in names and certain line-ending characters. It is not as popular as XML 1.0 and is only used when these changes are required.
  • XML 2.0: No XML 2.0 standard exists and no active development has been undertaken, but some proposals outline the major issues it could address.

Getting started

Here is a sample XML document.

<?xml version = "1.0" encoding = "UTF-8"?> <!-- 1 -->

<person> <!-- 2 -->
    <firstName>Kate</firstName> <!-- 3 -->
    <lastName>Bishop</lastName> <!-- 4 -->
    
</person>

In this document, we begin with the XML declaration on line 1. This declaration identifies the document as XML and states its version and character encoding. We use version 1.0, since it is the current standard for XML and is widely supported.

Then, we come to the <person> tag. XML, like HTML, uses tags enclosed in angle brackets '<' and '>'. Every XML element is written as a pair of tags, i.e. a starting tag (e.g. <person>) and an ending tag (e.g. </person>). To distinguish the ending tag from the starting tag, we add a '/' before the tag name. Between the starting tag and the ending tag, we add the information. This information can be text such as a string or a number, or it can be a set of more tags nested inside the current tag (e.g. <firstName>).

In every XML document, all tags except the declaration must be nested under a single tag called the root tag. In the sample, <person> is the root tag.

To add a comment in XML, we start it with <!--, add the comment text and then close it with -->.

All XML tags are case-sensitive.
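
To see how an application might read this document, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of Python and the variable names are assumptions made for illustration; any language with an XML parser would work in a similar way.

```python
import xml.etree.ElementTree as ET

# The sample document, stored as bytes so the encoding declaration is honoured.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<person>
    <firstName>Kate</firstName>
    <lastName>Bishop</lastName>
</person>
"""

root = ET.fromstring(SAMPLE)               # root is the <person> element
first_name = root.find("firstName").text   # text between the start and end tags
last_name = root.find("lastName").text

# Tag names are case-sensitive, so a lowercase lookup finds nothing.
missing = root.find("firstname")

print(root.tag, first_name, last_name, missing)
# prints: person Kate Bishop None
```

The lookup for "firstname" returns None because it does not match the tag <firstName> exactly.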

XML Vocabularies

Since XML allows users to define their own tags, the scope of XML documents is very wide. Some would say it is too wide, because any system that works with XML files must be able to understand the tags they contain. This leads to the creation of XML vocabularies, which are much like the vocabulary of a natural language: once you have defined the words, you can make sentences. An XML vocabulary is an agreed set of tags and attributes for a particular purpose. Building a vocabulary for a project defines which XML documents are acceptable in that project and makes it possible to write programs that process and understand them.

XML parsers

XML documents are processed by XML parsers, which check the document for syntax errors, extract the information it contains and make it available to the application. XML syntax rules include:

  • Having a starting and ending tag for every element
  • Proper nesting of elements within each other. <one><two> hello text </two></one> is valid but <one><two> hello text </one></two> is not. The tags for two must be completely enclosed in one.
  • For every element, the tags must be in the same case to be valid. <one>hi</one> is valid. <One>Hi</one> is not.
  • Existence of the root element.
  • All values added in attributes must be in quotes.

XML documents which adhere to XML syntax rules are said to be well-formed documents.
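
The rules above can be tried out with a short Python sketch. It assumes the standard xml.etree.ElementTree parser, which raises ParseError when a document is not well-formed; the helper name is_well_formed is hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

examples = {
    "properly nested": "<one><two> hello text </two></one>",
    "badly nested": "<one><two> hello text </one></two>",
    "case mismatch": "<One>Hi</one>",
    "unquoted attribute": "<one id=1>hi</one>",
    "two root elements": "<one>hi</one><two>there</two>",
}

for label, text in examples.items():
    print(label, "->", is_well_formed(text))
```

Only the first example is accepted; each of the others breaks one of the rules listed above.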

There are two main approaches to XML parsing:

  1. Document Object Model (DOM): A DOM parser loads the whole document and represents it as a tree, with elements and attributes as nodes connected by edges starting at the root. DOM is simple to use and supports both reading and modifying the document, but because the entire tree is held in memory, it is slower and consumes more memory for large documents.

  2. Simple API for XML (SAX): This is an event-driven API which processes the XML elements linearly from top to bottom, checks the document for syntax conformity and conveys the data to the calling application via event notifications. It is memory-efficient and can be used for large files, but it does not support random access to parts of the document.
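
The difference between the two approaches can be sketched with Python's standard xml.dom.minidom and xml.sax modules. This is an illustration under assumptions, not a recommendation of a specific parser; the handler class NameCollector is a hypothetical name.

```python
import xml.sax
from xml.dom import minidom

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<person>
    <firstName>Kate</firstName>
    <lastName>Bishop</lastName>
</person>
"""

# DOM: the whole document becomes a tree, so any node can be reached directly.
dom = minidom.parseString(SAMPLE)
dom_last_name = dom.getElementsByTagName("lastName")[0].firstChild.data

# SAX: the parser reports events in document order and keeps no tree.
class NameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.started = []   # element names in the order their start tags appear
        self.values = {}    # element name -> text content
        self._buffer = []

    def startElement(self, name, attrs):
        self.started.append(name)
        self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it until the end tag.
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if text:
            self.values[name] = text
        self._buffer = []

handler = NameCollector()
xml.sax.parseString(SAMPLE, handler)

print(dom_last_name)      # value read from the DOM tree
print(handler.started)    # start-tag events in document order
print(handler.values)     # text collected from the event stream
```

With DOM we jump straight to <lastName>, whereas with SAX the application has to remember whatever it needs as the events go by.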

Validation of XML

When working on applications, XML documents may be required to stick to a very specific format to be useful to the program. This may include the names of the tags, the order in which they appear in the document and their proper nesting. So, in these situations, XML documents must first be validated.

One way to do this is with a Document Type Definition (DTD). A DTD defines the structure against which an XML document can be checked for conformity. Documents which are well-formed and also obey the structure in the DTD are said to be valid XML documents.

DTD files end with the extension .dtd and are referenced from an XML file just after the XML declaration. The name that follows DOCTYPE must match the root element of the document, which is person in our sample:

<!DOCTYPE person SYSTEM "names.dtd">

DTD files can be created as follows:

<!-- names.dtd -->
<!ELEMENT person ( firstName, lastName ) >

<!ELEMENT firstName  ( #PCDATA )>
<!ELEMENT lastName  ( #PCDATA )>

This is the case when we are using an external DTD file. It is also possible to include the DTD in the XML file itself. This can be done as follows:

<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?> 

<!DOCTYPE person [
<!ELEMENT person ( firstName, lastName ) >
<!ELEMENT firstName  ( #PCDATA )>
<!ELEMENT lastName  ( #PCDATA )>
]>

<person> 
    <firstName>Kate</firstName> 
    <lastName>Bishop</lastName> 
</person>

Here, for an internal DTD, we set standalone to 'yes' in the XML declaration, conveying that this document does not need to refer to any external declarations. When the DTD is in an external file, we set standalone to 'no'.
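
To make the difference between well-formed and valid concrete, here is a rough Python sketch. Python's standard xml.etree.ElementTree module does not validate against a DTD, so the function below checks by hand the same constraints that the DTD for person expresses. The function name matches_person_dtd is hypothetical, and this is not a general DTD validator; third-party libraries such as lxml provide real DTD validation.

```python
import xml.etree.ElementTree as ET

# Hand-written mirror of: <!ELEMENT person ( firstName, lastName ) >
EXPECTED_CHILDREN = ["firstName", "lastName"]

def matches_person_dtd(text: str) -> bool:
    """Check a well-formed document against the person structure by hand."""
    root = ET.fromstring(text)
    if root.tag != "person":
        return False
    children = list(root)
    # The children must be exactly firstName then lastName, in that order.
    if [child.tag for child in children] != EXPECTED_CHILDREN:
        return False
    # person may not contain loose text between its child elements.
    stray_text = [root.text] + [child.tail for child in children]
    if any(t and t.strip() for t in stray_text):
        return False
    # #PCDATA: each child holds text only, with no nested elements.
    return all(len(child) == 0 for child in children)

valid = "<person><firstName>Kate</firstName><lastName>Bishop</lastName></person>"
swapped = "<person><lastName>Bishop</lastName><firstName>Kate</firstName></person>"
extra = valid.replace("</person>", "<age>22</age></person>")

print(matches_person_dtd(valid))    # True
print(matches_person_dtd(swapped))  # False: well-formed, but wrong order
print(matches_person_dtd(extra))    # False: <age> is not declared
```

All three documents are well-formed, but only the first one has the structure the DTD describes.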

XML Namespaces

We know by now that XML lets users define their own tags. When documents from several sources are combined, this extensibility can lead to naming collisions, where the same tag name is used with different meanings. Namespaces were developed to keep this flexibility while still letting us tell such elements apart. A namespace is a named collection of element and attribute names: a given name is unique within one namespace, but the same name can appear in several namespaces. So, if we need to list the books of school A and school B in the same file and still differentiate between them, we create a separate namespace for each school. This is done using the xmlns attribute as follows:

<?xml version = "1.0" encoding = "UTF-8"?>

<booklist 
    xmlns:SchoolA = "www.SchoolA.com"
    xmlns:SchoolB = "www.SchoolB.com">
    
<SchoolA:book>
    <SchoolA:name>All about XML</SchoolA:name>
    <SchoolA:pages>345</SchoolA:pages>
</SchoolA:book>
    
<SchoolB:book>
   <SchoolB:name>Best XML Techniques</SchoolB:name>
   <SchoolB:author>Mark ABC</SchoolB:author>
</SchoolB:book>
    
</booklist>    

Here, we declare both namespaces on the root tag using xmlns attributes, binding the prefixes SchoolA and SchoolB to two different identifiers. Since the two book elements now belong to different namespaces, the naming collision is resolved. The identifiers are conventionally written as URLs, but they do not have to point to real web pages: a parser never fetches them and only compares them as strings. As long as the text inside the "" is different, our namespaces remain distinct.
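
As a minimal sketch of how a program tells the two book elements apart, the following Python example parses the book list with the standard xml.etree.ElementTree module. The short prefixes a and b are assumptions local to this script; only the namespace strings have to match those in the document.

```python
import xml.etree.ElementTree as ET

BOOKLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<booklist
    xmlns:SchoolA="www.SchoolA.com"
    xmlns:SchoolB="www.SchoolB.com">
<SchoolA:book>
    <SchoolA:name>All about XML</SchoolA:name>
    <SchoolA:pages>345</SchoolA:pages>
</SchoolA:book>
<SchoolB:book>
    <SchoolB:name>Best XML Techniques</SchoolB:name>
    <SchoolB:author>Mark ABC</SchoolB:author>
</SchoolB:book>
</booklist>
"""

root = ET.fromstring(BOOKLIST)

# Prefixes used in searches are independent of the prefixes in the document.
ns = {"a": "www.SchoolA.com", "b": "www.SchoolB.com"}
name_a = root.find("a:book/a:name", ns).text
name_b = root.find("b:book/b:name", ns).text

# ElementTree stores each tag together with its namespace string in braces.
tags = [child.tag for child in root]

print(name_a)   # All about XML
print(name_b)   # Best XML Techniques
print(tags)     # ['{www.SchoolA.com}book', '{www.SchoolB.com}book']
```

Both elements are called book, yet the parser treats them as different names because their namespace strings differ.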

Applications of XML

XML is commonly used to implement Service-oriented Architecture (SOA). Having seen the basics of how XML works, let's look at some of its specific applications.

  • Improving Web search efficiency: Since XML describes the type of data contained in a document, it can improve the relevance of search results. Instead of just matching the search keywords against the text of every file, a search over XML data can take the context of each value into account and deliver what we are actually looking for. For example, if we're looking for books by 'Danielle Steel', matching only against an author field filters out irrelevant results that stem from the word 'Steel', giving us more accurate results.

  • Web Publishing: Web pages can be made interactive using XML, enabling users to customise them according to their needs and making websites more intuitive.

  • Electronic Data Interchange (EDI): The electronic exchange of business-related information between two businesses or trading partners using a standardized format is called Electronic Data Interchange. XML makes EDI transactions easy to implement and work with.

  • Implementing APIs: Protocols such as SOAP (Simple Object Access Protocol) and XML-RPC (XML Remote Procedure Call) use XML to enable communication between different applications. APIs built in the REST (Representational State Transfer) style can also exchange data as XML, although many of them use JSON instead.

Alternatives to XML

Some alternatives to XML are:

  • JSON: The most common alternative to XML is JSON (JavaScript Object Notation). It is a lightweight data interchange format which is easy for humans to write and easy for machines to parse.
  • YAML: YAML stands for "YAML Ain't Markup Language" (originally "Yet Another Markup Language"). It is a human-readable language for data serialization and is often used in configuration files.
  • Protocol Buffers: Protocol Buffers were developed at Google for serializing structured data. Protocol buffers were developed especially to be more lightweight and faster than XML.
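
For a quick comparison with JSON, here is a small Python sketch that converts the person document from earlier into JSON using only the standard library. The flat mapping is an assumption that works because every child element here holds plain text; nested or repeated elements would need a fuller conversion.

```python
import json
import xml.etree.ElementTree as ET

PERSON_XML = (
    "<person>"
    "<firstName>Kate</firstName>"
    "<lastName>Bishop</lastName>"
    "</person>"
)

root = ET.fromstring(PERSON_XML)

# Each child element becomes one key/value pair.
person = {child.tag: child.text for child in root}
as_json = json.dumps({root.tag: person})

print(as_json)
# prints: {"person": {"firstName": "Kate", "lastName": "Bishop"}}
```

The JSON form carries the same data with less markup, since it has no closing tags.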

Thus, through this article at OpenGenus, we have gained a solid introduction to XML. Keep learning!