Xpath tutorial for beginners

1 Answer



Xpath tutorial for beginners

As XML spread as a markup language for cross-platform data exchange, so did the need for a standard that lets non-XML-based applications run complex queries against XML documents.

Note

Extensible Markup Language (XML) is a language for representing hierarchically structured data in text format. XML is readable by both humans and computers and is used, among other things, to exchange data between two computer systems on the Internet.

With XQuery and XSLT, the W3C developed the standards that software needs to access XML documents. They give applications programming interfaces for accessing, transforming, and querying XML documents. Both depend on a standard for locating elements in XML documents, that is, a path description language: XPath.

Next, we explain the XPath Data Model (XDM) and introduce the syntax on which XPath expressions for locating XML elements are based.

Index
  1. What is XPath?
  2. How does XPath work?
  3. Node types
    1. Element node
    2. Document node
    3. Attribute node
    4. Text node
    5. Namespace node
    6. Processing instruction node
    7. Comment node
  4. Location paths
    1. Axes
    2. Node test
      1. Node name as filter criteria
      2. Node type as filter criteria
      3. Wildcard node test
      4. Abbreviated notation
    3. Predicates
      1. General predicates
      2. Numerical predicates
  5. Additional information about XML Path Language

What is XPath?

XML Path Language (XPath) is a path description language for XML documents developed by the W3C. It gives users a non-XML-based syntax for finding specific elements in an XML document.

XPath is usually embedded in a host language that processes XML. XQuery, for example, extracts information from the XML elements that XPath locates. XSLT, in turn, uses XPath when transforming XML documents.

  • XPath: navigation in XML documents
  • XQuery: queries to XML documents
  • XSLT: XML data transformation

The current version, XPath 3.1, is specified in the W3C recommendation of March 21, 2017.

Note

Despite the updates, many XSLT processors, web browsers, and applications only support the XPath 1.0 standard of 1999.

How does XPath work?

XPath is based on a data model that interprets the XML document as nodes arranged in a tree structure. This structure is comparable to that of the Document Object Model (DOM), which operates as an interface between HTML and JavaScript in the web browser.

XML elements are located with path expressions modeled on the paths of a Unix file system. The basic building blocks of these location paths are nodes, axes, node tests, and predicates.

Node types

Each component of an XML structure is called a node. How the nodes are arranged in the tree follows from the order in which they appear in the document and from the nesting of the XML elements.

The XPath data model distinguishes seven types of nodes with different functions:

  • Element node
  • Document node (as of XPath 2.0; called "root node" in XPath 1.0)
  • Attribute node
  • Text node
  • Namespace node
  • Processing instruction node
  • Comment node

Let's look at the node types of the XPath data model through an example. The following XML document is used to exchange the data of a book order and contains all seven types of nodes.

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE Order SYSTEM "order.dtd">
  <?xml-stylesheet type="text/css" href="style.css"?>
  <!--That is a comment!-->
  <order date="2019-02-01">
    <address xmlns:shipping="http://localhost/XML/delivery" xmlns:billing="http://localhost/XML/billing">
      <shipping:name>Ellen Adams</shipping:name>
      <shipping:street>123 Maple Street</shipping:street>
      <shipping:city>Mill Valley</shipping:city>
      <shipping:state>CA</shipping:state>
      <shipping:zip>10999</shipping:zip>
      <shipping:country>USA</shipping:country>
      <billing:name>Mary Adams</billing:name>
      <billing:street>8 Oak Avenue</billing:street>
      <billing:city>Old Town</billing:city>
      <billing:state>PA</billing:state>
      <billing:zip>95819</billing:zip>
      <billing:country>USA</billing:country>
    </address>
    <comment>Please use gift wrapping!</comment>
    <items>
      <book isbn="9781408845660">
        <title>Harry Potter and the Prisoner of Azkaban</title>
        <quantity>1</quantity>
        <priceus>22.94</priceus>
        <comment>Please confirm delivery date until Christmas.</comment>
      </book>
      <book isbn="9780544003415">
        <title>The Lord of the Rings</title>
        <quantity>1</quantity>
        <priceus>17.74</priceus>
      </book>
    </items>
  </order>

Element node

In the XPath tree, each element of the XML document is an element node. The XML declaration and the document type definition at the beginning of the document are not elements and are not represented as nodes.

XML declaration:

  <?xml version="1.0" encoding="utf-8"?>

Document type definition (DTD):

  <!DOCTYPE Order SYSTEM "order.dtd">

Element nodes begin with a start tag, end with an end tag, and are typically nested.

The first element in the document is called the root element.

In the previous example, the root element is the order node, which is the parent of the address, comment and items nodes, which in turn contain child elements.

Document node

The document node is the root of the tree structure. It is not visible in the XML document or represented by text; rather, it is a conceptual node that encloses everything in the document. The children of the document node are the root element and any processing instruction nodes and comment nodes that appear outside it.
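The children of the document node can be listed with a short Python sketch. It uses the standard library's xml.dom.minidom, which implements the DOM rather than the XPath data model, so treat it as an approximation; the XML is a trimmed copy of the sample order with the DOCTYPE left out, and the variable names are illustrative.

```python
from xml.dom import minidom

# Trimmed copy of the sample order (DOCTYPE omitted for brevity).
ORDER_XML = """<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/css" href="style.css"?>
<!--That is a comment!-->
<order date="2019-02-01">
  <comment>Please use gift wrapping!</comment>
</order>"""

doc = minidom.parseString(ORDER_XML.encode("utf-8"))

# Map DOM node-type constants to readable names.
NAMES = {
    minidom.Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    minidom.Node.COMMENT_NODE: "comment",
    minidom.Node.ELEMENT_NODE: "element",
}

# Children of the document node, in document order.
top_level = [NAMES.get(node.nodeType, "other") for node in doc.childNodes]
print(top_level)

# The single element child of the document node is the root element.
print(doc.documentElement.tagName)
```

The XML declaration does not show up among the children, which matches the statement that it is not a node.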

Attribute node

An attribute of an XML element is represented in the XPath data model as an attribute node. Each of these nodes consists of a name and an associated value.

In the sample code, the first book element has the isbn attribute with the value 9781408845660.

  <book isbn="9781408845660">

The attribute node belongs to its element node; however, it is not considered a child of that element.
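The distinction between attributes and children can be seen in a minimal Python sketch using the standard library's xml.etree.ElementTree. The XML is a trimmed copy of the sample order, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample order.
ORDER_XML = """<order date="2019-02-01">
  <items>
    <book isbn="9781408845660"><title>Harry Potter and the Prisoner of Azkaban</title></book>
    <book isbn="9780544003415"><title>The Lord of the Rings</title></book>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)
first_book = root.find("items/book")

# Attributes are read through a separate mapping ...
isbn = first_book.get("isbn")

# ... and do not appear when iterating over the element's children.
child_tags = [child.tag for child in first_book]

print(isbn)
print(child_tags)
```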

Text node

The character data enclosed between the tags of an element node is called a text node.

In the example, the title element node contains the text node Harry Potter and the Prisoner of Azkaban.

  Harry Potter and the Prisoner of Azkaban  

Namespace node

The element and attribute names of a well-formed XML document can be assigned to a namespace. This mapping can be set once for the whole document, for example by the DTD at the beginning of the document.

If an XML document uses elements or attributes from different namespaces, each namespace is declared explicitly with the xmlns attribute or with an xmlns:prefix attribute in the start tag of the element concerned. The value of this attribute must be a Uniform Resource Identifier (URI) that identifies the namespace. A prefix declared this way applies to the element itself and to all of its descendants. In the tree structure, each namespace corresponds to a namespace node.

In the sample code, two different namespaces are declared on the address element: xmlns:shipping and xmlns:billing. The children of the address element carry the corresponding prefix.

  <address xmlns:shipping="http://localhost/XML/delivery" xmlns:billing="http://localhost/XML/billing">
    <shipping:name>Ellen Adams</shipping:name>
    <shipping:street>123 Maple Street</shipping:street>
    <shipping:city>Mill Valley</shipping:city>
    <shipping:state>CA</shipping:state>
    <shipping:zip>10999</shipping:zip>
    <shipping:country>USA</shipping:country>
    <billing:name>Mary Adams</billing:name>
    <billing:street>8 Oak Avenue</billing:street>
    <billing:city>Old Town</billing:city>
    <billing:state>PA</billing:state>
    <billing:zip>95819</billing:zip>
    <billing:country>USA</billing:country>
  </address>

Ultimately, the prefix makes it possible to clearly tell apart elements from different namespaces. In the example, the street element with the shipping prefix holds the shipping address, while the street element with the billing prefix refers to the billing address.
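As a hedged illustration of how a query tells the two street elements apart, the following Python sketch uses xml.etree.ElementTree with a prefix-to-URI map. The XML is a trimmed copy of the address element; the NS dictionary and the variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the address element from the sample order.
ADDRESS_XML = """<address xmlns:shipping="http://localhost/XML/delivery"
         xmlns:billing="http://localhost/XML/billing">
  <shipping:street>123 Maple Street</shipping:street>
  <billing:street>8 Oak Avenue</billing:street>
</address>"""

# Prefixes used in the query are resolved through this map; what matters
# is that the URIs match those declared in the document.
NS = {
    "shipping": "http://localhost/XML/delivery",
    "billing": "http://localhost/XML/billing",
}

address = ET.fromstring(ADDRESS_XML)
shipping_street = address.find("shipping:street", NS).text
billing_street = address.find("billing:street", NS).text

# Without a namespace, the name "street" matches neither element.
unqualified = address.find("street")

print(shipping_street, "|", billing_street, "|", unqualified)
```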

Processing instruction node

Processing instructions in an XML document are represented as processing instruction nodes; they can stand outside the root element, as in the example. A processing instruction begins with <? and ends with ?>.

In the example above, we find the following processing instruction:

  <?xml-stylesheet type="text/css" href="style.css"?>

The XML declaration at the beginning of the file has a similar syntax, but it is not considered a processing instruction node in the XPath data model.

Comment node

XPath treats content marked as a comment in an XML document as a comment node. This node contains only the comment text, not the surrounding <!-- and --> markup.

In the example above, we find the following comment node:

  That is a comment!  

Location paths

Nodes are searched for with the help of location paths. A location path is an XPath expression used to navigate through the tree and select the desired set of nodes.

Location paths are evaluated from left to right and can be absolute or relative. An absolute path begins at the document node and is indicated by a leading forward slash (/). A relative location path, in contrast, can start at any node in the tree; that starting point is called the context node.

The path expression consists of steps separated by slashes (/), similar to the way files are addressed in a file system.

Each step in a path expression consists of up to three parts: an axis, a node test, and any number of predicates.

  • Axis: the axis determines the direction of navigation in the tree structure, starting from the context node or the document node.
  • Node test: the node test is a filter that narrows down the set of nodes located on the axis.
  • Predicates: predicates filter once more the nodes selected by the axis and the node test.

The steps of an XPath location path are written according to the following syntax:

  axis::node-test[predicate1][predicate2]...

Notation | Function
/ | Separates the steps of a path
:: | Separates the axis from the node test

Axes

The XPath syntax allows navigating along the following axes.

Axis | Meaning | Selected nodes
child | children | All directly subordinate child nodes of the context node
parent | parent node | The parent node immediately above the context node
descendant | descendants | All descendant nodes of the context node
ancestor * | ancestors | All nodes above the context node
following | following nodes | All nodes that appear after the context node in document order, with the exception of its descendants
preceding * | preceding nodes | All nodes that appear before the context node, with the exception of its ancestors
following-sibling | following siblings | All following nodes that share the parent node of the context node
preceding-sibling * | preceding siblings | All preceding nodes that share the parent node of the context node
attribute | attributes | All attribute nodes of an element node
namespace | namespaces | All namespace nodes of an element node; this axis is deprecated as of XPath 2.0
self | current node | The context node itself
descendant-or-self | descendants and the current node | The context node and all of its descendant nodes
ancestor-or-self * | ancestors and the current node | The context node and all of its ancestor nodes

Note

The axes marked with an asterisk (*) are reverse axes, which select nodes that precede the context node in document order. Not every XPath implementation supports them.

The following graphic shows a schematic representation of the most important axes in the Xpath data model, starting from the context node (in red).

Figure: The document tree is covered in its entirety by the five axes self, ancestor, descendant, preceding and following. The figure also shows the child and parent axes, which overlap with the descendant and ancestor axes. The letters indicate document order.

The child:: axis, for example, returns all child elements of context node D: the node set comprises nodes E, H, and I.
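To make the axes more concrete, here is a minimal Python sketch that reimplements a few of them by hand on top of xml.etree.ElementTree. It is not how an XPath engine works internally: the helper functions are hypothetical, they only see element nodes, and the ancestor helper stops at the root element because ElementTree has no document node. The XML is a skeleton of the sample order.

```python
import xml.etree.ElementTree as ET

# Skeleton of the sample order (element structure only).
ORDER_XML = """<order>
  <address/>
  <comment/>
  <items>
    <book><title/></book>
    <book><title/></book>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)

# ElementTree elements do not store their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

def child(node):
    return list(node)

def descendant(node):
    # iter() yields the node itself first, so skip it.
    return list(node.iter())[1:]

def ancestor(node):
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result

def preceding_sibling(node):
    if node not in parent_of:
        return []
    siblings = list(parent_of[node])
    return siblings[:siblings.index(node)]

def following_sibling(node):
    if node not in parent_of:
        return []
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]

# Use the items element as the context node.
context = root.find("items")
for axis in (child, descendant, ancestor, preceding_sibling, following_sibling):
    print(axis.__name__, [node.tag for node in axis(context)])
```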

Node test

With the node test, you filter the set of nodes selected by the axis. According to the XPath specification, there are two possible criteria.

  • Node name: filter the nodes on the axis that have a certain name.
  • Node type: select all nodes on an axis that share the same type.

Node name as filter criteria

In the example code above, all descendants of the document node that have the name book can be selected with the following path expression.

  /descendant::book

However, if you want to select the isbn attribute nodes of all element nodes named book, you need a path expression with two steps.

  /descendant::book/attribute::isbn  
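The two path expressions above can be approximated in Python with xml.etree.ElementTree. This is only a sketch: ElementTree supports a subset of the abbreviated XPath syntax and does not accept unabbreviated axis names such as descendant::, so the code uses its own API for the same selections. The XML is a trimmed copy of the sample order.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample order.
ORDER_XML = """<order date="2019-02-01">
  <items>
    <book isbn="9781408845660"><title>Harry Potter and the Prisoner of Azkaban</title></book>
    <book isbn="9780544003415"><title>The Lord of the Rings</title></book>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)

# Counterpart of /descendant::book: every element named book in the tree.
books = list(root.iter("book"))

# Counterpart of /descendant::book/attribute::isbn: ElementTree cannot
# return attribute nodes, so read the attribute values instead.
isbns = [book.get("isbn") for book in books if book.get("isbn") is not None]

print(len(books))
print(isbns)
```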

Node type as filter criteria

If you want to use the node type as the filter criterion for selecting a set of nodes, use one of the following node tests, which are written like functions:

Node test | Selected nodes
node() | The node() test returns all nodes on the selected axis.
text() | The text() test returns all text nodes on the selected axis.
comment() | The comment() test returns all comment nodes on the selected axis.
processing-instruction() | The processing-instruction() test returns all processing instruction nodes on the selected axis.

Note

XPath 1.0 defines 27 functions, and as of XPath 2.0, 111 functions are available for the description of location paths. You will find a compendium in the W3C XPath and XQuery Functions and Operators 3.1 recommendation of March 21, 2017.
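Filtering by node type can be imitated in Python with xml.dom.minidom, which keeps comment and text nodes. The sketch below is an approximation based on the DOM, not an XPath engine; the descendants helper is hypothetical, and the XML is a compact excerpt of the sample order written without indentation so that no whitespace-only text nodes appear.

```python
from xml.dom import minidom

# Compact excerpt of the sample order: one comment node and one element
# that happens to be named "comment".
XML = (
    "<!--That is a comment!-->"
    "<order><comment>Please use gift wrapping!</comment></order>"
)

def descendants(node):
    """Yield all descendant nodes of `node` in document order."""
    for child in node.childNodes:
        yield child
        yield from descendants(child)

doc = minidom.parseString(XML)

# Rough counterpart of /descendant::comment()
comments = [n.data for n in descendants(doc) if n.nodeType == n.COMMENT_NODE]

# Rough counterpart of /descendant::text()
texts = [n.data for n in descendants(doc) if n.nodeType == n.TEXT_NODE]

print(comments)
print(texts)
```

The element named comment is an element node, so only its text content shows up, and only in the second list.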

Wildcard node test

If you use the wildcard * (asterisk) as the node test, all nodes of the principal node type of the selected axis are returned. For most axes, the principal node type is the element node. The exceptions are the attribute and namespace axes, whose principal node types are the attribute node and the namespace node, respectively.

The following path expression, for example, returns all the attributes of the current context node:

  attribute::*  

Abbreviated notation

For frequently used axes and location steps, abbreviations have been defined that can be used in place of the full notation.

Standard notation | Abbreviation | Example
child:: | (empty) | child is the default axis, so it can be omitted. The path expression child::book/child::title is therefore abbreviated to book/title.
attribute:: | @ | The attribute axis, including the double colon, can be abbreviated with the @ symbol. The path expression book/attribute::isbn returns the isbn attribute node of the book element and is written in abbreviated notation as book/@isbn.
/descendant-or-self::node()/ | // | The location step /descendant-or-self::node()/ selects the document node and all of its descendants; its abbreviated form is //. Instead of /descendant-or-self::node()/child::item, write //item, and the location path returns all item nodes in the document.
parent::node() | .. | The parent::node() step returns the parent node of the context node and is abbreviated with two periods (..).
self::node() | . | The self::node() step returns the current context node and is abbreviated with a single period (.).
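Python's xml.etree.ElementTree understands part of this abbreviated notation, which makes it easy to try out. The following sketch assumes a trimmed copy of the sample order and uses paths relative to the root element, since ElementTree queries start from an element rather than from the document node.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample order.
ORDER_XML = """<order>
  <items>
    <book isbn="9781408845660"><title>Harry Potter and the Prisoner of Azkaban</title></book>
    <book isbn="9780544003415"><title>The Lord of the Rings</title></book>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)

# items/book/title is short for child::items/child::book/child::title.
titles_by_steps = [t.text for t in root.findall("items/book/title")]

# .//title selects title elements at any depth below the context node.
titles_anywhere = [t.text for t in root.findall(".//title")]

# .. steps from each title up to its parent element.
parents = [p.tag for p in root.findall(".//title/..")]

print(titles_by_steps)
print(parents)
```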

Predicates

Predicates allow you to fine-tune the node search performed with the axis and node test.

They are the third part of a location step, are optional, and are written in square brackets. The filter criteria in the brackets are formulated as expressions containing, among other things, path expressions, functions, operators, and strings.

XPath supports both general and numeric predicates.

General predicates

General predicates filter the set of nodes selected by the axis and the node test by assigning a Boolean value (true or false) to each node, so that only nodes for which the condition is true are included in the result.

Expressions for general predicates are formulated with operators. They are used to select nodes with specific content or characteristics, for example all nodes that contain certain characters, have a certain attribute, or have a given child element (if required, in a certain position).

The following tables summarize the available operators, classified into arithmetic, comparison and logical operators.

Arithmetic operator | Function
+ | addition
- | subtraction
* | multiplication
div | division (the result is not truncated to an integer)
mod | remainder of a division (modulo)

Comparison operator | Function
= | equal
!= | not equal
< | less than; in XSLT, it must be escaped as &lt;
> | greater than; in XSLT, escaping it as &gt; is recommended
<= | less than or equal; in XSLT, it must be written as &lt;=
>= | greater than or equal; in XSLT, writing it as &gt;= is recommended

Logical operator | Function
and | logical AND
or | logical OR

In the following example, the predicate [title="Harry Potter and the Prisoner of Azkaban"] limits the results to the element node named book whose child element title contains the character string Harry Potter and the Prisoner of Azkaban.

Note

The example uses XPath 3 syntax, which may not be supported by all online tools. The query examples presented here can easily be tried out with the following online tester: http://videlibri.sourceforge.net/cgi-bin/xidelcgi.

  /order/items/book[title="Harry Potter and the Prisoner of Azkaban"]  

Here, we have selected the book element node that contains the data for the Harry Potter book.

  <book isbn="9781408845660">
    <title>Harry Potter and the Prisoner of Azkaban</title>
    <quantity>1</quantity>
    <priceus>22.94</priceus>
    <comment>Please confirm delivery date until Christmas.</comment>
  </book>

Another child element of this element node is comment. To select its content, the path expression only has to be extended by two more steps.

  /order/items/book[title="Harry Potter and the Prisoner of Azkaban"]/comment/text()

With the comment step (short for child::comment), we navigate to the child of the book element with that name, and with the text() node test we select its text node. This corresponds to the following string of characters:

  Please confirm delivery date until Christmas.  
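The same selection can be sketched in Python with xml.etree.ElementTree, which supports predicates of the form [tag='text']. ElementTree has no text() step, so the sketch reads the text through the .text attribute instead; the path is relative to the root element order, and the XML is a trimmed copy of the sample order.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample order.
ORDER_XML = """<order>
  <items>
    <book isbn="9781408845660">
      <title>Harry Potter and the Prisoner of Azkaban</title>
      <comment>Please confirm delivery date until Christmas.</comment>
    </book>
    <book isbn="9780544003415">
      <title>The Lord of the Rings</title>
    </book>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)

# Keep only the book whose title child has exactly this text.
book = root.find("items/book[title='Harry Potter and the Prisoner of Azkaban']")

# Step to the comment child and read its text content.
comment_text = book.find("comment").text

print(book.get("isbn"))
print(comment_text)
```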

When a predicate consists of nothing but a path expression, it acts as an existence test. The path expression shown below checks which book nodes in the XML document above have one or more child nodes named comment.

Abbreviated notation:

  //book[comment]  

Standard notation:

  /descendant-or-self::node()/child::book[child::comment]  

The //book[comment] path returns all nodes named book that have a child element named comment.
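An existence test of this kind can be tried in Python with xml.etree.ElementTree, which supports the [tag] predicate. The sketch below uses a trimmed copy of the sample order; the leading dot makes the path relative to the root element, as ElementTree requires.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample order. Note the comment element directly
# under order, which is not a child of any book.
ORDER_XML = """<order>
  <comment>Please use gift wrapping!</comment>
  <items>
    <book isbn="9781408845660">
      <title>Harry Potter and the Prisoner of Azkaban</title>
      <comment>Please confirm delivery date until Christmas.</comment>
    </book>
    <book isbn="9780544003415">
      <title>The Lord of the Rings</title>
    </book>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)

# Select every book element that has at least one comment child.
books_with_comment = root.findall(".//book[comment]")
titles = [book.find("title").text for book in books_with_comment]

print(titles)
```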

Numerical predicates

Numeric predicates allow you to locate nodes by their position. The path expression below returns the book node in second position. Positions are counted among the book children of the same parent, which in the example is the single items element.

  //book[2]

Strictly speaking, the predicate [2] is an abbreviation of [position() = 2]. XPath first selects all nodes named book and then keeps those for which the expression position() = 2 evaluates to true.

Note

Unlike in many programming languages, position counting in XPath begins at 1.
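The 1-based counting is easy to check in Python with xml.etree.ElementTree, which supports positional predicates and last(). This is a small sketch over a trimmed copy of the sample order; the comparison with a Python list is included only to contrast the two indexing conventions.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample order.
ORDER_XML = """<order>
  <items>
    <book><title>Harry Potter and the Prisoner of Azkaban</title></book>
    <book><title>The Lord of the Rings</title></book>
  </items>
</order>"""

root = ET.fromstring(ORDER_XML)

first = root.find(".//book[1]").find("title").text
second = root.find(".//book[2]").find("title").text
last = root.find(".//book[last()]").find("title").text

# Python lists are 0-based, so position 2 in XPath is index 1 here.
books = root.findall(".//book")
second_by_index = books[1].find("title").text

print(first)
print(second)
```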

Additional information about XML Path Language

On the W3C website, you will find an overview of the development status of the XML Path Language, as well as all standards and drafts.

Similarly, you will find free tools and information for using Xpath with web applications at MDN Web Docs and on the Microsoft Developer Network.


...