XPath, which stands for XML Path Language, is used to access data stored in XML files. XML files are used to store data, and its XML format is also used as the basis for HTML format today. As HTML format is used in web pages, XPath can also be used to access data from web pages.
The HTML syntax consists of a system of tags and attributes that dictate how the content should be displayed. The tags describe in general terms how the content should be displayed. The attributes further refine the formatting and also include ways to identify the specific data stored in the given tag.
Here are some examples of the HTML tags with their corresponding XPaths for the different types of data stored in a webpage:
Page title: //title
Meta description: /html/head/meta[@name='description']/@content
H1: //h1
H2: //h2
H3: //h3
Links: //@href
Tables: //table[n]/tbody/*
Bulleted Lists: //ul/li
Numbered Lists: //ol/li
Often HTML pages will use a certain tag multiple times and when you use it with the IMPORTXML function, you can get more data than you need. In constructing the XPath, you can include the tags enclosing the data with their unique attributes so that you can filter the output to the ones you need.
If you want to include the class attribute of the tag, you add
[contains(@class, ‘class_value’)],
Where the class_value is the class value attribute. For example, if the data we want to include is inside the following code:
<div class=”toc”>
<ul>
<li>Entry one</li>
<li>Entry two</li>
<li>Entry three</li>
.
.
.
</ul>
</div>
Then the XPath is
//div[contains(@class,'toc')]/ul/li/*
Google Sheets will identify the div tag that contains the attribute class with value “toc”, then go further to the ul and li tags to extract the specific contents for each li tag.
The data in websites come in different formats. They can be in the form of
For example, we want to extract the list of spreadsheet software from this Wikipedia page:
Once we identified the data, we then right-click the data we want to import then select Inspect. The Inspect option allows you to view the HTML code used to build the webpage. The Inspect option also centers the code to the portion containing the data.
The Inspect tab shows that the items are stored in <li> tags.
When you move your cursor through the HTML code of the webpage, you will notice that the highlighted part of the webpage changes. You can use it to identify the topmost tag containing the data you need. For our example, it is the div tag with the class mw-content-ltr mw-parser-output.
Additionally, we just want the name of the software. Fortunately, all of them are hyperlinks.
We can include the <a> tag in the XPath.
We can then construct the XPath by taking note of the div class and the list tags present. The XPath becomes:
//div[contains(@class,'mw-content-ltr mw-parser-output')]/ul/li/a
The formula to use is
=IMPORTXML(URL, xpath)
Where URL is the URL of the webpage while xpath is the XPath we use to extract the selected data. We can enclose The URL is
https://en.wikipedia.org/wiki/List_of_spreadsheet_software
And the XPath is
//div[contains(@class,'mw-content-ltr mw-parser-output')]/ul/li
The formula is
=IMPORTXML("https://en.wikipedia.org/wiki/List_of_spreadsheet_software","//div[contains(@class,'mw-content-ltr mw-parser-output')]/ul/li/a")
The result is:
Here are two common errors:
Import Xml content cannot be parsed. It means that the XPath input does not follow the correct syntax; thus Google Sheets cannot import any data from the webpage. Check the formula to make sure there is no typo and that the proper syntax is followed.
Imported content is empty. It can mean two things:
You may need to check the webpage itself to identify the proper XPath.
No. There are websites that actively block codes that import their content. This will cause Google Sheets to output errors in the IMPORTXML function when otherwise the XPath should work.
Additionally, there is data that you can only access by logging in to your account. These include important data about your customers, inventory, and sales stored in the online eCommerce and marketing platforms you use.
These can be accessed by using a built-in API that the platform accesses. It is tricky to set up the access via API in Google Sheets! There is a workaround, however. Try Lido. Unlike Google Sheets, importing data is a built-in feature, and Lido is optimized to import data from different eCommerce and marketing platforms you use such as Hubspot, Salesforce, MySQL databases, etc.