When you build a website with MyST, two special kinds of metadata are bundled with the site. Each is explained below.
Page document metadata as .json data URLs¶
All webpages built with MyST come bundled with a JSON representation of their content. This is a machine-readable version of the page that contains all of the metadata and page structure defined by the MyST specification. Access the MyST JSON representation of a page by looking up the page’s data URL.
When using folder URLs: For sites with the folders option enabled, this URL can be found by:
- Removing any trailing /
- Replacing / with . in the pathname of the page
- Adding a .json extension
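The steps above can be sketched in Python. This is a minimal illustration, not part of MyST itself: the function name `page_to_data_url` and its `folders` flag are hypothetical, and the fallback for sites without folder URLs simply appends .json. The site root is not handled here.

```python
from urllib.parse import urlsplit, urlunsplit


def page_to_data_url(page_url: str, folders: bool = True) -> str:
    """Derive a page's data URL from its page URL (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Step 1: remove any trailing "/"
    path = parts.path.rstrip("/")
    if folders:
        # Step 2: replace "/" with "." in the pathname, keeping the leading "/"
        path = "/" + path.lstrip("/").replace("/", ".")
    # Step 3: add the .json extension (query string and fragment are dropped)
    return urlunsplit((parts.scheme, parts.netloc, path + ".json", "", ""))


print(page_to_data_url("https://foo.com/folder/subfolder/page/"))
print(page_to_data_url("https://foo.com/long-page-name", folders=False))
```
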
For example:
https://foo.com/folder/subfolder/page/
https://foo.com/folder.subfolder.page.json
For websites without folder structure: simply add .json to the end of the URL.
For example:
https://foo.com/long-page-name
https://foo.com/long-page-name.json
MyST cross-reference data with myst.xref.json¶
When you create a MyST project on the web, all references in your MyST site are listed in a file that can be referenced by other projects using External References. This allows for programmatic reading of all MyST identifiers in a project (e.g. unique labels and the URL to which each resolves).
This is served in a file called myst.xref.json at the website root, and provides a list of reference links in JSON.
For example, the cross-references file for the MyST Guide is at this location:
https://mystmd.org/guide/myst.xref.json
Below is an example structure of this file:
{
  "version": "1",
  "myst": "1.2.0",
  "references": [
    {
      "kind": "page",
      "data": "/index.json",
      "url": "/"
    },
    {
      "identifier": "xref-features",
      "kind": "heading",
      "data": "/index.json",
      "url": "/",
      "implicit": true
    }
  ]
}
The myst.xref.json data structure has three entries:
- version
  The version of the myst.xref.json schema.
- myst
  The version of the mystmd CLI that created the myst.xref.json data.
- references
  A list of references that are exposed by the project. Each object includes:
  - identifier
    - The identifier in the project for this reference. This will be unique in the project unless there is an implicit flag.
    - This is only optional for pages, which may not have identifiers. All other content must have an identifier.
  - html_id
    - The identifier used on the HTML page, which is stricter than the identifier.
    - This is only included if it differs from the identifier.
  - kind
    - The kind of the reference, for example, page, heading, figure, table.
  - data
    - The location of the content as data. Use this link to find information like the reference’s enumerator, title or children.
    - The URL is relative to the location from which myst.xref.json is served.
  - url
    - The location of the HTML page; the URL is relative to the location from which myst.xref.json is served.
    - For constructing specific links to HTML pages, use <url>#<html_id || identifier>.
  - implicit
    - A boolean indicating that the reference is implicit to a page. This is common for headings, where the page information must be included.
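As a rough sketch of how these fields fit together, the snippet below looks up a reference by identifier and builds an HTML link following the <url>#<html_id || identifier> rule. The helper names `find_references` and `reference_link` are hypothetical, and the sample data is copied from the example structure above rather than fetched from a live site.

```python
# Sample data copied from the example structure above; a real file would be
# loaded from the myst.xref.json served at the website root.
xref = {
    "version": "1",
    "myst": "1.2.0",
    "references": [
        {"kind": "page", "data": "/index.json", "url": "/"},
        {
            "identifier": "xref-features",
            "kind": "heading",
            "data": "/index.json",
            "url": "/",
            "implicit": True,
        },
    ],
}


def find_references(xref: dict, identifier: str) -> list:
    """Return every reference with this identifier.

    Implicit references are only unique within a page, so more than one
    match is possible; pages may have no identifier at all.
    """
    return [ref for ref in xref["references"] if ref.get("identifier") == identifier]


def reference_link(base_url: str, ref: dict) -> str:
    """Build <url>#<html_id || identifier>, relative to where the file is served."""
    page = base_url.rstrip("/") + ref["url"]
    # Prefer html_id (only present when it differs), else fall back to identifier
    fragment = ref.get("html_id") or ref.get("identifier")
    return f"{page}#{fragment}" if fragment else page


match = find_references(xref, "xref-features")[0]
print(reference_link("https://mystmd.org/guide", match))
```

Returning a list from the lookup, rather than a single item, is a deliberate choice here: it keeps the ambiguity of implicit references visible to the caller.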
How to navigate and scrape MyST sites¶
The myst.xref.json file enables programmatic access to all content in a MyST site. Here are some example workflows with Python.
Get all pages and their URLs¶
import requests
from IPython.display import JSON
site_url = "https://mystmd.org/guide"
xref_url = f"{site_url}/myst.xref.json"
print(f"Retrieving URL: {xref_url}")
xref_data = requests.get(xref_url).json()
# Filter for pages only
pages = [ref for ref in xref_data["references"] if ref["kind"] == "page"]
# Display the first 10 pages as JSON (de-indexed)
JSON({f"page_{i+1}": page for i, page in enumerate(pages[:10])})
Get all instances of a specific content type¶
You can filter references by their kind to find all figures, tables, citations, or other content types:
from collections import Counter
# Count all content types
all_types = Counter(ref["kind"] for ref in xref_data["references"])
# Show examples of each type
figures = [ref for ref in xref_data["references"] if ref["kind"] == "figure"][:3]
tables = [ref for ref in xref_data["references"] if ref["kind"] == "table"][:3]
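# References flagged as implicit (e.g., headings that are only unique within their page)
internal_refs = [ref for ref in xref_data["references"] if ref.get("implicit")][:3]
# Headings without the implicit flag (i.e., with an explicit label)
external_refs = [ref for ref in xref_data["references"] if ref["kind"] == "heading" and not ref.get("implicit")][:3]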
JSON({
"all_content_types": dict(all_types),
**{f"figure_{i+1}": fig for i, fig in enumerate(figures)},
**{f"table_{i+1}": tbl for i, tbl in enumerate(tables)},
**{f"internal_ref_{i+1}": ref for i, ref in enumerate(internal_refs)},
**{f"external_ref_{i+1}": ref for i, ref in enumerate(external_refs)}
})
Access page metadata and source information¶
Each reference includes a data field pointing to a JSON file with complete metadata:
# Get first page and fetch its metadata
page = next(ref for ref in xref_data["references"] if ref["kind"] == "page")
data_url = site_url + page["data"]
print(f"Retrieving URL: {data_url}")
page_data = requests.get(data_url).json()
JSON(page_data["frontmatter"])
Download the MyST AST of a page¶
The JSON file at each page’s data URL contains the complete MyST Abstract Syntax Tree (AST), as defined in the MyST specification:
# Get the index page and fetch its AST
index_page = next(ref for ref in xref_data["references"] if ref["url"] == "/")
data_url = site_url + index_page["data"]
print(f"Retrieving URL: {data_url}")
myst_ast = requests.get(data_url).json()
JSON({
"kind": myst_ast["kind"],
"slug": myst_ast["slug"],
"mdast_children_count": len(myst_ast["mdast"]["children"]),
"mdast_first_child": myst_ast["mdast"]["children"][0]
})
Find and download the exports and source file of a page¶
You can locate the original source file and available exports (e.g., PDF, JATS, Microsoft Word) for each page using the page’s JSON data:
# Get a page with "quickstart" in the URL
example_page = next(ref for ref in xref_data["references"] if "quickstart" in ref["url"])
data_url = site_url + example_page["data"]
print(f"Retrieving URL: {data_url}")
page_data = requests.get(data_url).json()
JSON({
"page_url": example_page["url"],
"page_title": page_data["frontmatter"]["title"],
"source_file": page_data["location"],
"exports": page_data["frontmatter"]["exports"]
})
Download and display the source file content. Note that export URLs typically point to a CDN where the files are hosted (e.g., GitHub Pages):
from IPython.display import Markdown
# The export URL points to where the exported file is hosted
cdn_url = page_data["frontmatter"]["exports"][0]["url"]
print(f"Retrieving URL: {cdn_url}")
source_content = requests.get(cdn_url).text
preview = '\n\n'.join(source_content.split('\n\n')[:5])
Markdown(f"**Preview of {page_data['location']}:**\n\n{preview}\n\n---\n*... (content continues)*")
Ensure your document is referenceable with CORS¶
To allow other MyST sites to reference your document, allow Cross-Origin Resource Sharing (CORS) from all origins (by setting Access-Control-Allow-Origin: *). This is on by default with GitHub Pages, but may not be enabled if you use a different provider (like Netlify).
Allowing all origins in CORS lets your site:
- Support MyST cross-references (e.g., xref links, see docs) from other sites that use your website as a citation source.
- Offer content embedding, linking, or automated metadata access (e.g., for previews or API consumers) without authentication or same-origin constraints.
For an example of what it looks like to update CORS settings, see this GitHub PR updating CORS settings for Netlify.
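To check whether a site already allows all origins, you can inspect the response headers of its myst.xref.json file. The sketch below is only illustrative: the helper name `allows_all_origins` is hypothetical, and it assumes the server sends the header on a plain request (some servers only add it when the request carries an Origin header).

```python
from typing import Mapping


def allows_all_origins(headers: Mapping[str, str]) -> bool:
    """Return True if the headers contain Access-Control-Allow-Origin: *."""
    # HTTP header names are case-insensitive, so normalize before looking up
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("access-control-allow-origin") == "*"


# Offline example with hand-written headers
print(allows_all_origins({"Access-Control-Allow-Origin": "*"}))

# Example usage against a live site (requires network access and `requests`):
#   import requests
#   response = requests.get("https://mystmd.org/guide/myst.xref.json")
#   print(allows_all_origins(response.headers))
```
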
What is CORS and why is it needed?¶
CORS is a browser security feature that restricts how resources on a web page can be requested from another domain outside the one that served the web page. For example, JavaScript running on external-site.org cannot fetch metadata or assets from https://
By setting CORS headers to allow all origins (*), you make it possible for external tools and sites to:
- Reference your content directly via structured and stable links.
- Preview or embed sections of your site from another page (with attribution).
- Use your content in federated, cross-site knowledge systems (like MyST Markdown references in external books or educational hubs).
If your MyST site is public and does not require authentication, allowing all origins does not pose a security risk, because the content is already readable by anyone.