Sitemap parser

Status: Alpha

Brought to you by: esmondpitt, vanrogu

#17 Sitemap parser

Milestone: Next Release (example)

Status: accepted

Owner: EJP

Labels: None

Priority: 5

Updated: 2022-02-21

Created: 2021-10-06

Creator: EJP

Private: No

Add a parser for sitemap and sitemap-index files. This parser must be first in the parser list, because it has to match on filename rather than on MIME type. It must handle both the XML and the plain-text sitemap formats, and it must generate a URLFound event for every URL it finds.
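As a rough illustration of the two formats, here is a minimal Python sketch of the extraction step. It is not the project's implementation: the function name `extract_sitemap_urls` is hypothetical, the XML-versus-text decision is a simple assumption (content starting with `<` is treated as XML), and raising a URLFound event for each returned URL is left to the caller.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def _local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}loc' down to 'loc'."""
    return tag.rsplit("}", 1)[-1]


def extract_sitemap_urls(content: str) -> Tuple[str, List[str]]:
    """Return (kind, urls) for the text of a sitemap or sitemap index.

    kind is the XML root element name ('urlset' or 'sitemapindex' for
    well-formed files), or 'text' for the plain-text format.
    Hypothetical helper, for illustration only.
    """
    text = content.strip()
    if text.startswith("<"):
        # Assumption: anything that starts with '<' is the XML format.
        root = ET.fromstring(text)
        urls = []
        # Each <url> or <sitemap> entry carries its address in a <loc> child.
        for entry in root:
            for child in entry:
                if _local_name(child.tag) == "loc" and child.text:
                    urls.append(child.text.strip())
        return _local_name(root.tag), urls
    # Plain-text format: one URL per line, blank lines ignored.
    return "text", [line.strip() for line in text.splitlines() if line.strip()]
```

A caller would generate one URLFound event per returned URL, and when the kind is `sitemapindex` it would treat each URL as a further sitemap to fetch and parse.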
If robots.txt contains a Sitemap: entry, parse the URL it names, whether that is a sitemap or a sitemap index.
Otherwise, or if robots.txt could not be fetched, try to fetch both sitemap.xml and sitemap_index.xml. No names appear to be specified for these files, especially the index, but these are the names that appear to be in common use.
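The discovery rule described above could be sketched roughly as follows. This is an assumption-laden illustration in Python, not the project's code: the function name `sitemap_candidates` is hypothetical, `robots_txt=None` stands for "robots.txt could not be fetched", the `Sitemap:` key is matched case-insensitively, and the fallback names are resolved against the site root.

```python
from typing import List, Optional
from urllib.parse import urljoin


def sitemap_candidates(site_root: str, robots_txt: Optional[str]) -> List[str]:
    """Return the sitemap URLs to try for a site.

    robots_txt is the body of robots.txt, or None if it could not be
    fetched. Hypothetical helper, for illustration only.
    """
    if robots_txt is not None:
        found = []
        for line in robots_txt.splitlines():
            # Split 'Sitemap: https://...' at the first colon only, so the
            # colon inside the URL scheme stays part of the value.
            key, sep, value = line.partition(":")
            if sep and key.strip().lower() == "sitemap" and value.strip():
                found.append(value.strip())
        if found:
            return found
    # No Sitemap: entry, or no robots.txt: fall back to the common names.
    return [
        urljoin(site_root, "/sitemap.xml"),
        urljoin(site_root, "/sitemap_index.xml"),
    ]
```

Each returned URL would then be fetched and handed to the sitemap parser, which decides from the content whether it is a sitemap or a sitemap index.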

EJP - 2022-02-21

status: open --> accepted