dpanalyzer Wiki

postprocessing tool for Project Gutenberg Distributed Proofreaders

Status: Alpha

Brought to you by: ensegre

Home


Aims of this tool:

The terminology used here is that in use at PGDP. dpanalyzer is written as a tool to:

Verify the consistency of a project's markup as it leaves the Formatting rounds, and report problems and possible errors.
Perform automatic reorganizations, such as renumbering pages, moving figures and footnotes out of paragraphs, rejoining paragraphs split across consecutive pages, and rejoining footnotes that continue on consecutive pages.
Convert the DP-formatted text into DP-normalized text, basic LaTeX, or basic HTML, to serve as a starting point for PostProcessing the project.
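
As an illustration of the first aim, a consistency check on block markers might look like the minimal sketch below. It is not taken from dpanalyzer's source: the function name check_block_markers is hypothetical, and the sketch assumes that /*, */, /# and #/ each stand alone on their own line, as in standard DP formatting.

```python
# Hypothetical sketch, not dpanalyzer's actual code.
# Assumes block markers stand alone on a line: /* ... */ and /# ... #/.

CLOSER_FOR = {"/*": "*/", "/#": "#/"}
OPENERS = set(CLOSER_FOR)
CLOSERS = set(CLOSER_FOR.values())


def check_block_markers(lines):
    """Return a list of (line_number, message) for unbalanced block markers."""
    problems = []
    stack = []  # entries are (opening marker, line number where it opened)
    for number, line in enumerate(lines, start=1):
        marker = line.strip()
        if marker in OPENERS:
            stack.append((marker, number))
        elif marker in CLOSERS:
            if not stack:
                problems.append((number, f"{marker} closes nothing"))
            elif CLOSER_FOR[stack[-1][0]] != marker:
                opener, opened_at = stack.pop()
                problems.append(
                    (number, f"{marker} does not match {opener} opened on line {opened_at}")
                )
            else:
                stack.pop()
    # Anything still on the stack was opened but never closed.
    for opener, opened_at in stack:
        problems.append((opened_at, f"{opener} is never closed"))
    return problems


sample = [
    "/#",
    "A quoted passage.",
    "#/",
    "",
    "/*",
    "A line-preserved block",
    "with no closing marker.",
]
for line_number, message in check_block_markers(sample):
    print(f"line {line_number}: {message}")
# prints "line 5: /* is never closed"
```

A stack is used so that a closing marker is always compared with the most recently opened block, which is enough to catch both missing and mismatched markers in one pass.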

The scope of this tool is not universal. It may not suit every PGDP project, but it is hopefully complete and consistent for projects that use the standard DP formatting.
Projects for which special formatting has been requested may not be handled well by dpanalyzer, since special instructions cannot be predicted in general.

Markup that this tool handles, and markup that it does not

In a way, this tool reverse-engineers the Formatting Guidelines. Given the markup found in the DP file, dpanalyzer attempts to infer the semantic or presentational element that called for it. The tags handled are illustrated in the wiki page [DP formatting grammar parsing].
This inference is not always possible without ambiguity, and sometimes not possible at all. In many cases the Guidelines prescribe markup and formatting for one text element that cannot be distinguished from that of another, or can be distinguished only by resorting to fragile heuristics; in other cases the Guidelines leave too much freedom of rendering to be useful. Notably:

Frontispieces, title pages and the like have no defined structural markup. The Guidelines only ask to wrap the whole page in /* */: "Give it to PP, it's his/her job".
Indices and Tables of contents: while the Guidelines set certain fixed rules (removal of the trailing line markers, indentation of index subentries), no feature identifies either with certainty or distinguishes them from, say, poems with isolated verses that happen to end with numbers.
Tables: the Guidelines merely steer formatters toward ASCII art, but nothing in the markup identifies cells, headers and groupings with certainty. The potential variety of tables and textual diagrams also prevents any general rule.
References in the text to Figures and Tables: no form is enforced for these, whereas references to Footnotes are standardized. Hence some automation for collecting and moving footnotes can be implemented, but none for Figures and Tables.
Continued paragraphs in continuing footnotes: continuation can only be determined in the presence of split words, marked by -* and *, but not in general. Hence a Continuing Footnote will be treated as if a new Footnote Paragraph starts with the continuation.
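
The split-word case in the last item can be sketched as follows. This is an illustrative sketch only, not dpanalyzer's code: the function name join_across_pages is hypothetical, and dropping the hyphen when joining is an assumption made here for simplicity.

```python
# Hypothetical sketch, not dpanalyzer's actual code.
# Assumption: a word split across pages ends the first fragment with "-*"
# and starts the second fragment with "*".


def join_across_pages(previous_text, next_text):
    """Join two text fragments, merging a word split with -* and *.

    Returns (joined_text, was_split_word). Without the markers the
    fragments are kept as separate paragraphs, because continuation
    cannot be decided from the markup alone.
    """
    head = previous_text.rstrip()
    tail = next_text.lstrip()
    if head.endswith("-*") and tail.startswith("*"):
        # Drop both markers and the hyphen; whether the hyphen belongs
        # to the word (e.g. "to-day") is left to the post-processor.
        return head[:-2] + tail[1:], True
    return head + "\n\n" + tail, False


print(join_across_pages("the foot-*", "*note continues here."))
print(join_across_pages("End of one page.", "Start of the next."))
```

The second call shows the fallback described above: with no split-word markers, the continuation is treated as the start of a new paragraph.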

Additionally, Titles, Subtitles and Section titles, as well as Blockquotes and Line-preserved blocks, are assumed to contain nothing but text; i.e. (presently) they are treated as elements and not as containers. There are a number of markup situations where this assumption may not hold:

decorated title headings, using [Illustration] separated by a number of blank lines from the bylines;
illustrated frontispieces, where everything including [Illustration] is wrapped up in /* */;
nested blockquotes within poetry, or vice versa;
Thought Break within a Blockquote;
tables, marked as /* */, within which almost anything is possible.

As a first step I have preferred not to aim for universal coverage, so the parser will either report such cases as errors or produce inconsistent results.
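
How such cases might be reported can be sketched as below. This is a hedged illustration rather than dpanalyzer's parser: the function name find_nested_elements is hypothetical, and the sketch assumes that block markers and <tb> each stand alone on a line and that [Illustration starts its line.

```python
# Hypothetical sketch, not dpanalyzer's actual code.
# Assumes /* ... */ and /# ... #/ markers stand alone on a line and that
# blocks are meant to hold plain text only.

BLOCKS = {"/*": "*/", "/#": "#/"}


def find_nested_elements(lines):
    """Return (line_number, message) pairs for elements found inside a block."""
    reports = []
    open_marker = None  # marker of the block currently open, if any
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if open_marker is None:
            if text in BLOCKS:
                open_marker = text
        elif text == BLOCKS[open_marker]:
            open_marker = None
        elif text in BLOCKS:
            reports.append((number, f"{text} nested inside {open_marker}"))
        elif text.startswith("[Illustration"):
            reports.append((number, f"illustration inside {open_marker}"))
        elif text == "<tb>":
            reports.append((number, f"thought break inside {open_marker}"))
    return reports


sample = ["/#", "Quoted text.", "", "<tb>", "", "/*", "A verse", "*/", "#/"]
for line_number, message in find_nested_elements(sample):
    print(f"line {line_number}: {message}")
```

Only the outermost block is tracked, which matches the element-not-container assumption: anything structural found inside it is reported instead of being parsed.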

Use of the program

See the page [User Guide].