WebDjVuTextEd Wiki

Edit the OCR text layer of DjVu documents in a web browser

Brought to you by: kempelen

WebDjVuTextEd - JavaScript DjVu Text layer editor

WebDjVuTextEd lets you edit the positioned text layer of OCR'ed DjVu documents in a web browser. You can modify the paragraph, line, and word structure; create, delete, and edit text nodes; adjust their container boxes with the mouse; and run a spellchecker. The program does not read DjVu files directly; it requires exported text data and images. The server side is a very simple file save routine; most of the editor is implemented in JavaScript.

THIS WORKS ON CHROME ONLY - (More browser support may come later.)

Please try the Online Demo.

Features

View text layer over the book page image, similar to DjVu viewers
View text data in a tree structure beside the image
Edit text structure: modify, create new nodes, delete, cut&paste, merge..
Edit coordinate boxes of words: drag sides to resize, move
Spellchecker (requires PHP server side)
Loads data from DjVuLibre 'djvutoxml' export, loads images from DjVuLibre ddjvu export. Saves data to XML that you can import with DjVuLibre djvuxmlparser

[ReleaseNotes]

Installation

A webserver with PHP or ASP.NET is required for server-side support. Without a webserver or your own installation, you can still save the document by copy-pasting the DjVu XML data from the browser, or by using "Save as..." to save it to your computer.

Copy the files to the webserver
Configure the save password in save.php (PHP) or web.config (ASP.NET)
Grant the webserver write access to the "data" directory
Be aware that the save routine writes any uploaded file whose name ends with ".xml" into this directory.
You can protect the "data" directory with HTTP login, using the webserver's own features, to enhance security.
To use SpellChecker you also need
- php-pspell + pspell + language packs (e.g. aspell-hu) OR
- php-enchant + enchant + language packs ([SpellChecker] setup)
Point your browser to the installation folder; the WebDjVuEd File Manager window should greet you.

Usage

Extract text XML and image files from DjVu

You need to extract the XML from your book and copy it to the "data" directory.

djvutoxml mybook.djvu mybook.xml

To view book pages as background, you also need to extract the page images and copy them to e.g. "data/mybook".

ddjvu -format=tif mybook.djvu mybook.tif

Then explode the multipage TIFF to individual PNG files. On Linux:

mkdir mybook
# 129 is the index of the last page in this example; use (page count - 1) for your book
for i in {0..129}; do convert "mybook.tif[$i]" "mybook/mybook-$i.png"; done

On Windows you can use XnView's Tools -> Multipage File -> Extract all into... In the "tools" directory you can find a Bash shell script that helps extract data from DjVu.
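
The extraction steps above can also be scripted. The following is a minimal Python sketch, not part of WebDjVuTextEd: it only builds the same kinds of command lines shown above (djvutoxml, ddjvu, ImageMagick convert) for a given page count. The helper names extraction_commands and run_all, and the idea of passing the page count in by hand, are assumptions made for illustration.

```python
import subprocess


def extraction_commands(book: str, page_count: int) -> list[list[str]]:
    """Build the command lines used above for a book named <book>.djvu.

    `book` is the file name without extension and `page_count` is the
    number of pages; both are supplied by the caller (hypothetical helper).
    """
    commands = [
        ["djvutoxml", f"{book}.djvu", f"{book}.xml"],
        ["ddjvu", "-format=tif", f"{book}.djvu", f"{book}.tif"],
        ["mkdir", "-p", book],
    ]
    # One PNG per page; ImageMagick addresses TIFF frames as file.tif[index].
    for i in range(page_count):
        commands.append(["convert", f"{book}.tif[{i}]", f"{book}/{book}-{i}.png"])
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands for a 3-page book instead of running them.
    for command in extraction_commands("mybook", 3):
        print(" ".join(command))
```

Calling run_all(extraction_commands("mybook", 130)) would correspond to the 130-page example above, assuming the DjVuLibre and ImageMagick tools are installed and on the PATH.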

Load book to the editor

Point your browser to the installed WebDjVuTextEd DjVu editor, e.g. http://localhost/webdjvutexted/

In the "Load book" form, enter the location of the DjVu XML file and the path to the images (relative to the XML), then press Load book.

Example: Assuming you use the default data directory for your XML and a subdirectory for images:

data/gozgep_demo.xml
data/gozgep/gozgep-0.png
data/gozgep/gozgep-1.png
...

The image name is a pattern, but if you enter the first page's file name, the pattern should be created automatically. The (last) number in the file name must be a page index, unless one single file is used. (If images start from "0", the pattern will contain "%"; if images start from "1", the pattern will contain "#". If numbers are padded with zeroes, the character is repeated, e.g. %%% or ###. For more info see [FileOpenSave].)
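
To make the pattern rule concrete, here is a small Python sketch of how such a pattern could be derived from the first page's file name and then filled in for a given page. It is only a reading of the rule as described in this section, not the editor's actual JavaScript code; the function names guess_image_pattern and image_name_for_page are hypothetical, and [FileOpenSave] remains the authoritative description.

```python
import re


def guess_image_pattern(first_file_name: str) -> str:
    """Replace the last number in the first page's file name with % or #."""
    numbers = list(re.finditer(r"\d+", first_file_name))
    if not numbers:
        # No page number at all: treat it as one single image file.
        return first_file_name
    last = numbers[-1]
    digits = last.group()
    if int(digits) == 0:
        marker = "%"  # page numbering starts from 0
    elif int(digits) == 1:
        marker = "#"  # page numbering starts from 1
    else:
        raise ValueError("first page number must be 0 or 1")
    # One marker per digit, so zero padding is preserved (001 -> ###).
    return (first_file_name[:last.start()]
            + marker * len(digits)
            + first_file_name[last.end():])


def image_name_for_page(pattern: str, page_index: int) -> str:
    """Fill a pattern for a zero-based page index."""
    runs = list(re.finditer(r"%+|#+", pattern))
    if not runs:
        return pattern  # single-file case
    run = runs[-1]
    offset = 0 if run.group()[0] == "%" else 1
    number = str(page_index + offset).zfill(len(run.group()))
    return pattern[:run.start()] + number + pattern[run.end():]
```

Under these assumptions, "gozgep/gozgep-0.png" gives the pattern "gozgep/gozgep-%.png". Note that this sketch would be confused by a file extension that itself contains digits.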

You can also choose the XML and PNG files from your PC using the file browse buttons. For the images, select multiple files; the same "file pattern" system will be used.

The recommended use of the editor is to install it on your own webserver and let PHP or ASP.NET save pages in the "data" directory.

Understand DjVu text structure

Every DjVu text layer consists of a structure like this (see the DjVu spec):

PAGECOLUMN
|-REGION
  |-PARAGRAPH
    |-LINE
      |-WORD
        |-CHARACTER

Any of these may be the "last node" that contains text and the coordinates of that text in the underlying image's coordinate system. A node that has child nodes cannot contain text with coordinates.

Content should appear in this tree in reading order.
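
As an illustration of this structure, the Python sketch below parses a tiny hand-written fragment and lists its "last nodes" in reading order. The fragment is not taken from a real book: the coordinate values are made up, and the assumption that coordinates sit in a coords attribute as four comma-separated integers reflects typical djvutoxml output and should be checked against your own export. The helper name last_nodes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real djvutoxml files wrap this in more markup.
SAMPLE = """
<PAGECOLUMN>
  <REGION>
    <PARAGRAPH>
      <LINE>
        <WORD coords="100,250,180,220">Hello</WORD>
        <WORD coords="190,250,270,220">world</WORD>
      </LINE>
    </PARAGRAPH>
  </REGION>
</PAGECOLUMN>
"""


def last_nodes(element):
    """Yield nodes without children, in document (reading) order."""
    children = list(element)
    if not children:
        yield element
        return
    for child in children:
        yield from last_nodes(child)


root = ET.fromstring(SAMPLE)
for node in last_nodes(root):
    # Split the coords attribute into integers; empty if the attribute is missing.
    coords = [int(value) for value in node.get("coords", "").split(",") if value]
    print(node.tag, repr(node.text), coords)
```

Here the last nodes are the two WORD elements; in a document with LINE-level text, the same function would return LINE elements instead.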

In most cases WORD contains a word in a box. But there are some documents that store every single character in a separate box (not feasible to edit manually). There might also be documents that contain LINEs only, with the text written into LINE, but WORD is the most common level of separation.

For example, in Document Express, you can choose WORD and CHARACTER level separation.

Edit the DjVu structure, OCR text content and word boundary boxes

You can see and edit the mentioned "tree structure" on the left side of the screen. To modify the tree, use the right-click menu on the tree or on the word boxes. The following options are available.

Note: since the most common separation is WORD, the "last node" referred to below is usually the WORD node, while container nodes are LINE, PARAGRAPH etc.

Edit: edits the text, only on last nodes (also: double click or F2)
Delete: deletes the selected nodes and all their child nodes (there is no confirmation box if none of the nodes contains actual text)
Create new before: inserts a node of the same type directly before the clicked node. If the new node is a last node, it needs a word box; if there is enough information, the program creates the word box at an estimated location.
Create new after: same, but inserts after the clicked node.
Create new into: adds a new child node at the end of the child nodes. Also creates the text box if the coordinates can be estimated.
Add coords box: manually creates the coordinate box, which you then need to drag to the right location. Available only on last nodes and only if there is no coord box yet.
Merge selected: merges the selected nodes. Only nodes on the same level can be selected, and they must form a continuous sequence. Either all of them or none of them must have text-coords. If there are no text-coords, the nodes are simply arranged under the same parent. If the selected nodes are text, they are joined into one text box containing all the selected text.
Merge (with spaces): Same as merge, but when merging text, this adds spaces between the merged text nodes. However usually each WORD is a separate box, so this is not needed.
Merge to parent: Merges the selected nodes into their parent node. All child nodes must be selected. Useful to merge for example a series of CHARACTER nodes into a WORD.
Merge to parent (spaces): Same as previous, but concatenates with spaces. May be useful to merge WORDs into a LINE, but that's an unusual separation.
Merge two columns: This can create continuous lines from lines mistakenly split by the OCR into two page-columns. As an example, consider page numbers in a Table of Contents that ended up as another page COLUMN instead of being part of the same lines. You will have to rearrange the nodes first to let this merge work. To use it, select lines in a paragraph and the same number of lines in another paragraph; all words of the second paragraph will then be added to the end of the lines of the first paragraph. See [MergeTwoColumns].
Cut: cuts the selected nodes, then 3 new menus will appear: Paste before, Paste after, Paste into. These work similarly to "create". Paste lets you move a node to a higher level than where it belongs; in this case all necessary parents will be created. Cut requires that all selected nodes are the same type.
Copy: Works differently from Cut&Paste! While cut does not change the coordinate data and is mainly used for fixing structural errors, copy can be used to create new items, which therefore need new coordinates. If you copy something to another line, the item will try to retain its horizontal position, unless that overlaps with the text used for paste. Copy handles only one selected node, and it must have text and coords and no child nodes.
Bring to view: scrolls the picture view so that the word becomes visible. (Or when used on the page-image, scrolls the tree pane.)
Split: there is no dedicated split menu for doing the opposite of a merge, but you can split a word into several words by typing space characters into it; see [SplitToWords]
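
The effect of the two merge variants on text nodes can be pictured with a short Python sketch. This is an illustration of the behaviour described in the list above, not the editor's code: the TextNode class, its left/top/right/bottom field names, and the assumption that the merged box is the smallest box covering all inputs (with y growing downwards) are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    """A last node: text plus its box (hypothetical field names)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def merge_nodes(nodes: list[TextNode], with_spaces: bool = False) -> TextNode:
    """Join the texts in order and take the smallest box covering all inputs."""
    if not nodes:
        raise ValueError("nothing to merge")
    separator = " " if with_spaces else ""
    return TextNode(
        text=separator.join(node.text for node in nodes),
        left=min(node.left for node in nodes),
        top=min(node.top for node in nodes),
        right=max(node.right for node in nodes),
        bottom=max(node.bottom for node in nodes),
    )
```

With with_spaces=False this mirrors "Merge selected" (for example, gluing a wrongly split word back together); with with_spaces=True it mirrors "Merge (with spaces)".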

Note that some operations may leave useless empty tree nodes behind. (The editor does not know if you plan to use them.)
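
If you want to clean up such leftovers after editing is finished, something like the following Python sketch could be run on a copy of the saved XML. It is not a feature of the editor; the function name prune_empty is hypothetical, and it assumes that a node with neither child nodes nor non-blank text is safe to drop, which you should verify for your own document.

```python
import xml.etree.ElementTree as ET


def prune_empty(element: ET.Element) -> int:
    """Remove descendants that have neither children nor text; return the count."""
    removed = 0
    for child in list(element):
        # Clean up below the child first, so emptied containers are caught too.
        removed += prune_empty(child)
        has_text = bool((child.text or "").strip())
        if len(child) == 0 and not has_text:
            element.remove(child)
            removed += 1
    return removed
```

Because the function works bottom-up, a LINE whose only WORD was empty is removed together with that WORD.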

There are also several [HotKeys].

On the right side you'll see the page image and the overlapping text boxes with boundary borders. When you select a box, 5 drag handles will appear. Please keep in mind that, according to the DjVu standard, word boxes should not overlap, and a "line" (which holds the maximum extent of all contained words) should not overlap with the previous or next line either. This editor does not enforce this. According to my tests, overlapping does not cause problems.
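
Since the editor does not enforce the no-overlap rule, you may want to check for overlaps yourself. The Python sketch below shows one way to do that for a list of boxes. It is an illustration only: the (left, top, right, bottom) tuple layout and the function names are assumptions, and boxes that merely touch along an edge are not counted as overlapping.

```python
from itertools import combinations


def boxes_overlap(a, b) -> bool:
    """True if two (left, top, right, bottom) boxes share any interior area."""
    a_left, a_top, a_right, a_bottom = a
    b_left, b_top, b_right, b_bottom = b
    return (a_left < b_right and b_left < a_right
            and a_top < b_bottom and b_top < a_bottom)


def overlapping_pairs(boxes):
    """Return index pairs (i, j), i < j, of boxes that overlap."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if boxes_overlap(a, b)
    ]
```

The same check can be applied to line boxes by first computing each line's box as the extent of its words.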

Spellchecker

When you press the "Spellcheck" button, the current page will be checked and errors will be marked red. Clicking an error will present you with suggestions. To remove spellchecking data from the current page, you can press the "X" (so you can continue editing without getting word suggestions).

Note that you can change the spell engine and language any time without reloading the book.

More info about the [SpellChecker] setup.

Save backends

When you switch pages, the program sends the whole document to the server to save; the server should be able to write into the same XML file that you loaded. So if you reload with F5, you will load your modified file.

Dump to textarea: There is no autosave in this mode. When you press "Save document", the XML code will be shown in a popup window that you can save by copy-paste. (This mode works when you are using the Online Demo or a local copy without webserver.)
Save as..: There is no page-turn autosave in this mode. When you press "Save document", a "Save as" window will appear allowing you to save the XML data file to your computer. (This mode also works in no-webserver modes.)
PHP: requires a webserver with PHP and WebDjVuTextEd installed there. In this case turning pages will automatically save the XML data; this is the recommended use.
ASP.NET: same as PHP but for Windows Servers.

Load finished XML text to the DjVu OCR layer

When you are done with the updates, you can write the XML data back into the DjVu file:

djvuxmlparser -o mybook.djvu mybook.xml

Happy DjVu editing!
Hungarian users, please look at my http://www.djvu.hu website for more information.

Credits

WebDjVuEd
(c) 2014-2015 Ferenc Veres, GPL v3
https://sourceforge.net/projects/webdjvutexted/

jquery & jquery-ui
(c) 2014 The jQuery Foundation, MIT license
http://jquery.com/

jquery-spellchecker
(c) 2012 Richard Willis, MIT license
https://github.com/badsyntax/jquery-spellchecker

jstree.com
(c) 2014 Ivan Bozhanov, MIT license
http://jstree.com/

FileSaver.js
(c) 2013 Eli Grey, MIT license
https://github.com/eligrey/FileSaver.js

shortcut.js
(c) Binny V A, BSD license
http://www.openjs.com/scripts/events/keyboard_shortcuts/

jQuery.scrollTo
(c) 2007-2015 Ariel Flesler, MIT license
http://flesler.blogspot.com/2007/10/jqueryscrollto.html

Project Members:

Kempelen (admin)