- <html>
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
- <title>jTessBoxEditor - Box Editor &amp; Trainer for Tesseract OCR Data</title>
- <style type="text/css">
- .auto-style1
- {
- text-decoration: underline;
- }
- </style>
- </head>
- <body lang="EN-US">
- <div>
- <h2 style="text-align: center;">
- jTessBoxEditor</h2>
- <h3>
- DESCRIPTION</h3>
- <p>
- <a href="http://vietocr.sourceforge.net/training.html">jTessBoxEditor</a> is a box
- editor and trainer for <a href="https://github.com/tesseract-ocr/">Tesseract OCR</a>.
- It edits box data in both the Tesseract 2.0x and 3.0x formats and fully automates
- Tesseract training. It can read common image formats, including multi-page TIFF.
- </p>
- <p>
- jTessBoxEditor is released and distributed under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License, v2.0</a>.
- </p>
- <h3>
- SYSTEM REQUIREMENTS</h3>
- <p>
- <a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Java Runtime
- Environment 7.0</a> or later.
- </p>
- <h3>
- INSTRUCTIONS</h3>
- <p>
- Double-click on the JAR file to launch the program, or execute the following command:
- </p>
- <blockquote>
- <p>
- <code>java -Xms128m -Xmx1024m -jar jTessBoxEditor.jar</code>
- </p>
- </blockquote>
- <p>
- You will need to provide the TIFF/Box files as input to the editor. Images to be
- used in training should be uncompressed TIFF files at 300 DPI, either 1 bpp (bit per pixel)
- black-and-white or 8 bpp grayscale. Box files, encoded in UTF-8, are generated by
- Tesseract executables with appropriate command-line options (see the
- <a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">Tesseract Training Wiki</a>).
- Alternatively, both can be created using the built-in <em>TIFF/Box Generator</em>.</p>
- <p>
- The following hotkeys are available in Box View for ease of editing:</p>
- <ul>
- <li><strong>W/S</strong> - move box up/down;<strong> A/D</strong> - move box left/right</li>
- <li><strong>Q/E</strong> - decrease/increase box width;<strong> R/F</strong> - decrease/increase box height</li>
- <li><strong>&lt;/&gt;</strong> - previous/next box</li>
- <li><strong>X</strong> - edit character in box</li>
- </ul>
- <p>
- Holding Shift while using the hotkeys multiplies the movement speed by 10.
- Pressing Enter or ESC while editing a character moves focus to the box editor.</p>
- <p>
- Note that the coordinate system used in the box file has (0,0) at the bottom-left;
- on computer graphics devices, however, (0,0) is defined as top-left. jTessBoxEditor
- works in and displays graphics device coordinates, but the edited box files are still
- read and written in the box file's own bottom-left coordinates.
- </p>
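- <p>
- As a rough illustration of the two coordinate conventions, the Python sketch below
- converts a box rectangle between the box file system (origin at bottom-left) and the
- graphics device system (origin at top-left). It is not taken from the jTessBoxEditor
- source: the function names are hypothetical, and the six-field line layout (character,
- left, bottom, right, top, page) is assumed from the Tesseract 3.0x box format.
- </p>

```python
def parse_box_line(line):
    """Split one box-file line into (char, left, bottom, right, top, page).

    Assumes the 3.0x layout "<char> <left> <bottom> <right> <top> <page>".
    Splitting from the right keeps the character field intact.
    """
    parts = line.rstrip("\n").rsplit(" ", 5)
    char = parts[0]
    left, bottom, right, top, page = map(int, parts[1:])
    return char, left, bottom, right, top, page


def box_to_screen(left, bottom, right, top, image_height):
    """Box-file rectangle (origin bottom-left) -> (x, y, width, height), origin top-left."""
    return (left, image_height - top, right - left, top - bottom)


def screen_to_box(x, y, width, height, image_height):
    """Inverse of box_to_screen: top-left rectangle -> (left, bottom, right, top)."""
    return (x, image_height - y - height, x + width, image_height - y)
```

- <p>
- Only the vertical axis is flipped, so the conversion needs the image height in pixels
- and nothing else.
- </p>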
- <p>
- For a given input UTF-8 text file, the generator produces a TIFF/Box pair of files
- suitable for training with Tesseract. Depending on whether anti-aliasing is enabled,
- the generated image is a binary or grayscale, uncompressed multi-page TIFF at 300 DPI.
- Letter tracking, or spacing between characters, can be adjusted to eliminate bounding
- box overlaps. Note that the coordinates of some boxes could differ slightly
- (by 1 or 2 pixels) from the ones that would have been generated by Tesseract
- itself; nevertheless, the generated box file can be used to validate the one created
- by Tesseract using a Unicode-compatible file compare tool, such as <a href="http://sourceforge.net/projects/winmerge/">
- WinMerge</a>.
- </p>
- <p>
- <span class="auto-style1">Tips</span>: Experiments indicate that the quality of
- training with images created by <em>TIFF/Box Generator</em> is higher with font
- sizes 24pt or greater and with some noise added.
- </p>
- <p>
- Combining symbols or diacritics that need to be combined with the main, base character,
- like those found in Devanagari and other Indic scripts, can be specified by the user
- in a UTF-8 text file, specifically <code>data/combiningsymbols.txt</code>, which is
- read by the generator. This setup gives users flexibility in
- defining combining symbols/diacritics for their language scripts.</p>
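- <p>
- To show what attaching combining symbols to a base character means in practice, here
- is a minimal Python sketch. It is an assumption-laden illustration, not the generator's
- actual logic: the function name is hypothetical, the default rule relies on Unicode
- mark categories, and the <code>extra_combining</code> argument merely stands in for
- user-defined symbols (the real layout of <code>data/combiningsymbols.txt</code> is not
- described here).
- </p>

```python
import unicodedata


def group_clusters(text, extra_combining=frozenset()):
    """Group each base character with the combining symbols that follow it.

    A character counts as combining if Unicode classifies it as a mark
    (categories Mn, Mc, Me) or if it appears in extra_combining, a
    hypothetical stand-in for user-supplied symbols.
    """
    clusters = []
    for ch in text:
        is_combining = (
            unicodedata.category(ch) in ("Mn", "Mc", "Me")
            or ch in extra_combining
        )
        if is_combining and clusters:
            clusters[-1] += ch  # attach to the preceding base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters
```

- <p>
- With a grouping of this kind, each cluster would get a single bounding box instead of
- one box per code point.
- </p>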
- <p>
- Automated training is provided in the latest version. Tesseract Windows training executables
- are bundled with the program; for other platforms, you will need to <a href="https://github.com/tesseract-ocr/tesseract/wiki/Compiling">
- build</a> them. Place all required source training data files, prefixed with
- an appropriate language code, in a specified directory (see the <code>samples</code>
- folder for examples). The training operation can also be automated using the enclosed
- <code>train.ps1</code> Windows PowerShell script.
- </p>
- <p>
- The <em>Merge TIFF</em> function can save multiple images containing text of the
- same font into a single multi-page TIFF file to be used for training.
- A conversion function is included to convert numeric character references (NCRs) and
- escape sequences in the <em>Character</em> text field to Unicode characters.</p>
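- <p>
- The kind of conversion described above can be sketched in Python as follows. This is
- an illustration only, not the program's implementation: the function name is
- hypothetical, and it assumes decimal or hexadecimal NCRs plus Java-style four-digit
- <code>\uXXXX</code> escapes, which may not match exactly what the <em>Character</em>
- field accepts.
- </p>

```python
import re

# One pass over the text: hexadecimal NCR, decimal NCR, or \uXXXX escape.
_REFERENCE = re.compile(
    r"&#[xX]([0-9A-Fa-f]+);"   # e.g. &#xE9;
    r"|&#([0-9]+);"            # e.g. &#233;
    r"|\\u([0-9A-Fa-f]{4})"    # e.g. \u00e9
)


def decode_references(text):
    """Replace NCRs and \\uXXXX escapes in text with the characters they denote."""
    def replace(match):
        hex_ncr, dec_ncr, escape = match.groups()
        if dec_ncr is not None:
            return chr(int(dec_ncr))
        return chr(int(hex_ncr or escape, 16))

    return _REFERENCE.sub(replace, text)
```

- <p>
- Text that contains none of these patterns passes through unchanged.
- </p>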
- <p>
- If you have any questions, please post them in the <a href="http://sourceforge.net/projects/vietocr/forums">
- VietOCR Forums</a>.
- </p>
- <hr />
- </div>
- </body>
- </html>