提交 cd651045 编写于 作者: T tmbdev

top-skimming import from sf.net

git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@2 d0cd1f9f-072b-0410-8dd7-cf729c803f20
上级 6e077331
BUILD
OWNERS
Makefile
README.google
runautoconf
config_auto.h
Ray Smith (lead developer) <theraysmith@users.sourceforge.net>
Phil Cheatle
Simon Crouch
Dan Johnson
Mark Seaman
Sheelagh Huddleston
Chris Newton
... and several others.
This package contains the Tesseract Open Source OCR Engine.
Orignally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Other Dependencies and Licenses:
================================
The Aspirin/MIGRAINES system is no longer used.
Tesseract can also make use of the libtiff library. (www.libtiff.org)
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.
June 2006 - V1.0 of open source Tesseract checked-in.
Sep 7 2006 - V1.01.
Added mfcpch.cpp and getopt.cpp for VC++.
Fixed problem with greyscale images and no libtiff.
Stopped debug window from being used for the usage output.
Fixed load of inttemp for big-endian architectures.
Fixed some Mac compilation issues.
Oct 4 2006 - V1.02
Removed dependency on Aspirin.
Fixed a few missing Apache license headers.
Removed $log.
Feb 2 2007 - V1.03
Added mftraining and cntraining.
Added baseapi with adaptive thresholding for grey and color.
Fixed many memory leaks.
Fixed several bugs including lack of use of adaptive classifier.
Added ifdefs to eliminate graphics code and add embedded platform support.
Incorporated several patches, including 64-bit builds, Mac builds.
Minor accuracy improvements.
Copyright 1994, 1995, 1996, 1999, 2000, 2001, 2002 Free Software
Foundation, Inc.
This file is free documentation; the Free Software Foundation gives
unlimited permission to copy, distribute and modify it.
Basic Installation
==================
These are generic installation instructions.
The `configure' shell script attempts to guess correct values for
various system-dependent variables used during compilation. It uses
those values to create a `Makefile' in each directory of the package.
It may also create one or more `.h' files containing system-dependent
definitions. Finally, it creates a shell script `config.status' that
you can run in the future to recreate the current configuration, and a
file `config.log' containing compiler output (useful mainly for
debugging `configure').
It can also use an optional file (typically called `config.cache'
and enabled with `--cache-file=config.cache' or simply `-C') that saves
the results of its tests to speed up reconfiguring. (Caching is
disabled by default to prevent problems with accidental use of stale
cache files.)
If you need to do unusual things to compile the package, please try
to figure out how `configure' could check whether to do them, and mail
diffs or instructions to the address given in the `README' so they can
be considered for the next release. If you are using the cache, and at
some point `config.cache' contains results you don't want to keep, you
may remove or edit it.
The file `configure.ac' (or `configure.in') is used to create
`configure' by a program called `autoconf'. You only need
`configure.ac' if you want to change it or regenerate `configure' using
a newer version of `autoconf'.
The simplest way to compile this package is:
1. `cd' to the directory containing the package's source code and type
`./configure' to configure the package for your system. If you're
using `csh' on an old version of System V, you might need to type
`sh ./configure' instead to prevent `csh' from trying to execute
`configure' itself.
Running `configure' takes awhile. While running, it prints some
messages telling which features it is checking for.
2. Type `make' to compile the package.
3. Optionally, type `make check' to run any self-tests that come with
the package.
4. Type `make install' to install the programs and any data files and
documentation.
5. You can remove the program binaries and object files from the
source code directory by typing `make clean'. To also remove the
files that `configure' created (so you can compile the package for
a different kind of computer), type `make distclean'. There is
also a `make maintainer-clean' target, but that is intended mainly
for the package's developers. If you use it, you may have to get
all sorts of other programs in order to regenerate files that came
with the distribution.
Compilers and Options
=====================
Some systems require unusual options for compilation or linking that
the `configure' script does not know about. Run `./configure --help'
for details on some of the pertinent environment variables.
You can give `configure' initial values for configuration parameters
by setting variables in the command line or in the environment. Here
is an example:
./configure CC=c89 CFLAGS=-O2 LIBS=-lposix
*Note Defining Variables::, for more details.
Compiling For Multiple Architectures
====================================
You can compile the package for more than one kind of computer at the
same time, by placing the object files for each architecture in their
own directory. To do this, you must use a version of `make' that
supports the `VPATH' variable, such as GNU `make'. `cd' to the
directory where you want the object files and executables to go and run
the `configure' script. `configure' automatically checks for the
source code in the directory that `configure' is in and in `..'.
If you have to use a `make' that does not support the `VPATH'
variable, you have to compile the package for one architecture at a
time in the source code directory. After you have installed the
package for one architecture, use `make distclean' before reconfiguring
for another architecture.
Installation Names
==================
By default, `make install' will install the package's files in
`/usr/local/bin', `/usr/local/man', etc. You can specify an
installation prefix other than `/usr/local' by giving `configure' the
option `--prefix=PATH'.
You can specify separate installation prefixes for
architecture-specific files and architecture-independent files. If you
give `configure' the option `--exec-prefix=PATH', the package will use
PATH as the prefix for installing programs and libraries.
Documentation and other data files will still use the regular prefix.
In addition, if you use an unusual directory layout you can give
options like `--bindir=PATH' to specify different values for particular
kinds of files. Run `configure --help' for a list of the directories
you can set and what kinds of files go in them.
If the package supports it, you can cause programs to be installed
with an extra prefix or suffix on their names by giving `configure' the
option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'.
Optional Features
=================
Some packages pay attention to `--enable-FEATURE' options to
`configure', where FEATURE indicates an optional part of the package.
They may also pay attention to `--with-PACKAGE' options, where PACKAGE
is something like `gnu-as' or `x' (for the X Window System). The
`README' should mention any `--enable-' and `--with-' options that the
package recognizes.
For packages that use the X Window System, `configure' can usually
find the X include and library files automatically, but if it doesn't,
you can use the `configure' options `--x-includes=DIR' and
`--x-libraries=DIR' to specify their locations.
Specifying the System Type
==========================
There may be some features `configure' cannot figure out
automatically, but needs to determine by the type of machine the package
will run on. Usually, assuming the package is built to be run on the
_same_ architectures, `configure' can figure that out, but if it prints
a message saying it cannot guess the machine type, give it the
`--build=TYPE' option. TYPE can either be a short name for the system
type, such as `sun4', or a canonical name which has the form:
CPU-COMPANY-SYSTEM
where SYSTEM can have one of these forms:
OS KERNEL-OS
See the file `config.sub' for the possible values of each field. If
`config.sub' isn't included in this package, then this package doesn't
need to know the machine type.
If you are _building_ compiler tools for cross-compiling, you should
use the `--target=TYPE' option to select the type of system they will
produce code for.
If you want to _use_ a cross compiler, that generates code for a
platform different from the build platform, you should specify the
"host" platform (i.e., that on which the generated programs will
eventually be run) with `--host=TYPE'.
Sharing Defaults
================
If you want to set default values for `configure' scripts to share,
you can create a site shell script called `config.site' that gives
default values for variables like `CC', `cache_file', and `prefix'.
`configure' looks for `PREFIX/share/config.site' if it exists, then
`PREFIX/etc/config.site' if it exists. Or, you can set the
`CONFIG_SITE' environment variable to the location of the site script.
A warning: not all `configure' scripts look for a site script.
Defining Variables
==================
Variables not defined in a site shell script can be set in the
environment passed to `configure'. However, some packages may run
configure again during the build, and the customized values of these
variables may be lost. In order to avoid this problem, you should set
them in the `configure' command line, using `VAR=value'. For example:
./configure CC=/usr/local2/bin/gcc
will cause the specified gcc to be used as the C compiler (unless it is
overridden in the site shell script).
`configure' Invocation
======================
`configure' recognizes the following options to control how it
operates.
`--help'
`-h'
Print a summary of the options to `configure', and exit.
`--version'
`-V'
Print the version of Autoconf used to generate the `configure'
script, and exit.
`--cache-file=FILE'
Enable the cache: use and save the results of the tests in FILE,
traditionally `config.cache'. FILE defaults to `/dev/null' to
disable caching.
`--config-cache'
`-C'
Alias for `--cache-file=config.cache'.
`--quiet'
`--silent'
`-q'
Do not print messages saying which checks are being made. To
suppress all normal output, redirect it to `/dev/null' (any error
messages will still be shown).
`--srcdir=DIR'
Look for the package's source code in directory DIR. Usually
`configure' can determine that directory automatically.
`configure' also accepts some other, not widely useful, options. Run
`configure --help' for more details.
# TODO(luc) Add 'doc' to this list when ready
SUBDIRS = ccstruct ccutil classify cutil dict display image textord viewer wordrec ccmain training
EXTRA_DIST = tessdata phototest.tif tesseract.dsp tesseract.dsw
#EXTRA_DIST = doc/html doc/@PACKAGE_NAME@_@PACKAGE_VERSION@.pdf doc/@PACKAGE_NAME@_@PACKAGE_VERSION@.ps.gz
dist-hook:
# Need to remove CVS directories from directories
# added using EXTRA_DIST. $(distdir)/tessdata would in
# theory suffice.
rm -rf `find $(distdir) -name CVS`
# Also remove extra files not needed in a distribution
rm -rf `find $(distdir) -name configure.ac`
rm -rf `find $(distdir) -name acinclude.m4`
rm -rf `find $(distdir) -name aclocal.m4`
此差异已折叠。
Stub file. To be populated at a later stage.
Introduction
============
This package contains the Tesseract Open Source OCR Engine.
Orignally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Other Dependencies and Licenses:
================================
The Aspirin/MIGRAINES system is no longer required.
Tesseract can also make use of the libtiff library. (www.libtiff.org)
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.
History:
========
The engine was developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc2.95 and under Windows
with VC++6. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficent than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug. Another "feature" of the C/C++ split is that the C++
data structures get converted to C data structures to call the low-level C
code. This is ugly, and the C++izing of the C code is a step towards
eliminating the conversion, but it has not happened yet.
Directory Structure (ordered by dependency):
============================================
ccmain Top-level code. The main program resides in tesseractmain.cpp.
display An "editor" to view and operate on the internal structures.
(Requires a working viewer - batteries not included.)
wordrec The word-level recognizer.
textord The module that organizes(orders) text into lines and words.
classify The low-level character classifiers.
ccstruct Classes to hold information about a page as it is being processed.
viewer The client side of a client server viewing system.
Unfortunately, at this time, the server side is not available.
image Image class and processing functions.
dict Language model code.
cutil Code for file I/O, lists, heaps etc, from the old C code.
ccutil Somewhat newer code for lists, memory allocation etc from the
old C++ code.
About the Engine
================
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT
FORMATTING, and NO UI. It can only process an image of a single column
and create text from it. It can detect fixed pitch vs proportional text.
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows. Another current
limitation is that it only recognizes English and its character set is only
US-ASCII. Training code IS included in the open source release however, and
will be included in a future release.
Using the Engine
================
The usage of both Windows and Linux versions is the same.
The executable must reside in the same directory as the tessdata directory
The command line is:
tesseract <image.tif> <output> batch
The image file requires an .tif extension for its type to be recognized
correctly. If a file exists with the .tif extension replaced by .uzn, then it
will be interpreted as a UNLV-style zone file. (See www.isri.unlv.edu for
details of the zone files.)
Tesseract release notes Feb 2, 2007 - V1.03.
Added mftraining and cntraining. Using an image with a box file, tesseract
generates .tr output files. cntraining runs on the .tr files to make
normproto that lives in tessdata. mftraining runs on the .tr files to
make inttemp and pffmtable in tessdata. These are the main data files
that tesseract uses to recognize characters. At present, the code to make
dictionary files is not yet available, nor are any sample box files or
rebuilt inttemp or documentation to create any of these. Recognition is
still limited to the ASCII set, but when this problem is fixed, documentation
will follow.
Added a new API with adaptive thresholding for grey and color images.
See ccmain/baseapi.h/cpp for details. The main program has been converted
to use the API as an example. See main() in ccmain/tesseractmain.cpp for
details. The API is designed to make it easy to add subclasses with ability
to output the bounding boxes etc from the internal structures. The adaptive
thresholding improves accuracy (most of the time) on non-binary images.
Many memory leaks have been fixed. There are no known leaks left from using
the API correctly.
The adaptive classifier was not operating correctly. This bug, and several
others have been fixed, including poor chopping, an indefinite (if not quite
infinite) loop in the number parser, and a couple of crash bugs. Thanks to
all that have contributed bugs and bug fixes.
It is now possible to build without any of the graphics support to save code
size using #define GRAPHICS_DISABLED. There is also a new EMBEDDED define
for use on operating systems with limited library support.
64-bit and Mac OSX buildability is now included in the mainline source tree.
Thanks to all that have contributed patches and comments to help with that.
1.03 is also endian-independent, apart from the tiff i/o, so if you use
libtiff, the code should run on all platforms, even if you get/create new
data files of a different endinanness.
Some of the bug fixes improve accuracy, and so do some of the changes to
DangAmbigs and user-words.
Tesseract release notes, Oct 4 2006 - V1.02.
Removed dependency on aspirin. *All* code is now licensed under Apache2.0.
Tesseract release notes, Sep 7 2006 - V1.01.
Fixes for this release:
Added mfcpch.cpp and getopt.cpp for VC++.
Fixed problem with greyscale images and no libtiff.
Stopped debug window from being used for the usage output.
Fixed load of inttemp for big-endian architectures.
Fixed some Mac compilation issues.
This version should read uncompressed 8 bit grey and 24 bit color tiffs
without having to have libtiff. It does a dumb threshold though, so don't
expect good results from poor contrast or images of natural scenes etc.
If you just run tesseract with no command line args you should now get a
sensible usage message on linux, with or without X-windows.
If you can get it to compile on a PPC Mac, it may now run correctly,
although not all the build issues are fixed yet.
Building Tesseract:
Windows:
Unpack the tar.gz archive
Open tesseract.dsw in DevStudio (preferably version 6, higher versions will be more difficult)
Set Win32 - Release as the active configuration.
Build.
Copy tesseract.exe from bin.rel up one directory level.
Run tesseract phototest.tif phototest
This will create phototest.txt.
Linux:
Unpack the tar.gz archive
./configure
make
Copy tesseract from ccmain up one directory level (or create a symbolic link)
Run tesseract phototest.tif phototest
This will create phototest.txt.
# Master include for AC macros. This directory structure allows
# for more flexibility with respect to CVS modules.
#
# Author: Luc Vincent
### m4_include(config/ac_compile_check_sizeof.m4)dnl
#m4_include(config/ac_create_stdint_h.m4)dnl
#m4_include(config/ax_create_stdint_h.m4)dnl
m4_include(config/ac_define_versionlevel.m4)dnl
m4_include(config/acinclude_custom.m4)dnl
此差异已折叠。
SUBDIRS =
AM_CPPFLAGS = \
-I$(top_srcdir)/ccutil -I$(top_srcdir)/ccstruct \
-I$(top_srcdir)/image -I$(top_srcdir)/viewer \
-I$(top_srcdir)/ccops -I$(top_srcdir)/dict \
-I$(top_srcdir)/classify -I$(top_srcdir)/display \
-I$(top_srcdir)/wordrec -I$(top_srcdir)/cutil \
-I$(top_srcdir)/textord
EXTRA_DIST = \
adaptions.h applybox.h baseapi.h blobcmp.h \
callnet.h charcut.h \
control.h docqual.h expandblob.h fixspace.h fixxht.h \
imgscale.h matmatch.h output.h paircmp.h reject.h scaleimg.h \
tessbox.h tessedit.h tesseractmain.h tessvars.h tfacep.h \
tessembedded.h tfacepp.h tstruct.h werdit.h
noinst_LIBRARIES = libtesseract_main.a
libtesseract_main_a_SOURCES = \
tessedit.cpp adaptions.cpp applybox.cpp \
baseapi.cpp blobcmp.cpp \
callnet.cpp charcut.cpp charsample.cpp control.cpp \
docqual.cpp expandblob.cpp fixspace.cpp fixxht.cpp \
imgscale.cpp matmatch.cpp output.cpp paircmp.cpp \
reject.cpp scaleimg.cpp tessbox.cpp tessvars.cpp \
tfacepp.cpp tstruct.cpp werdit.cpp
bin_PROGRAMS = tesseract
tesseract_SOURCES = tesseractmain.cpp
tesseract_LDADD = \
libtesseract_main.a \
../display/libtesseract_display.a \
../textord/libtesseract_textord.a \
../wordrec/libtesseract_wordrec.a \
../classify/libtesseract_classify.a \
../dict/libtesseract_dict.a \
../viewer/libtesseract_viewer.a \
../image/libtesseract_image.a \
../cutil/libtesseract_cutil.a \
../ccstruct/libtesseract_ccstruct.a \
../ccutil/libtesseract_ccutil.a
此差异已折叠。
此差异已折叠。
/**********************************************************************
* File: adaptions.h (Formerly adaptions.h)
* Description: Functions used to adapt to blobs already confidently
* identified
* Author: Chris Newton
* Created: Thu Oct 7 10:17:28 BST 1993
*
* (C) Copyright 1992, Hewlett-Packard Ltd.
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
*
**********************************************************************/
#ifndef ADAPTIONS_H
#define ADAPTIONS_H
#include "charsample.h"
#include "charcut.h"
#include "notdll.h"
extern BOOL_VAR_H (tessedit_reject_ems, FALSE, "Reject all m's");
extern BOOL_VAR_H (tessedit_reject_suspect_ems, FALSE, "Reject suspect m's");
extern double_VAR_H (tessedit_cluster_t1, 0.20,
"t1 threshold for clustering samples");
extern double_VAR_H (tessedit_cluster_t2, 0.40,
"t2 threshold for clustering samples");
extern double_VAR_H (tessedit_cluster_t3, 0.12,
"Extra threshold for clustering samples, only keep a new sample if best score greater than this value");
extern double_VAR_H (tessedit_cluster_accept_fraction, 0.80,
"Largest fraction of characters in cluster for it to be used for adaption");
extern INT_VAR_H (tessedit_cluster_min_size, 3,
"Smallest number of samples in a cluster for it to be used for adaption");
extern BOOL_VAR_H (tessedit_cluster_debug, FALSE,
"Generate and print debug information for adaption by clustering");
extern BOOL_VAR_H (tessedit_use_best_sample, FALSE,
"Use best sample from cluster when adapting");
extern BOOL_VAR_H (tessedit_test_cluster_input, FALSE,
"Set reject map to enable cluster input to be measured");
extern BOOL_VAR_H (tessedit_matrix_match, TRUE, "Use matrix matcher");
extern BOOL_VAR_H (tessedit_old_matrix_match, FALSE, "Use matrix matcher");
extern BOOL_VAR_H (tessedit_mm_use_non_adaption_set, FALSE,
"Don't try to adapt to characters on this list");
extern STRING_VAR_H (tessedit_non_adaption_set, ",.;:'~@*",
"Characters to be avoided when adapting");
extern BOOL_VAR_H (tessedit_mm_adapt_using_prototypes, TRUE,
"Use prototypes when adapting");
extern BOOL_VAR_H (tessedit_mm_use_prototypes, TRUE,
"Use prototypes as clusters are built");
extern BOOL_VAR_H (tessedit_mm_use_rejmap, FALSE,
"Adapt to characters using reject map");
extern BOOL_VAR_H (tessedit_mm_all_rejects, FALSE,
"Adapt to all characters using, matrix matcher");
extern BOOL_VAR_H (tessedit_mm_only_match_same_char, FALSE,
"Only match samples against clusters for the same character");
extern BOOL_VAR_H (tessedit_process_rns, FALSE, "Handle m - rn ambigs");
extern BOOL_VAR_H (tessedit_demo_adaption, FALSE,
"Display cut images and matrix match for demo purposes");
extern INT_VAR_H (tessedit_demo_word1, 62,
"Word number of first word to display");
extern INT_VAR_H (tessedit_demo_word2, 64,
"Word number of second word to display");
extern STRING_VAR_H (tessedit_demo_file, "academe",
"Name of document containing demo words");
BOOL8 word_adaptable( //should we adapt?
WERD_RES *word,
UINT16 mode);
void collect_ems_for_adaption(WERD_RES *word,
CHAR_SAMPLES_LIST *char_clusters,
CHAR_SAMPLE_LIST *chars_waiting);
void collect_characters_for_adaption(WERD_RES *word,
CHAR_SAMPLES_LIST *char_clusters,
CHAR_SAMPLE_LIST *chars_waiting);
void cluster_sample(CHAR_SAMPLE *sample,
CHAR_SAMPLES_LIST *char_clusters,
CHAR_SAMPLE_LIST *chars_waiting);
void check_wait_list(CHAR_SAMPLE_LIST *chars_waiting,
CHAR_SAMPLE *sample,
CHAR_SAMPLES *best_cluster);
void complete_clustering(CHAR_SAMPLES_LIST *char_clusters,
CHAR_SAMPLE_LIST *chars_waiting);
void adapt_to_good_ems(WERD_RES *word,
CHAR_SAMPLES_LIST *char_clusters,
CHAR_SAMPLE_LIST *chars_waiting);
void adapt_to_good_samples(WERD_RES *word,
CHAR_SAMPLES_LIST *char_clusters,
CHAR_SAMPLE_LIST *chars_waiting);
void print_em_stats(CHAR_SAMPLES_LIST *char_clusters,
CHAR_SAMPLE_LIST *chars_waiting);
//lines of the image
CHAR_SAMPLE *clip_sample(PIXROW *pixrow,
IMAGELINE *imlines,
BOX pix_box, //box of imlines extent
BOOL8 white_on_black,
char c);
void display_cluster_prototypes(CHAR_SAMPLES_LIST *char_clusters);
void reject_all_ems(WERD_RES *word);
void reject_all_fullstops(WERD_RES *word);
void reject_suspect_ems(WERD_RES *word);
void reject_suspect_fullstops(WERD_RES *word);
BOOL8 suspect_em(WERD_RES *word, INT16 index);
BOOL8 suspect_fullstop(WERD_RES *word, INT16 i);
#endif
此差异已折叠。
/**********************************************************************
* File: applybox.h (Formerly applybox.h)
* Description: Re segment rows according to box file data
* Author: Phil Cheatle
* Created: Wed Nov 24 09:11:23 GMT 1993
*
* (C) Copyright 1993, Hewlett-Packard Ltd.
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
*
**********************************************************************/
#ifndef APPLYBOX_H
#define APPLYBOX_H
#include "varable.h"
#include "ocrblock.h"
#include "ocrrow.h"
#include "notdll.h"
extern BOOL_VAR_H (applybox_rebalance, TRUE, "Drop dead");
extern INT_VAR_H (applybox_debug, 0, "Debug level");
extern STRING_VAR_H (applybox_test_exclusions, "|",
"Chars ignored for testing");
extern double_VAR_H (applybox_error_band, 0.15, "Err band as fract of xht");
void apply_boxes(BLOCK_LIST *block_list //real blocks
);
void clear_any_old_text( //remove correct text
BLOCK_LIST *block_list //real blocks
);
BOOL8 read_next_box(FILE* box_file, //
BOX *box,
char *ch);
ROW *find_row_of_box( //
BLOCK_LIST *block_list, //real blocks
BOX box, //from boxfile
INT16 &block_id,
INT16 &row_id_to_process);
INT16 resegment_box( //
ROW *row,
BOX box,
char *ch,
INT16 block_id,
INT16 row_id,
INT16 boxfile_lineno,
INT16 boxfile_charno);
void tidy_up( //
BLOCK_LIST *block_list, //real blocks
INT16 &ok_char_count,
INT16 &ok_row_count,
INT16 &unlabelled_words,
INT16 *tgt_char_counts,
INT16 &rebalance_count,
char &min_char,
INT16 &min_samples,
INT16 &final_labelled_blob_count);
void report_failed_box(INT16 boxfile_lineno,
INT16 boxfile_charno,
BOX box,
char *box_ch,
const char *err_msg);
void apply_box_training(BLOCK_LIST *block_list);
void apply_box_testing(BLOCK_LIST *block_list);
#endif
此差异已折叠。
///////////////////////////////////////////////////////////////////////
// File: baseapi.h
// Description: Simple API for calling tesseract.
// Author: Ray Smith
// Created: Fri Oct 06 15:35:01 PDT 2006
//
// (C) Copyright 2006, Google Inc.
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
///////////////////////////////////////////////////////////////////////
#ifndef THIRD_PARTY_TESSERACT_CCMAIN_BASEAPI_H__
#define THIRD_PARTY_TESSERACT_CCMAIN_BASEAPI_H__
#include <string>
#include "host.h"
#include "ocrclass.h"
class PAGE_RES;
class BLOCK_LIST;
// Base class for all tesseract APIs.
// Specific classes can add ability to work on different inputs or produce
// different outputs.
class TessBaseAPI {
public:
// Start tesseract.
// The datapath must be the name of the data directory or some other file
// in which the data directory resides (for instance argv[0].)
// The configfile is the name of a file in the tessconfigs directory
// (eg batch) or NULL to run on defaults.
// Outputbase may also be NULL, and is the basename of various output files.
// If the output of any of these files is enabled, then a name must be given.
// If numeric_mode is true, only possible digits and roman numbers are
// returned. Returns 0 if successful. Crashes if not.
// The argc and argv may be 0 and NULL respectively. They are used for
// providing config files for debug/display purposes.
// TODO(rays) get the facts straight. Is it OK to call
// it more than once? Make it properly check for errors and return them.
static int Init(const char* datapath, const char* outputbase,
const char* configfile, bool numeric_mode,
int argc, char* argv[]);
// Recognize a rectangle from an image and return the result as a string.
// May be called many times for a single Init.
// Currently has no error checking.
// Greyscale of 8 and color of 24 or 32 bits per pixel may be given.
// Palette color images will not work properly and must be converted to
// 24 bit.
// Binary images of 1 bit per pixel may also be given but they must be
// byte packed with the MSB of the first byte being the first pixel, and a
// 1 represents WHITE. For binary images set bytes_per_pixel=0.
// The recognized text is returned as a char* which (in future will be coded
// as UTF8 and) must be freed with the delete [] operator.
static char* TesseractRect(const UINT8* imagedata,
int bytes_per_pixel,
int bytes_per_line,
int left, int top, int width, int height);
// Call between pages or documents etc to free up memory and forget
// adaptive data.
static void ClearAdaptiveClassifier();
// Close down tesseract and free up memory.
static void End();
// Dump the internal binary image to a PGM file.
static void DumpPGM(const char* filename);
protected:
// Copy the given image rectangle to Tesseract, with adaptive thresholding
// if the image is not already binary.
static void CopyImageToTesseract(const UINT8* imagedata,
int bytes_per_pixel,
int bytes_per_line,
int left, int top, int width, int height);
// Compute the Otsu threshold(s) for the given image rectangle, making one
// for each channel. Each channel is always one byte per pixel.
// Returns an array of threshold values and an array of hi_values, such
// that a pixel value >threshold[channel] is considered foreground if
// hi_values[channel] is 0 or background if 1. A hi_value of -1 indicates
// that there is no apparent foreground. At least one hi_value will not be -1.
// thresholds and hi_values are assumed to be of bytes_per_pixel size.
static void OtsuThreshold(const UINT8* imagedata,
int bytes_per_pixel,
int bytes_per_line,
int left, int top, int right, int bottom,
int* thresholds,
int* hi_values);
// Compute the histogram for the given image rectangle, and the given
// channel. (Channel pointed to by imagedata.) Each channel is always
// one byte per pixel.
// Bytes per pixel is used to skip channels not being
// counted with this call in a multi-channel (pixel-major) image.
// Histogram is always a 256 element array to count occurrences of
// each pixel value.
static void HistogramRect(const UINT8* imagedata,
int bytes_per_pixel,
int bytes_per_line,
int left, int top, int right, int bottom,
int* histogram);
// Compute the Otsu threshold(s) for the given histogram.
// Also returns H = total count in histogram, and
// omega0 = count of histogram below threshold.
static int OtsuStats(const int* histogram,
int* H_out,
int* omega0_out);
// Threshold the given grey or color image into the tesseract global
// image ready for recognition. Requires thresholds and hi_value
// produced by OtsuThreshold above.
static void ThresholdRect(const UINT8* imagedata,
int bytes_per_pixel,
int bytes_per_line,
int left, int top,
int width, int height,
const int* thresholds,
const int* hi_values);
// Cut out the requested rectangle of the binary image to the
// tesseract global image ready for recognition.
static void CopyBinaryRect(const UINT8* imagedata,
int bytes_per_line,
int left, int top,
int width, int height);
// Low-level function to recognize the current global image to a string.
static char* RecognizeToString();
// Find lines from the image making the BLOCK_LIST.
static void FindLines(BLOCK_LIST* block_list);
// Recognize the tesseract global image and return the result as Tesseract
// internal structures.
static PAGE_RES* Recognize(BLOCK_LIST* block_list, ETEXT_DESC* monitor);
// Convert (and free) the internal data structures into a text string.
static char* TesseractToText(PAGE_RES* page_res);
};
#endif // THIRD_PARTY_TESSERACT_CCMAIN_BASEAPI_H__
此差异已折叠。
/**********************************************************************
* File: blobcmp.c
* Description: Code to compare blobs using the adaptive matcher.
* Author: Ray Smith
* Created: Wed Apr 21 09:28:51 BST 1993
*
* (C) Copyright 1993, Hewlett-Packard Ltd.
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
*
**********************************************************************/
#ifndef BLOBCMP_H
#define BLOBCMP_H
#include "tstruct.h"
float compare_tess_blobs(TBLOB *blob1,
TEXTROW *row1,
TBLOB *blob2,
TEXTROW *row2);
#endif
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册