HADARA80P

An Historical Handwritten Arabic Dataset for Segmentation-Free Word Spotting

This dataset was created within the scope of the HADARA project, which focuses on the analysis of historical handwritten Arabic documents [1]. It is based on the historical handwritten book بَذلُ المَاعُون فِي فَضلُ الطَاعُون, whose title might be translated as "About the advantage of the plague". The manuscript was published in the sixth month of 833 AH (Islamic calendar), which corresponds to February 1430 AD. The set includes the first 80 textual pages¹ plus one inlay cover. All pages in the dataset, except for the inlay cover and the first page, which contains a short description, have a single main text block, occasionally accompanied by side notes. The text is primarily written in black ink, with emphasized words in red.

Data Format and Properties

All pages are provided as standard 48-bit RGB TIFF images with 16 bits per color channel. The dimensions are consistently 2882×3650 pixels, which corresponds to roughly 380 dpi. The images are stored with lossless Deflate compression, resulting in a file size of 51 MiB per image.
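
The figures above can be cross-checked with a little arithmetic. The following sketch derives the uncompressed pixel payload of one page and its approximate physical dimensions from the stated pixel dimensions, bit depth and resolution. The constant and function names are illustrative only and are not part of the dataset.

```python
# Values taken from the description above.
WIDTH_PX = 2882
HEIGHT_PX = 3650
CHANNELS = 3            # RGB
BYTES_PER_SAMPLE = 2    # 16 bits per color channel
DPI = 380               # approximate, as stated above
MIB = 1024 ** 2


def uncompressed_mib() -> float:
    """Raw pixel payload of one page in MiB (ignores TIFF headers)."""
    return WIDTH_PX * HEIGHT_PX * CHANNELS * BYTES_PER_SAMPLE / MIB


def physical_size_cm() -> tuple:
    """Approximate scanned width and height in centimetres at DPI."""
    return WIDTH_PX / DPI * 2.54, HEIGHT_PX / DPI * 2.54


if __name__ == "__main__":
    print(f"uncompressed: {uncompressed_mib():.1f} MiB")
    w, h = physical_size_cm()
    print(f"approx. scanned area: {w:.1f} cm x {h:.1f} cm")
```

The raw payload comes to roughly 60 MiB, so the stated 51 MiB corresponds to a saving of about 15 % through compression.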

Accompanying the images is an XML file containing the ground truth. Pages, text blocks, text lines and individual words are annotated. Pages and text blocks are rectangular, while text lines and words are described by detailed polygonal coordinates. For each word and text line, a narrow UTF-8 transcription created by native Arabic speakers is provided. Where appropriate, a number of tags describing linguistic or writing-related properties are assigned to each such word occurrence. The transcription process followed strict rules, which are described in [2].
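
As an illustration of how polygonal word annotations might be used, the sketch below reads word polygons from an XML snippet and reduces each to an axis-aligned bounding box, for example to crop the word from a page image. The element name `word` and the attributes `text` and `points` are hypothetical and were chosen for this example only; the actual schema must be taken from the XML file shipped with the dataset.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL snippet: element and attribute names are invented for
# illustration and do not reflect the real HADARA80P schema.
SAMPLE = """<page>
  <word text="example" points="10,20 60,18 62,45 12,48"/>
</page>"""


def parse_points(points: str) -> list:
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(polygon: list) -> tuple:
    """Axis-aligned box (left, top, right, bottom) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def word_boxes(xml_text: str) -> list:
    """Collect (transcription, bounding box) for every <word> element."""
    root = ET.fromstring(xml_text)
    return [
        (w.get("text"), bounding_box(parse_points(w.get("points"))))
        for w in root.iter("word")
    ]


if __name__ == "__main__":
    print(word_boxes(SAMPLE))
```

A bounding box discards the polygon's exact outline, so it is suitable for cropping but not for pixel-accurate masking of neighbouring words.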

Additional Data and Baseline Results

Also included in this dataset are tag statistics and the exact word counts, the latter both with and without prior stripping of optional diacritic symbols. Additionally, the results of the HADARA and Ulysse [3] word spotting systems, obtained using a fixed set of 25 keywords, are provided together with a detailed evaluation as described in [2]. They may serve as a baseline for comparison against other word spotting systems.
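
To show why word counts differ with and without diacritics, the sketch below counts words before and after removing Arabic diacritic marks. It assumes that "optional diacritic symbols" means the code points U+064B to U+0652 (tanwin, short vowels, shadda, sukun); the rules actually used for the shipped statistics are those of [2] and may differ. The function names are illustrative.

```python
from collections import Counter

# ASSUMPTION: "optional diacritics" is taken to mean the Arabic marks
# U+064B..U+0652 (tanwin, short vowels, shadda, sukun). The rules used
# for the shipped statistics are defined in [2] and may differ.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(word: str) -> str:
    """Remove the marks listed in DIACRITICS, keeping the base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def word_counts(words, strip=False) -> Counter:
    """Count word occurrences, optionally after stripping diacritics."""
    return Counter(strip_diacritics(w) if strip else w for w in words)


if __name__ == "__main__":
    # The same word written with and without vowel marks.
    words = ["\u0628\u064e\u0630\u0644\u064f", "\u0628\u0630\u0644"]
    print(len(word_counts(words)))              # distinct forms as written
    print(len(word_counts(words, strip=True)))  # distinct forms after stripping
```

Two spellings that differ only in vowel marks count as separate words in the first case and as one word in the second.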

Mean average precision (MAP) for the Ulysse and HADARA word spotting systems on the HADARA80P dataset

System   pIR    γLA
HADARA   0.42   0.31
Ulysse   0.35   0.27
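
For readers unfamiliar with the measure, the sketch below computes mean average precision in its common textbook form from ranked lists of binary relevance judgements. It is not the evaluation procedure of [2] or [4], which works without hard match decisions; the function names and the toy rankings are illustrative, and the resulting numbers are unrelated to the table above.

```python
def average_precision(relevance) -> float:
    """Textbook AP of one ranked result list.

    `relevance` holds 1 for a relevant hit and 0 otherwise, best match
    first. All relevant items are assumed to appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings) -> float:
    """Mean of the per-keyword AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    toy = [[1, 0, 1], [0, 1]]  # two made-up keyword queries
    print(round(mean_average_precision(toy), 3))
```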

Directory Structure

The directory structure of the dataset tarball is as follows:

root
  ├─baseline
  │   ├─hadara
  │   │   └─graphs
  │   │       └─tikz
  │   └─ulysse
  │       └─graphs
  │           └─tikz
  ├─images
  ├─keywords
  │   ├─query_by_image
  │   ├─query_by_polygon
  │   └─query_by_string
  └─stats

The root directory contains this README, license information, and the main XML file containing the annotation of the book. All page images are stored in images. The directory stats contains text files with word frequencies, both with and without prior stripping of diacritic symbols, and statistics about the usage of the linguistic and writing-related tags. The keywords directory contains the 25 keywords in three different representations: as images, as coordinate sets from the main document, and as transcription strings. Finally, baseline contains the word spotting results of the HADARA and Ulysse word spotters on the dataset. Each of these two subdirectories contains a .result file holding an XML tree that describes the obtained matches, the files values.csv and values.xlsx with different quality measures obtained during evaluation of the results, and a directory graphs. This directory contains precision-recall graphs and histograms of the distribution of the different quality measures described in [4] for each keyword. The tikz subdirectory contains alternative forms of these graphs for easy integration into LaTeX documents.
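
After extracting the tarball, the layout described above can be checked with a short script. The sketch below is one possible way to do so; the function name and the script name in the usage comment are hypothetical, and only the directory names are taken from the tree above.

```python
from pathlib import Path

# Subdirectories taken from the tree above, relative to the dataset root.
EXPECTED_DIRS = [
    "baseline/hadara/graphs/tikz",
    "baseline/ulysse/graphs/tikz",
    "images",
    "keywords/query_by_image",
    "keywords/query_by_polygon",
    "keywords/query_by_string",
    "stats",
]


def missing_dirs(root) -> list:
    """Return the expected subdirectories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    import sys

    # Usage: python check_layout.py /path/to/extracted/dataset
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in missing_dirs(target):
        print("missing:", name)
```

The script prints nothing when every expected directory is present.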

See also

This dataset was created within the scope of the HADARA project. The HADARA Tool can view and edit the dataset out of the box and provides many useful applications, such as word spotting and transcription search.

License

HADARA80P by Institute for Communications Technology, Technische Universität Braunschweig is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You are free to use the HADARA80P dataset in accordance with this license. We appreciate references to the dataset paper [2] whenever the dataset or parts of it have been used in any way related to your work.

¹ For scanning reasons, page 55 of the manuscript is missing from the dataset.

Contact

If you have any questions, please do not hesitate to contact Werner Pantke (email).

References

  1. W. Pantke, V. Märgner, D. Fecker, T. Fingscheidt, A. Asi, O. Biller, J. El-Sana, R. Saabni, and M. Yehia, "HADARA – A software system for semi-automatic processing of historical handwritten Arabic documents," in Proc. Archiving Conf. 2013, Washington DC, USA, April 2013, pp. 161–166.
  2. W. Pantke, M. Dennhardt, D. Fecker, V. Märgner, and T. Fingscheidt, "An Historical Handwritten Arabic Dataset for Segmentation-Free Word Spotting – HADARA80P," in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, Heraklion, Greece, September 2014, pp. 15–20.
  3. CoReNum, "Ulysse - universal search engine."
  4. W. Pantke, V. Märgner, and T. Fingscheidt, "On Evaluation of Segmentation-Free Word Spotting Approaches without Hard Decisions," in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, Washington DC, USA, August 2013, pp. 1300–1304.

News

2015-08-27: Repackaged HADARA80P v1.1 due to corrupted image files (CRC error, thanks to Galal Bin Makhashen for reporting). Updated README.
2015-06-22: Released HADARA80P v1.1 containing a ground truth update (verified annotation).
 
