Modern Access to Historical Sources
A cross-border cooperation program between the Free State of Bavaria and the Czech Republic
Project 211 of the EU aims to make historical documents from Bavaria and the Czech Republic searchable by the general public and experts alike, in order to facilitate research into the history shared by the two states. Automatic analysis, indexing, and mining of archives is a fundamental part of making such archives accessible.
Indexing large databases of historical documents and making them searchable involves many technical problems, which fall into the groups described below. The initial focus of the work on handwriting analysis is the collection of over 7000 chronicles available in the Porta Fontium portal. Several aspects of this task make it unique and call for substantial innovation:
- The size of the archive, possibly over a million pages.
- The need for real-time search and use.
- The use of extremely heterogeneous documents.
- The multilingual nature of the collection.
- The presence of documents in highly inflected languages, which results in extremely large dictionaries.
Due to the above factors, employing state-of-the-art techniques
Index and Page Segmentation:
Before any understanding of a document's page is possible, the page must be partitioned into paragraphs, text lines, words, and letters.
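One step of such a partition, splitting a page into text lines, can be illustrated with a horizontal projection profile. The sketch below is only an illustration and not the project's own method. It assumes a binarized page given as a list of rows (1 = ink, 0 = background), and the function name `segment_text_lines` and the `min_ink` threshold are hypothetical. Real historical pages with skew, touching lines, or marginal notes would need considerably more robust techniques.

```python
from typing import List, Tuple


def segment_text_lines(page: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    `page` is assumed to be a binarized image: 1 = ink, 0 = background.
    A row belongs to a text line if it holds at least `min_ink` ink pixels.
    """
    # Horizontal projection profile: amount of ink in every pixel row.
    profile = [sum(row) for row in page]

    lines: List[Tuple[int, int]] = []
    start = None  # first row of the text line currently being scanned
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                  # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))   # leaving a text line
            start = None
    if start is not None:              # line touching the bottom edge
        lines.append((start, len(page)))
    return lines


# Toy page with two "text lines" separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(segment_text_lines(page))  # [(1, 3), (5, 6)]
```

The same idea applied to columns inside each detected line gives a rough word or letter split, although cursive handwriting rarely separates that cleanly.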
The identity of the scribe of a document is possibly the most important information in analysing documents.
While OCR is practically solved for modern, high-quality documents, it does not yet perform at a usable level on historical printed texts and, even worse, on handwritten texts. Word spotting is a means of making the content of documents accessible even when automatic reading is not reliable. In essence, word spotting can be thought of as a visual CTRL+F inside a document.
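The "visual CTRL+F" idea can be sketched as query-by-example matching: the user supplies an image of a word, and candidate word images are ranked by visual similarity to it. The example below is a minimal sketch, not the method used in the project. It assumes binarized word images given as lists of rows (1 = ink), describes each image by a simple column ink profile, and compares profiles with dynamic time warping (DTW), a classic word-spotting baseline. The names `column_profile`, `dtw_distance`, and `spot_word` are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # binarized word image: 1 = ink, 0 = background


def column_profile(img: Image) -> List[float]:
    """Fraction of ink pixels in each column, read left to right."""
    height = len(img)
    return [sum(row[x] for row in img) / height for x in range(len(img[0]))]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """DTW cost between two 1-D sequences, normalized by their total length."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def spot_word(query: Image, candidates: List[Image]) -> List[Tuple[float, int]]:
    """Rank candidates by distance to the query; returns (distance, index), best first."""
    q = column_profile(query)
    return sorted((dtw_distance(q, column_profile(c)), i) for i, c in enumerate(candidates))


# Toy data: `stretched` is the query written wider, `other` is a different shape.
query = [[1, 0, 1],
         [1, 0, 0]]
stretched = [[1, 1, 0, 1],
             [1, 1, 0, 0]]
other = [[0, 1, 0],
         [0, 1, 1]]

ranking = spot_word(query, [other, stretched])
print(ranking[0][1])  # 1, the stretched copy of the query ranks first
```

DTW is chosen here because it tolerates the same word being written wider or narrower, which is common across scribes. Its quadratic cost per comparison would be a concern at the scale of an archive with possibly over a million pages.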
- Porta Fontium
- FAU, University of Erlangen–Nuremberg, Pattern Recognition Lab
- University of West Bohemia, NLP group
- Non-deterministic Behavior of Ranking-based Metrics when Evaluating Embeddings (https://arxiv.org/pdf/1806.07171.pdf)