PEPS CNRS_PSL EITAB

Lattice, AOrOc, INRIA


Last modified: 10 November 2016

Extracting textual information to feed the database automatically, transfer information and keep the thesauri evolving.

The aim is to extract text fields that follow a regular format in catalogues, working together with LATTICE and the INAR MIG, in order to feed spreadsheets for our databases and atlas. The tests currently concern the CAG (“Cartes archéologiques de la Gaule”, archaeological maps of Gaul).

The cooperation between archaeological researchers, linguists and computer scientists aims to design and validate automated corpus processing, in order to reduce the time spent on manual intervention, increase the reliability of the results and facilitate interdisciplinary data sharing (statistical processing, backgrounds, themes…): in other words, to improve the study and research environment of the archaeologist.

 

 Graduated goals

How can data be gathered automatically to feed the databases? Are there already tools on the market?

Our starting observation is that we have:

  • On the one hand, a fairly precise thesaurus, in the form of lists of values associated with database sections.
  • On the other hand, site catalogues (in paper or digital format) associated with theses or publications.


First, we had to extract simple data: places, dates, structures, bibliographic references…
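
As a minimal sketch of this first step, the Python snippet below pulls author-year references and century dates out of a short notice using regular expressions. The notice text, the two patterns and the function name `extract_simple_data` are assumptions made up for illustration; they are not the project's actual tools, and real CAG notices would need richer patterns.

```python
import re

# Hypothetical notice text, invented for illustration (not taken from the CAG).
NOTICE = (
    "A villa with a hypocaust was excavated in 1987 (Dupont 1990, p. 45). "
    "Pottery of the 2nd century AD and a coin hoard of the 3rd century AD "
    "were recorded (Martin 1975)."
)

# Author-year references such as "Dupont 1990, p. 45" (the page part is optional).
CITATION = re.compile(r"\b([A-Z][a-z]+) (\d{4})(?:, p\. (\d+))?")
# Century dates such as "2nd century AD".
CENTURY = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th) century (AD|BC)\b")


def extract_simple_data(text):
    """Return the references and century dates found in a notice."""
    citations = [
        {
            "author": m.group(1),
            "year": int(m.group(2)),
            "page": int(m.group(3)) if m.group(3) else None,
        }
        for m in CITATION.finditer(text)
    ]
    datings = [
        {"century": int(m.group(1)), "era": m.group(2)}
        for m in CENTURY.finditer(text)
    ]
    return {"citations": citations, "datings": datings}


result = extract_simple_data(NOTICE)
```

Each match becomes a small record that could be written to one spreadsheet row. Patterns of this kind only work where the notation is regular, which is why this step counts as the simple one.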
 
 

Then we tried to extract character strings identical to those in our thesaurus, and we faced several kinds of difficulties:

  • Finding self-learning software able to do this: none is on the market yet.
  • Handling, in the long run, the evolution of research and the adaptation of the vocabulary to new interpretations.
  • Overcoming the drawback of a loosely codified vocabulary and a community that takes care over its writing, avoids repetition, and plays with synonyms and sentence structures. This is far removed from the biology or pharmacy texts to which this kind of study was first applied.
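
To make the matching problem concrete, here is a minimal Python sketch of exact thesaurus matching with a list of known variants. The thesaurus fragment, the sample sentence and the helper names (`normalize`, `build_index`, `match_terms`) are hypothetical and serve only as an illustration; this is plain string matching, not a self-learning tool.

```python
import re
import unicodedata

# Hypothetical thesaurus fragment: preferred term -> known variants.
THESAURUS = {
    "villa": ["villa", "villae", "rural estate"],
    "necropolis": ["necropolis", "cemetery", "burial ground"],
    "amphora": ["amphora", "amphorae"],
}


def normalize(text):
    """Lowercase, strip accents and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_accents.lower()).strip()


def build_index(thesaurus):
    """Map each normalized variant to its preferred term."""
    return {
        normalize(variant): term
        for term, variants in thesaurus.items()
        for variant in variants
    }


def match_terms(text, thesaurus):
    """Return the sorted preferred terms whose variants occur as whole words."""
    index = build_index(thesaurus)
    haystack = normalize(text)
    found = set()
    for variant, term in index.items():
        if re.search(r"\b" + re.escape(variant) + r"\b", haystack):
            found.add(term)
    return sorted(found)


sample = "A burial ground lies north of the Villa; several amphorae were found."
matches = match_terms(sample, THESAURUS)
```

This approach only finds the variants that someone has already listed. A synonym or turn of phrase missing from the list goes unnoticed, which is exactly the difficulty raised above for a vocabulary that is loosely codified and keeps evolving.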

 Analysis of coin catalogues

We first worked on highly structured texts such as coin catalogues. These include:

  • A title, a description of each side of the coin, and a bibliography
  • A set of coins, each with an inventory number, an alloy, measurements of weight, diameter and thickness, and a provenance.

Once the document structure is identified, the breakdown is easy. The last step is integrating the extracted records into the database.
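
The breakdown of a coin entry can be sketched as follows in Python. The line layout (`Inv. …; alloy; weight g; diameter mm; thickness mm; provenance`), the `Coin` record and the sample values are assumptions invented for illustration. Actual catalogues will use other layouts, so the pattern would have to be adapted once the real structure is identified.

```python
import csv
import io
import re
from dataclasses import asdict, dataclass, fields

# Hypothetical line layout, invented for illustration:
# "Inv. <number>; <alloy>; <weight> g; <diameter> mm; <thickness> mm; <provenance>"
COIN_LINE = re.compile(
    r"Inv\.\s*(?P<inventory>[\w.-]+);\s*(?P<alloy>[^;]+);\s*"
    r"(?P<weight_g>\d+(?:\.\d+)?)\s*g;\s*"
    r"(?P<diameter_mm>\d+(?:\.\d+)?)\s*mm;\s*"
    r"(?P<thickness_mm>\d+(?:\.\d+)?)\s*mm;\s*(?P<provenance>.+)"
)


@dataclass
class Coin:
    inventory: str
    alloy: str
    weight_g: float
    diameter_mm: float
    thickness_mm: float
    provenance: str


def parse_coin(line):
    """Parse one catalogue line; return None if it does not fit the layout."""
    m = COIN_LINE.fullmatch(line.strip())
    if m is None:
        return None
    return Coin(
        inventory=m["inventory"],
        alloy=m["alloy"].strip(),
        weight_g=float(m["weight_g"]),
        diameter_mm=float(m["diameter_mm"]),
        thickness_mm=float(m["thickness_mm"]),
        provenance=m["provenance"].strip(),
    )


def coins_to_csv(coins):
    """Serialize parsed coins as CSV text, ready for a spreadsheet import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(Coin)], lineterminator="\n"
    )
    writer.writeheader()
    for coin in coins:
        writer.writerow(asdict(coin))
    return buffer.getvalue()


# Sample values are made up; "Site A" stands in for a real provenance.
coin = parse_coin("Inv. 1234; bronze; 3.45 g; 18 mm; 2 mm; Site A")
csv_text = coins_to_csv([coin])
```

Lines that do not fit the expected layout return `None` instead of a partial record, so they can be set aside for manual review before anything reaches the database.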

 Analysis of an archaeological map of Gaul

Secondly, we chose annotated bibliographic volumes organised by commune of origin, and we tested several tools on them.
We have now defined an analysis toolchain and are installing it on two machines to see whether we can move from experiment to production.

The next step will be the analysis of an entire book, and our ambition is to integrate this search and extraction tool into our online publications.