The indexing process
To keep the system in good working order, it helps to understand each step the program performs. Indexing is one of the most important of these steps, and it works as follows.
The rundig script is registered with the cron daemon and runs at night once a week. It launches several processes that build the index of the sites and analyze the publications found there; the most important of these is htsciense. htsciense visits the sites listed by the user in the htsciense.conf file and begins indexing. Indexing consists of building an index of the words found in each document, recursively following the links the document contains, analyzing the document to extract information about publications, and storing that information in a text file that is later used to populate the database.
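The crawl-and-index step can be illustrated with a minimal sketch. It is not htsciense's code: the PAGES dictionary, the fetch function, and the example.org URLs are hypothetical stand-ins for real sites, and the links are followed with a queue rather than by literal recursion, which visits the same set of pages.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory "site": URL -> (text, outgoing links).
PAGES = {
    "http://example.org/": ("Research group publications", ["http://example.org/papers"]),
    "http://example.org/papers": ("Indexing scientific publications", ["http://example.org/"]),
}


def fetch(url):
    """Stand-in for a real HTTP fetch; returns (text, links) or None."""
    return PAGES.get(url)


def build_index(start_urls):
    """Crawl from the start URLs and map each word to the URLs containing it."""
    index = defaultdict(set)
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
        for link in links:
            if link not in seen:  # skip pages already queued (link cycles)
                seen.add(link)
                queue.append(link)
    return index
```

Calling build_index(["http://example.org/"]) visits both pages once, even though they link to each other, and returns a word-to-URLs mapping.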
When htsciense finds a document in HTML, PDF, PS, DOC, or a similar format, it calls an external process that runs a filter to convert the document to plain text and writes the result to a temporary file. The data are then read back from that file and analyzed to extract the most important information.
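The filter step might look roughly like the following sketch. The function name convert_to_text and the calling convention (the filter receives an input path and an output path as its last two arguments) are assumptions made for illustration; the actual filters and their arguments are not described here. DEMO_FILTER is a stand-in that simply copies its input, so that the example runs without any converter installed.

```python
import os
import subprocess
import sys
import tempfile


def convert_to_text(doc_path, filter_cmd):
    """Run an external filter as filter_cmd + [input, output]; return the text."""
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # the filter writes to the path itself
    try:
        # check=True raises CalledProcessError if the filter fails.
        subprocess.run(filter_cmd + [doc_path, tmp_path], check=True)
        with open(tmp_path, encoding="utf-8", errors="replace") as f:
            return f.read()
    finally:
        os.remove(tmp_path)  # always clean up the temporary file


# Hypothetical stand-in filter: copies the input file to the output file.
# A real setup would name a format-specific converter here instead.
DEMO_FILTER = [
    sys.executable, "-c",
    "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])",
]
```

Keeping the filter in a separate process means a converter that crashes on a malformed document does not take the indexer down with it.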
The algorithm used to analyze these files is based on probabilistic heuristics.
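The heuristics themselves are not specified here, so the following is only an illustration of what a probability-based heuristic can look like, not the algorithm htsciense uses. Each cue, each weight, and the BIAS value are invented for the example; the weights are summed as log-odds and turned into a probability with the logistic function.

```python
import math
import re

# Hypothetical cues with illustrative log-odds weights.
CUES = [
    (re.compile(r"\b(19|20)\d{2}\b"), 1.5),           # a four-digit year
    (re.compile(r"\b(pp\.|vol\.)", re.I), 1.5),       # pagination or volume marker
    (re.compile(r"\b[A-Z]\.\s*[A-Z][a-z]+"), 1.0),    # an initial followed by a surname
    (re.compile(r"\b(proceedings|journal|conference)\b", re.I), 1.5),
]
BIAS = -2.5  # assumed prior log-odds: most lines are not references


def reference_probability(line):
    """Estimate the probability that a line of text is a publication reference."""
    score = BIAS + sum(weight for pattern, weight in CUES if pattern.search(line))
    return 1.0 / (1.0 + math.exp(-score))
```

A line such as "J. Smith, Indexing the Web, Journal of Search, vol. 3, pp. 1-10, 2001." matches all four cues and scores well above 0.5, while an ordinary sentence with none of the cues stays well below it.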
While htsciense is indexing, the data file in which it records the identified scientific publications is locked with a cooperative (advisory) lock, to avoid race conditions with other ScienSe processes.
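A cooperative lock of this kind can be sketched as follows, assuming a POSIX system and the flock call; which locking mechanism ScienSe actually uses is not stated here, and the function name append_locked is hypothetical. The lock only protects against processes that also request it, which is what "cooperative" means.

```python
import fcntl
import os
import tempfile


def append_locked(path, record):
    """Append a record to the file while holding an exclusive advisory lock."""
    with open(path, "a", encoding="utf-8") as f:
        # Blocks until every other cooperating process has released the lock.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

A reader of the same file would take a shared lock (fcntl.LOCK_SH) before reading, so that it never sees a half-written record.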
The publication information is stored in a tagged format that can be read by a web application.