Expertises:
Data Science
Factsheet:
Nowadays the Internet has become a rich and invaluable source of information for businesses. To stay competitive, companies have to cope with this large amount of unstructured information. In this context, the CRAQ-Reverse project aims to provide a tool-supported methodology for web data extraction (also called web wrapping).
Web wrapping techniques transform unstructured web data sources into structured and semantically rich content that can be more easily interpreted and automatically used by computers.
The research team has developed Retroweb, a tool that generates extraction rules for web data sources (mostly web pages). The main benefit of using Retroweb is its graphical interface for analysing web pages and extracting data. Thanks to this component, Retroweb is very easy to use, even for non-technical users. The generic approach adopted by CETIC allows Retroweb to be used in many contexts and applications: customised search engines, migration of (semi-)static web sites, toolboxes for competitive intelligence, etc. Technically, Retroweb is a Java 6 application based on the Eclipse framework; it uses the Firefox rendering engine to display HTML data.
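To make the idea of an extraction rule concrete, here is a minimal sketch in Python. It is not Retroweb's rule format or implementation, which this page does not describe: the rule shape (a field name mapped to a tag and a CSS class), the `RuleWrapper` class, the `wrap` function, and the sample page are all assumptions chosen for illustration. It only shows how a set of rules can turn an HTML page into a structured record.

```python
from html.parser import HTMLParser


class RuleWrapper(HTMLParser):
    """Apply hypothetical rules of the form field -> (tag, css_class) to a page."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules    # e.g. {"name": ("h1", "product")}
        self.record = {}      # structured output: field -> extracted text
        self._field = None    # field currently being captured, if any
        self._tag = None      # tag that opened the current capture
        self._depth = 0       # nesting depth of that tag inside the capture

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Already capturing: only track nesting of the same tag.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (rule_tag, rule_class) in self.rules.items():
            # Keep only the first match for each field.
            if tag == rule_tag and rule_class in classes and field not in self.record:
                self._field, self._tag, self._depth = field, tag, 1
                self.record[field] = ""
                break

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Capture finished: collapse runs of whitespace.
                text = self.record[self._field]
                self.record[self._field] = " ".join(text.split())
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field] += data


def wrap(html, rules):
    """Return a dict of extracted fields for one HTML page."""
    parser = RuleWrapper(rules)
    parser.feed(html)
    parser.close()
    return parser.record


# Hypothetical page and rules, for illustration only.
PAGE = (
    '<html><body><h1 class="product">Widget <b>Pro</b></h1>'
    '<span class="price"> 19.90 EUR</span></body></html>'
)
RULES = {"name": ("h1", "product"), "price": ("span", "price")}
print(wrap(PAGE, RULES))  # {'name': 'Widget Pro', 'price': '19.90 EUR'}
```

In this sketch the rules are written by hand; a tool with a graphical interface would presumably let the user point at page elements and derive such rules, but that part is outside what the example covers.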
The team has also developed strong expertise in document management and search engines. They have created a toolbox for crawling documents, extracting text from any common format (doc, pdf, html, rtf, ppt, etc.), and indexing document content.
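The extract-then-index part of such a toolbox can be sketched as follows. This is an illustrative example, not CETIC's toolbox: it handles HTML only (formats such as doc, pdf, rtf, or ppt would need dedicated libraries), it skips crawling, and the names `html_to_text`, `build_index`, and `search` as well as the sample documents are assumptions. It builds a simple inverted index, mapping each term to the set of documents that contain it.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document (script/style are not filtered)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip markup and return the plain text of an HTML string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, html in documents.items():
        for term in tokenize(html_to_text(html)):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, for illustration only.
DOCS = {
    "a.html": "<p>Web data <b>extraction</b> rules</p>",
    "b.html": "<p>Search engine for web documents</p>",
}
INDEX = build_index(DOCS)
print(search(INDEX, "web extraction"))  # {'a.html'}
```

A production search engine would add ranking, stemming, and stop-word handling on top of this structure; the sketch keeps only the boolean AND lookup.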
The project ended in mid-2008 with several positive achievements. The wide range of targeted applications has led to several missions in the fields of eHealth, document management, chemistry, and database management systems. Starting from a research prototype, Retroweb has been developed into a fully functional and finalised product. To encourage use of the tool, documentation has also been a major focus.
In the CRAQ-Reverse project, CETIC acts as project leader and R&D provider. CETIC provides the tool-supported methodology and transfers its know-how to local SMEs according to their specific needs.
The expertise of the team in web data extraction, search engines and knowledge management has led to the realisation of missions in a wide range of application domains. Besides the development of its own tool for web data extraction, CETIC has notably implemented Illicopresto (Agoria), a web search engine focused on innovation in Wallonia, and ArcheWeb (DocLedge), a toolbox for competitive intelligence over the Internet.
31.01.2006
a search engine focused on Walloon innovation
18.04.2005
Searching the web
19.10.2004
Reverse-engineering
30.09.2003
This article describes a method for reverse engineering web sites. It is composed of five processes: web page classification, HTML cleaning,...
Publications
03.04.2006
Scientific publications
Estiévenart F., Meurisse J.-R., Hainaut J.-L., Thiran P., Semi-automated Extraction of Targeted Data from Web Pages, Proc. of the 22nd...
01.01.2005
Scientific publications
Thiran P., Estiévenart F., Hainaut J.-L., Houben G.-J, A Generic Framework For Extracting XML Data From Legacy Databases, Journal of Web...
08.06.2004
Scientific publications
Thiran P., Estiévenart F., Hainaut J-L., Houben G-J., Exporting Databases in XML : a Conceptual and Generic Approach, WISM’04 : Web Information...
22.09.2003
Scientific publications
Estiévenart F., François A., Henrard J., Hainaut J-L., A tool-supported method to extract data and schema from web sites, Proc. of the 5th...