Developed as an activity of the Walloon region project CETIC-CEIQS, Retroweb is a tool for data extraction from the Internet. Now that the Internet has become one of the main sources of information, this kind of tool is a must for any company.
The Internet can be considered an infinite source of information for both individuals and organizations. Nevertheless, the World Wide Web is inherently hard to use efficiently.
Notably because it is:
With Retroweb, you can quickly and visually create data-extraction programs. Executed periodically, these programs can feed your documentation management system or any internal corporate database.
Retroweb is well suited to search engines, technological intelligence, and website migration to a database or a content management system (CMS).
Retroweb is obviously not the only Internet data-extraction solution. Many scientific projects and several well-known companies work on similar solutions.
Several advantages make Retroweb different (and often better):
Retroweb is made of two complementary modules:
Retroweb-Browser is a Java 6 application built on the Eclipse RCP framework. It uses Gecko (well known as the engine behind Firefox) for web rendering, and its extraction rules are based on XPath, a W3C recommendation.
The Retroweb architecture follows the Model-View-Controller (MVC) pattern in order to reduce the amount of code written and to ease the development of new features.
Retroweb-Wrapper is a Java 6 application well suited to batch processing on a server. It takes the data-extraction rules generated by Retroweb-Browser as input and produces a structured and interpreted XML data set.
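To make the idea of XPath-based extraction rules more concrete, here is a minimal sketch of how a set of named XPath expressions could be turned into an XML record. It is not Retroweb code: the class name XPathRuleSketch, the extract method, the "record" element and the sample rules are all hypothetical, and the sketch assumes a well-formed page parsed with the JDK's own XML parser rather than with Gecko.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Retroweb.
class XPathRuleSketch {

    /** Applies each named XPath rule to a well-formed page and returns the results as XML. */
    static String extract(String xhtml, Map<String, String> rules) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document page = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // One output element per rule, named after the rule (assumed to be a valid XML name).
        Document out = builder.newDocument();
        Element record = out.createElement("record");
        out.appendChild(record);
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Element field = out.createElement(rule.getKey());
            field.setTextContent(xpath.evaluate(rule.getValue(), page).trim());
            record.appendChild(field);
        }

        // Serialize the result document to a string.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter xml = new StringWriter();
        serializer.transform(new DOMSource(out), new StreamResult(xml));
        return xml.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Retroweb</h1><p class='os'>Ubuntu</p></body></html>";
        Map<String, String> rules = new LinkedHashMap<String, String>();
        rules.put("title", "/html/body/h1");
        rules.put("platform", "//p[@class='os']");
        System.out.println(extract(page, rules));
    }
}
```

Keeping the rules as plain data (a name and an XPath expression) is what would allow one tool to author them visually and another to replay them in batch mode.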
Retroweb has been successfully tested on MS-Windows and Ubuntu GNU/Linux.
Retroweb is currently an efficient data-extraction tool for the Internet, but it will evolve along with new technologies and industry demands. We are therefore already working on several research topics:
Semantic web interoperability
One of the challenges for the future Internet is its compatibility not only with human users (e.g. better usability) but also with software agents.
The Semantic Web tackles the latter using concepts and tools to enrich web data with tractable meaning. As a semantic annotation tool for the Internet, Retroweb clearly has a role to play.
Self-healing of data extraction rules
If the HTML code of a web page is heavily modified, a data-extraction rule may no longer be valid. It is then necessary to detect the failure during the extraction process and to repair the extraction rule automatically.
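As a rough illustration of the detect-then-repair idea, the sketch below treats a rule as broken when it no longer selects any text, and "repairs" it by falling back to the first alternative expression that still works. This is a deliberately naive strategy chosen for brevity, not the approach under research; the names RuleRepairSketch, isBroken and repair are hypothetical, and the page is assumed to be well-formed.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Retroweb.
class RuleRepairSketch {

    /** Parses a well-formed page; real-world HTML would need a tolerant parser first. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    /** Assumption: a rule counts as broken when it selects no text at all. */
    static boolean isBroken(Document page, String rule) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(rule, page).trim().isEmpty();
    }

    /** Returns the first candidate rule that still selects text, or null if none does. */
    static String repair(Document page, List<String> candidates) throws XPathExpressionException {
        for (String candidate : candidates) {
            if (!isBroken(page, candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

A caller would typically try the original rule first, and only consult the list of looser candidates (for example an expression anchored on an id attribute instead of an absolute path) once isBroken reports a failure.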
Integrating Retroweb in a search engine architecture
Legacy search engines collect documents, extract their textual content and store it as an index, i.e. a compressed file of terms and the documents in which they appear. This indexing process is called “full-text” because it relies only on the syntactic content of documents. In contrast, Retroweb-Wrapper can index documents semantically, since it is able to use the meaning of the extracted data. Integrating Retroweb-Wrapper into a search engine would therefore add a valuable capability to legacy search engine architectures.
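The difference between the two kinds of index can be sketched with a small in-memory inverted index. The class below is hypothetical (InvertedIndexSketch, addText, addField and the "field:term" key convention are assumptions made for this example) and ignores compression and ranking entirely; it only shows that a full-text entry records where a term occurs, while a field-aware entry also records what the term means in that document.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Retroweb or of any search engine.
class InvertedIndexSketch {

    // Maps an index key (a term, or "field:term") to the ids of the documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    /** Full-text indexing: every term of the raw text points back to the document. */
    void addText(String docId, String text) {
        for (String term : tokenize(text)) {
            post(term, docId);
        }
    }

    /** Field-aware indexing: the key also records which extracted field held the term. */
    void addField(String docId, String field, String value) {
        for (String term : tokenize(value)) {
            post(field + ":" + term, docId);
        }
    }

    /** Returns the ids of the documents indexed under the given key (possibly empty). */
    Set<String> lookup(String key) {
        Set<String> docs = postings.get(key);
        return docs == null ? Collections.<String>emptySet() : docs;
    }

    private void post(String key, String docId) {
        if (key.isEmpty() || key.endsWith(":")) {
            return; // skip empty tokens produced by leading separators
        }
        Set<String> docs = postings.get(key);
        if (docs == null) {
            docs = new TreeSet<String>();
            postings.put(key, docs);
        }
        docs.add(docId);
    }

    private static String[] tokenize(String text) {
        // Lower-case, then split on anything that is neither a letter nor a digit.
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
    }
}
```

With such an index, a query on the plain term "paris" returns every document that mentions the word, whereas a query on "city:paris" returns only those in which an extraction rule identified Paris as the city.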