DCTFinder Documentation

Extracts the title and creation time of a web page.

Status: Beta

Brought to you by: xtannier


DCTFinder Quick Start

License

DCTFinder is released under the CeCILL free software license agreement.

Requirements

Java

DCTFinder requires Java 7 or later.

Wapiti (Conditional Random Fields)

To use DCTFinder, you need to install the CRF tool Wapiti.
You provide the path to the Wapiti executable (called wapiti) when creating the DCTExtractor object in Java.
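
Because the extractor depends on an external executable, it can help to check the path before constructing it. The following is a minimal sketch that uses only the JDK; the class name WapitiCheck and the helper requireExecutable are hypothetical and not part of DCTFinder, and the path is the same placeholder used elsewhere on this page.

```java
import java.io.File;

public class WapitiCheck {

    /**
     * Returns the given path as a File if it points to an existing,
     * executable regular file; throws IllegalArgumentException otherwise.
     * (Hypothetical helper, not part of the DCTFinder API.)
     */
    static File requireExecutable(String path) {
        File file = new File(path);
        if (!file.isFile()) {
            throw new IllegalArgumentException("Not a regular file: " + path);
        }
        if (!file.canExecute()) {
            throw new IllegalArgumentException("Not executable: " + path);
        }
        return file;
    }

    public static void main(String[] args) {
        // Placeholder path; replace with the location of your Wapiti binary.
        String path = args.length > 0 ? args[0] : "/path/to/wapiti";
        try {
            File wapiti = requireExecutable(path);
            System.out.println("Wapiti binary found: " + wapiti.getAbsolutePath());
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

The returned File can then be handed to the DCTExtractor constructor shown in the Java API example.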

Common JARs

The following external JARs are required:

Apache Commons Lang
Apache Commons CLI (needed for training only)

Java API example

The basic usage of DCTFinder for extracting information from a web page is as follows (also provided in the class fr.limsi.dctfinder.Test):

import java.io.*;
import java.util.*;
import java.net.URL;
import java.nio.file.*; // only needed if the Wapiti binary is given as a Path


[...]


// File or Path pointing to the Wapiti binary
File wapitiBinaryFile = new File("/path/to/wapiti");
// or: Path wapitiBinaryFile = Paths.get("/path/to/wapiti");

// Create DCT extractor
DCTExtractor extractor = new DCTExtractor(wapitiBinaryFile);

// Specify locale (can be Locale.US, Locale.UK, Locale.FRANCE, Locale.FRENCH, ...)
Locale locale = Locale.ENGLISH;
// Create URL
URL url = new URL(...);
// Open an InputStream from a downloaded file or directly from the URL
InputStream is = ...;
// Get the download date (Calendar object)
// Knowing the download date leads to better results,
//    but it can be set to null
Calendar downloadDate = new GregorianCalendar();

// Get page info
// The URL (second parameter) is used to detect a specific locale (e.g. UK) when
//     a more general one is specified (e.g. ENGLISH)
// Specific locales are important, because dates can be written very differently
//     in different countries speaking the same language (e.g. US versus UK)
// If you know in advance which country the page comes from, specify the country.
// DCTFinder provides extraction rules for Locale.UK, Locale.US and Locale.FRENCH,
//     but you can specify your own rules
PageInfo pageInfo = extractor.getPageInfos(is, url, locale, downloadDate);

// Get the document creation time (DCT)
Calendar dctCalendar = pageInfo.getDCT();
Date calendarDate = dctCalendar.getTime();

// Get title
String title = pageInfo.getTitle();
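
The example above leaves the input stream and the download date open. As one possible way to fill them in for a page that was saved to disk earlier, the sketch below opens a stream on the file and approximates the download date by the file's last-modified time. That approximation is an assumption, not something DCTFinder prescribes, and the class PageInputs and its two helpers are hypothetical names that use only the JDK.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.Calendar;
import java.util.GregorianCalendar;

public class PageInputs {

    /** Opens a buffered stream on a page that was saved to disk. */
    static InputStream openSavedPage(File page) throws IOException {
        return new BufferedInputStream(Files.newInputStream(page.toPath()));
    }

    /**
     * Approximates the download date by the file's last-modified time.
     * Assumption: the file has not been modified since it was downloaded.
     */
    static Calendar downloadDateOf(File page) {
        Calendar calendar = new GregorianCalendar();
        calendar.setTimeInMillis(page.lastModified());
        return calendar;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("Usage: java PageInputs <saved-page.html>");
            return;
        }
        File page = new File(args[0]);
        try (InputStream is = openSavedPage(page)) {
            Calendar downloadDate = downloadDateOf(page);
            // Here, `is` and `downloadDate` could be passed to
            // extractor.getPageInfos(is, url, locale, downloadDate).
            System.out.println("Downloaded around: " + downloadDate.getTime());
        }
    }
}
```

If the real download time is recorded elsewhere (for example by the crawler), that value is preferable to the file timestamp.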

Building new resources

If you want to adapt DCTFinder to a specific problem or a new language, you should first read the LREC paper that describes the system (see reference below).
Then, you can edit the rules or add a new set of rules for a brand new language. If you intend to do so, feel free to contact me for help.

Language-specific rules

The organization and syntax of the language-dependent rules are described in the rules documentation.
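
To see why rules are tied to a country-specific locale rather than to a language alone, consider a purely numeric date. The sketch below parses the same string with a month-first and a day-first pattern using the JDK's SimpleDateFormat. The patterns are chosen by hand for illustration and do not reflect DCTFinder's rule syntax; the class DateOrderDemo and the helper monthOf are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

public class DateOrderDemo {

    /** Parses a date with an explicit pattern and returns its 1-based month. */
    static int monthOf(String text, String pattern, Locale locale) throws ParseException {
        SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
        format.setLenient(false); // reject impossible dates instead of rolling them over
        Calendar calendar = new GregorianCalendar();
        calendar.setTime(format.parse(text));
        return calendar.get(Calendar.MONTH) + 1; // Calendar months are 0-based
    }

    public static void main(String[] args) throws ParseException {
        String text = "03/04/2014";
        // Month-first reading, as commonly used in the US: March
        System.out.println("US reading, month = " + monthOf(text, "MM/dd/yyyy", Locale.US));
        // Day-first reading, as commonly used in the UK: April
        System.out.println("UK reading, month = " + monthOf(text, "dd/MM/yyyy", Locale.UK));
    }
}
```

The same string yields month 3 under the first reading and month 4 under the second, which is the kind of ambiguity a country-specific rule set resolves.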

Training the system

If you want to train or evaluate the system (with a new dataset or in a new language), use the main method of the class DCTExtractorTrainingAndEvaluation with one of the options OPTION_MODE_TRAIN, OPTION_MODE_TEST or OPTION_MODE_CROSS_VALIDATION.

Publication

The system is described in the following LREC paper:

Xavier Tannier. "Extracting News Web Page Creation Time with DCTFinder", Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014). 2014, Reykjavik, Iceland.

Documentation: rules