DCTFinder Documentation

Extract the title and creation time from a web page.

Status: Beta

Brought to you by: xtannier

DCTFinder Quick Start

Requirements

Java

DCTFinder requires Java 7 or later.

Wapiti (Conditional Random Fields)

To use DCTFinder, you need to install the CRF tool Wapiti.
You provide the path to the Wapiti executable (called wapiti) when creating the DCTExtractor object in Java.
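
Because a wrong Wapiti path would otherwise only surface later, it can help to validate the path before creating the extractor. The following is a minimal sketch, not part of DCTFinder: the class WapitiPathCheck and the method requireExecutable are hypothetical names introduced here for illustration, and only standard JDK calls are used.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of DCTFinder): checks that the Wapiti binary is usable.
final class WapitiPathCheck {

    /** Returns the normalized absolute path if it is an executable regular file; throws otherwise. */
    static Path requireExecutable(String location) {
        Path path = Paths.get(location).toAbsolutePath().normalize();
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("Wapiti binary not found: " + path);
        }
        if (!Files.isExecutable(path)) {
            throw new IllegalArgumentException("Wapiti binary is not executable: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        // "/path/to/wapiti" is a placeholder; pass the real location as the first argument.
        Path wapiti = requireExecutable(args.length > 0 ? args[0] : "/path/to/wapiti");
        System.out.println("Using Wapiti at " + wapiti);
    }
}
```

The validated path can then be handed to the DCTExtractor constructor, either directly as a Path or as a File via wapiti.toFile().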

Common JARs

The following external JAR is required:

Apache Commons Lang

Download

Java API example

The basic usage of DCTFinder for extracting the title and creation time of a web page is as follows:

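// JDK imports used below; DCTExtractor and PageInfo come from the DCTFinder JAR
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// File or Path to the Wapiti binary
File wapitiFile = new File("/path/to/wapiti");
// or: Path wapitiFile = Paths.get("/path/to/wapiti");

// Create the DCT extractor
DCTExtractor extractor = new DCTExtractor(wapitiFile);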

// Specify the locale (can be Locale.US, Locale.UK, Locale.FRANCE, Locale.FRENCH, ...)
Locale locale = Locale.ENGLISH;
// Create the URL of the page
URL url = new URL(...);
// Open an InputStream from a file or a URL
InputStream is = ...;
// Download date (Calendar object); here, the current time.
// Knowing the download date leads to better results,
//    but it can be set to null
Calendar date = new GregorianCalendar();

// Get page info
PageInfo pageInfo = extractor.getPageInfos(is, url, locale, date);

// Get the document creation time (DCT)
Calendar dctCalendar = pageInfo.getDCT();
Date calendarDate = dctCalendar.getTime();

// Get title
String title = pageInfo.getTitle();
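
Once the creation time is available as a Calendar, it usually needs to be turned into a string for display or storage. The sketch below shows one way to do that with the JDK only. The class DctFormatting and the method toIsoDate are hypothetical names, not part of DCTFinder, and the null check is an assumption: this page does not say whether getDCT() can return null when no date is found.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical helper (not part of DCTFinder): formats a creation time as yyyy-MM-dd.
final class DctFormatting {

    /** Formats the calendar in its own time zone; returns "unknown" for null (assumed case). */
    static String toIsoDate(Calendar dct) {
        if (dct == null) {
            return "unknown";
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        // Use the calendar's time zone so the printed day matches the calendar's fields.
        format.setTimeZone(dct.getTimeZone());
        return format.format(dct.getTime());
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by pageInfo.getDCT()
        Calendar example = new GregorianCalendar(2014, Calendar.MARCH, 5);
        System.out.println(toIsoDate(example)); // prints 2014-03-05
    }
}
```

A new SimpleDateFormat is created on each call because the class is not thread-safe; sharing one instance across threads would need external synchronization.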

If you want to train or evaluate the system (with a new dataset or in a new language), use the main method of the class DCTExtractorTrainingAndEvaluation with one of the options OPTION_MODE_TRAIN, OPTION_MODE_TEST or OPTION_MODE_CROSS_VALIDATION.

Building new resources

TBD (contact me if needed)