DCTFinder Documentation
Extracts the title and creation time from a web page.
Status: Beta
Brought to you by:
xtannier
DCTFinder requires Java 7 or later.
To use DCTFinder, you need to install the CRF toolkit Wapiti.
You provide the path to the Wapiti executable (named wapiti) when creating the DCTExtractor object in Java.
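Because the extractor depends on an external binary, it can help to check the path before constructing the DCTExtractor. The following is a minimal sketch using only the JDK; the class WapitiPathCheck and the method requireExecutable are hypothetical helpers and not part of DCTFinder.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of DCTFinder: validates the Wapiti path
// before it is handed to the DCTExtractor constructor.
public class WapitiPathCheck {

    // Returns the path unchanged if it points to an executable regular file;
    // otherwise throws IllegalArgumentException with an explanatory message.
    public static Path requireExecutable(Path wapiti) {
        if (!Files.isRegularFile(wapiti)) {
            throw new IllegalArgumentException("Wapiti binary not found: " + wapiti);
        }
        if (!Files.isExecutable(wapiti)) {
            throw new IllegalArgumentException("Wapiti binary is not executable: " + wapiti);
        }
        return wapiti;
    }

    public static void main(String[] args) {
        // "/path/to/wapiti" is a placeholder, as in the usage example.
        String location = args.length > 0 ? args[0] : "/path/to/wapiti";
        Path wapiti = requireExecutable(Paths.get(location));
        System.out.println("Using Wapiti at " + wapiti.toAbsolutePath());
    }
}
```

Failing early with a clear message is usually easier to diagnose than an error raised later, when the external process is launched.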
The following external JAR is required:
Basic usage of DCTFinder to extract the page information (title and creation time) of a web page is as follows:
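// JDK imports used below (DCTExtractor and PageInfo come from the DCTFinder JAR)
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// File or Path to the Wapiti binary file
File wapitiFile = new File("/path/to/wapiti");
// or: Path wapitiFile = Paths.get("/path/to/wapiti");
// Create DCT extractor
DCTExtractor extractor = new DCTExtractor(wapitiFile);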
// Specify locale (can be Locale.US, Locale.UK, Locale.FRANCE, Locale.FRENCH, ...)
Locale locale = Locale.ENGLISH;
// Create URL
URL url = new URL(...);
// Open an InputStream from a file or a URL
InputStream is = ...;
// Download date (Calendar object); here, the current time.
// Knowing the download date leads to better results,
// but it can be set to null.
Calendar date = new GregorianCalendar();
// Get page info
PageInfo pageInfo = extractor.getPageInfos(is, url, locale, date);
// Get the document creation time (DCT)
Calendar dctCalendar = pageInfo.getDCT();
Date calendarDate = dctCalendar.getTime();
// Get title
String title = pageInfo.getTitle();
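The usage above leaves two steps open: how to obtain the InputStream, and what to do with the returned Calendar. The sketch below shows one possible way to handle both with the JDK alone. The class PageInputSketch and its methods are hypothetical and not part of DCTFinder. Whether getDCT() can return null is not stated here, so the null check is a defensive assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

// Hypothetical helpers, not part of DCTFinder.
public class PageInputSketch {

    // Opens a stream on a locally saved page; the caller must close it.
    public static InputStream open(Path file) throws IOException {
        return Files.newInputStream(file);
    }

    // Opens a stream directly on a URL; the caller must close it.
    public static InputStream open(URL url) throws IOException {
        return url.openStream();
    }

    // Formats a Calendar as yyyy-MM-dd in the calendar's own time zone.
    // Returns null for a null calendar (defensive assumption, see lead-in).
    public static String formatDate(Calendar calendar) {
        if (calendar == null) {
            return null;
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        format.setTimeZone(calendar.getTimeZone());
        return format.format(calendar.getTime());
    }
}
```

With these helpers, the stream could be opened in a try-with-resources statement, for example `try (InputStream is = PageInputSketch.open(url)) { ... }`, so that it is closed once the page information has been extracted.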
If you want to train or evaluate the system (with a new dataset or in a new language), run the main method of the class DCTExtractorTrainingAndEvaluation with one of the options OPTION_MODE_TRAIN, OPTION_MODE_TEST or OPTION_MODE_CROSS_VALIDATION.
TBD (contact me if needed)