Problem: XQuery examples with node() parameters failed with empty variables
- EmptyVariable.INSTANCE instead of executing child <get> elements
- <xq-param> and <xq-expression> child elements
Solution (TDD + 3 commits):
- executeChildElements() to execute nested <get> in <xq-param>
- super.executeBody(context) (1 line fix!)
Test Results:
- XQueryNodeParameterTest passes
- totalProducts = 3, totalValue = 77.24, averagePrice = 25.75
See XQUERY_FIX_SUMMARY.md for complete technical details.
TDD Approach Success: Fixed two critical bugs discovered through comprehensive testing!
Problem: HttpPlugin (and all plugins with @Inject dependencies) didn't work!
- newInstance() (reflection)
- HttpService was null → all HTTP requests failed
Solution: Use InjectorHelper.getInjector().getInstance(pluginClass)
- @Inject dependencies now properly injected
Problem: Fixed critical bug in JSRScriptEngineAdapter.copyVariables() that existed since 2006!
Problem:
- Variable objects instead of their values
- "Operator '+' inappropriate for objects"
- ${counter + 1}, ${count < 5}, all arithmetic/comparison expressions
Solution:
- Variable.getWrappedObject() (was only for ScriptingVariable)
- String→Number conversion for arithmetic operations
Impact:
Test Coverage:
- BaseTemplaterStaticInjectionTest - 3 tests verifying Guice static injection
- JSRScriptEngineAdapterVariableConversionTest - 4 tests for BeanShell type conversion
CHANGELOG.md for complete details
Status: ✅ IMPLEMENTED via TDD (Red/Green cycle)
Severity: HIGH - Resource leaks, memory corruption
Component: AbstractPlugin execution lifecycle
Original Problem (BaseProcessor architecture):
When execute() threw an exception, cleanup code was NOT executed:
- scraper.finishExecutingProcessor() - stack cleanup skipped
- setProperty() - execution time not recorded
- writeDebugFile() - debug output lost
v2.2.0 Analysis - SAME PROBLEM EXISTS!
// AbstractPlugin.execute() - BEFORE FIX
try {
    return doExecute(context);
} catch (Exception e) {
    throw new RuntimeException(...);
}
// ❌ NO finally block - cleanup skipped on exception!
Solution - TDD Implementation:
// AbstractPlugin.execute() - AFTER FIX
try {
    return doExecute(context);
} catch (Exception e) {
    throw new RuntimeException(...);
} finally {
    // ✅ Cleanup ALWAYS executed!
    try {
        doCleanup();
    } catch (Exception cleanupEx) {
        LOGGER.warn("Cleanup failed: {}", cleanupEx.getMessage());
    }
}
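The pattern above can be tried in isolation with a minimal, self-contained sketch. The class CleanupDemo, its events list and the fail flag are hypothetical stand-ins for illustration only; they are not part of AbstractPlugin.

```java
import java.util.ArrayList;
import java.util.List;

class CleanupDemo {
    // Records what happened so the order of events can be inspected.
    static final List<String> events = new ArrayList<>();

    // Hypothetical stand-in for the plugin's cleanup hook.
    static void doCleanup() {
        events.add("cleanup");
    }

    // Hypothetical stand-in for execute(); 'fail' simulates a plugin error.
    static String execute(boolean fail) {
        try {
            if (fail) {
                throw new IllegalStateException("plugin failed");
            }
            events.add("executed");
            return "ok";
        } catch (IllegalStateException e) {
            events.add("wrapped");
            throw new RuntimeException("Plugin execution failed", e);
        } finally {
            // Runs on both the normal and the exceptional path.
            try {
                doCleanup();
            } catch (RuntimeException cleanupEx) {
                // A cleanup failure is reported but must not mask the original error.
                System.err.println("Cleanup failed: " + cleanupEx.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        try {
            execute(true);
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println(events); // [wrapped, cleanup]
    }
}
```

Wrapping the cleanup call in its own try/catch keeps a failing cleanup from replacing the exception that caused the failure in the first place.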
Impact:
Test Coverage:
- AbstractPluginExceptionBug5Test - 4 unit tests (TDD red/green)
- ExceptionHandlingIntegrationBug5Test - 4 integration tests
Usage:
No code changes required - all plugins automatically benefit from the fix!
The cleanup mechanism is built into AbstractPlugin.execute() and works transparently.
Status: ✅ IMPLEMENTED (Patch #7 + Detection)
Severity: MEDIUM - Wrong scope behavior, but safe
Component: JSRScriptEngineAdapter
Original Problem (SourceForge Patch #7):
Function parameters stay in ScriptEngine "forever" - second call sees first call's parameters!
<function name="func"><return>${param1}</return></function>
<!-- First call WITH param1 -->
<call name="func">
<call-param name="param1">VALUE</call-param>
</call>
<!-- Second call WITHOUT param1 -->
<call name="func">
<!-- NO param1! Should be void, but gets 'VALUE' from first call! -->
</call>
v2.0 behavior: param1 = void (correct)
v2.1+ behavior: param1 = 'VALUE' from first call ❌
Root Cause:
copyVariables() only ADDED variables via put() but never CLEARED old ones!
Solution - Hybrid Approach:
// AFTER FIX
private void copyVariables(DynamicScopeContext context) {
detectScopePollution(context); // WARN or THROW
clearJSRScriptContextAttributes(); // CLEAR all bindings
// ... copy variables
}
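The root cause (variables added via put() but never cleared) can be reproduced with the standard javax.script.SimpleBindings class alone. This is only a sketch: ScopeClearDemo and the clearFirst flag are hypothetical, and the real copyVariables() works on a DynamicScopeContext rather than a plain map.

```java
import java.util.Map;
import javax.script.Bindings;
import javax.script.SimpleBindings;

class ScopeClearDemo {
    // Copies the parameters of one call into the engine bindings.
    // clearFirst == false mimics the buggy behaviour, true mimics the fix.
    static void copyVariables(Bindings engineScope,
                              Map<String, Object> callParams,
                              boolean clearFirst) {
        if (clearFirst) {
            engineScope.clear(); // drop everything left over from earlier calls
        }
        engineScope.putAll(callParams);
    }

    public static void main(String[] args) {
        Bindings buggy = new SimpleBindings();
        copyVariables(buggy, Map.of("param1", "VALUE"), false); // first call
        copyVariables(buggy, Map.of(), false);                  // second call, no param1
        System.out.println(buggy.containsKey("param1"));        // true: stale value leaks

        Bindings fixed = new SimpleBindings();
        copyVariables(fixed, Map.of("param1", "VALUE"), true);
        copyVariables(fixed, Map.of(), true);
        System.out.println(fixed.containsKey("param1"));        // false
    }
}
```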
Detection Modes:
- -Dwebharvest.script.scope.strict=true
BeanShell Limitation:
Due to BeanShell's internal bsh.NameSpace architecture, complete v2.0 scope semantics are not achievable via the JSR-223 API alone. The detection mechanism provides a safety net.
Impact:
Best Practice:
<!-- ALWAYS provide all parameters explicitly -->
<call name="func">
<call-param name="param1"></call-param> <!-- Empty if not used -->
</call>
Test Coverage:
- FunctionParameterScopeBug7Test - 6 unit tests (4 active, 2 disabled with docs)
- FunctionScopeIntegrationBug7Test - 3 integration tests
Problem: Configuration duplication! GUI used webharvest.properties, CLI required all args explicitly.
- option=value)
Solution: Unified settings system with modern Apache Commons CLI!
- ~/.webharvest/webharvest.properties
- -option value format (e.g., -c config.xml -w /tmp)
- option=value format still works
Usage:
# Configure once in ~/.webharvest/webharvest.properties
workdir=/projects/scraping
proxyhost=proxy.company.com
# CLI automatically loads settings
java -jar webharvest-cli.jar -c scraper.xml
# Override when needed
java -jar webharvest-cli.jar -c scraper.xml -w /tmp
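The override behaviour shown above amounts to layering two sources of settings, with command-line values taking precedence over the properties file. A minimal sketch using only java.util.Properties follows; SettingsLayeringDemo and effectiveSettings are hypothetical names and do not reflect how the WebHarvest settings classes are actually structured.

```java
import java.util.Map;
import java.util.Properties;

class SettingsLayeringDemo {
    // File settings act as defaults; values given on the command line win.
    static Properties effectiveSettings(Properties fileSettings,
                                        Map<String, String> cliOverrides) {
        Properties merged = new Properties();
        merged.putAll(fileSettings);
        merged.putAll(cliOverrides);
        return merged;
    }

    public static void main(String[] args) {
        // Stands in for the contents of ~/.webharvest/webharvest.properties.
        Properties file = new Properties();
        file.setProperty("workdir", "/projects/scraping");
        file.setProperty("proxyhost", "proxy.company.com");

        // Stands in for "-w /tmp" given on the command line.
        Properties merged = effectiveSettings(file, Map.of("workdir", "/tmp"));

        System.out.println(merged.getProperty("workdir"));   // /tmp
        System.out.println(merged.getProperty("proxyhost")); // proxy.company.com
    }
}
```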
Test Coverage:
- WebHarvestSettingsBug39Test - 6 tests for settings loading
- CommandLineParserBug39Test - 6 tests for CLI parsing
- WebHarvestCLIIntegrationBug39Test - 6 integration tests
Problem: After successful login, HTTP 302 redirect logs user out because headers (Referer, auth tokens) are lost.
Solution: Configurable HttpService with header preservation!
Usage:
// Preserve headers across redirects (login workflows)
Map<String, String> headers = new HashMap<>();
headers.put("Referer", "https://example.com/login");
headers.put("X-API-Key", "secret");
HttpService service = new HttpServiceImpl(true, headers); // followRedirects=true
// Disable redirects for manual handling
HttpService manualService = new HttpServiceImpl(false); // followRedirects=false
Test Coverage:
- HttpServiceRedirectBug36Test - 6 unit tests
- HttpRedirectBug36Test - 2 tests + 4 documented scenarios
- HttpRedirectIntegrationBug36Test - 6 integration tests with real scenarios
Problem: OutOfMemoryError with 4000+ loop iterations, even with 2GB heap!
- <def>, <xpath>, <html-to-xml> variables
- <empty> as workaround
Solution: Automatic cleanup via nested context!
- <empty> wrapper no longer needed
Impact:
Before: 4000 items × 3MB/item = 12GB → OOM at iteration 600
After: Constant 3MB (cleanup after each iteration) → ✅ 10,000+ items OK
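The effect of a nested context per iteration can be illustrated with a small scope-stack model. This is a sketch under stated assumptions, not the actual loop implementation: LoopScopeDemo is hypothetical, each iteration is assumed to define one new variable, and variables are counted instead of measuring memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class LoopScopeDemo {
    // Returns the highest number of variables alive at any point during the loop.
    static int peakLiveVariables(int iterations, boolean nestedScopePerIteration) {
        Deque<Map<String, Object>> scopes = new ArrayDeque<>();
        scopes.push(new HashMap<>()); // root scope
        int peak = 0;
        for (int i = 0; i < iterations; i++) {
            if (nestedScopePerIteration) {
                scopes.push(new HashMap<>()); // fresh scope for this iteration
            }
            // Assumption: every iteration defines one variable holding page data.
            scopes.peek().put("page" + i, "payload");

            int live = 0;
            for (Map<String, Object> scope : scopes) {
                live += scope.size();
            }
            peak = Math.max(peak, live);

            if (nestedScopePerIteration) {
                scopes.pop(); // iteration variables become unreachable here
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println(peakLiveVariables(4000, false)); // 4000
        System.out.println(peakLiveVariables(4000, true));  // 1
    }
}
```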
Test Coverage:
- LoopMemoryLeakBug19Test - 4 tests demonstrating bug and fix
User Question: "Currently it handles only exception of BaseException type and there is no way to handle other exceptions"
Verification: TryPlugin ALREADY catches ALL exceptions!
- catch (Exception e) - highest level, catches everything
Conclusion: Bug #34 was already fixed in previous versions. User's intuition was correct: "Maybe we already have this :)" → YES!
Test Coverage:
- TryPluginExceptionTypesBug34Test - 6 verification tests
User Request: "It would be better, if in the function one can refer to global variables, not just the function parameters"
Verification: Functions ALREADY access global variables!
- ${globalVar} for any variable in scope
Example:
<!-- Global config -->
<def var="apiHost">https://api.example.com</def>
<def var="apiKey">secret</def>
<!-- Function uses globals + parameter -->
<function name="apiGet">
<return>
<http url="${apiHost}/${endpoint}"
timeout="5000">
<http-header name="X-API-Key" value="${apiKey}"/>
</http>
</return>
</function>
<!-- Call with only the changing parameter -->
<call name="apiGet"><call-param name="endpoint">users</call-param></call>
Test Coverage:
- FunctionGlobalVariablesBug6Test - 4 verification tests
User Request: var-defs append, list as first-class tag, list append/add feature.
- <def var="x">${x}${new}</def> works
- <list> IS @CorePlugin!
Problem: User spent 8 hours stuck on: form submit, pagination, detail scraping.
- tutorial-form-submit-pagination.html
- <http method="POST"> IS the submit!
- <while> + "Next" button check
Problem: "Debugging is really complicated without having a possibility to look at what WebHarvest is actually processing"
- debugging-guide.html - 5 debugging patterns
- debug/response-${sys.timestamp()}.html)
- <log level="DEBUG">)
Problem: "Max connections/second to avoid break down of webserver. Currently webharvest downloads with maximum possible connection count"
- rate-limiting-guide.html - 6 rate limiting patterns
Request: Support HTMLSQL/WebSQL for SQL-like HTML queries
- a.link vs //a[@class='link']
User Request: "There needs to be a way to check for this required structure - to ensure that the corresponding processor can be run"
Verification: WebHarvest ALREADY provides complete DOM validation!
- count(//div[@class='product']) > 0
Example:
<!-- Validate DOM structure before processing -->
<def var="hasPrice">
<xpath expression="count(//span[@class='price']) > 0">${page}</xpath>
</def>
<if condition="${hasPrice}">
<!-- ✅ Structure OK - safe to extract -->
<def var="price">
<xpath expression="//span[@class='price']/text()">${page}</xpath>
</def>
</if>
<else>
<log message="Required DOM structure not found"/>
</else>
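The count(...) > 0 check used above is plain XPath 1.0, so it can be tried outside WebHarvest with the JDK's javax.xml.xpath API. The sketch below is illustrative only: DomValidationDemo and hasPrice are hypothetical names, and the input must already be well-formed XML (in a configuration this would be the output of <html-to-xml>).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class DomValidationDemo {
    // True if the document contains at least one <span class="price"> element.
    static boolean hasPrice(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (Boolean) xpath.evaluate(
                    "count(//span[@class='price']) > 0",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            // Malformed input or expression: report it as an unchecked error.
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasPrice("<div><span class='price'>9.99</span></div>")); // true
        System.out.println(hasPrice("<div><span class='name'>Widget</span></div>")); // false
    }
}
```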
Test Coverage:
- DomValidationBug14Test - 6 validation pattern examples
Status: ✅ IMPLEMENTED
New <log> plugin for writing to stdout/stderr - perfect for debugging and CI/CD!
Usage:
<!-- Log to stdout (default) -->
<log>Processing item ${i}...</log>
<!-- Log to stderr for errors -->
<log dest="err">ERROR: ${sys.exception}</log>
<!-- Progress monitoring -->
<loop item="url" index="i">
<list>${urls}</list>
<log>Processing page ${i}: ${url}</log>
<http url="${url}"/>
</loop>
Features:
- dest="out" (default) or dest="err"
- ${var}
Test Coverage:
- LogPluginFeature2Test - 4 unit tests (stdout/stderr separation)
- LogPluginIntegrationFeature2Test - 5 integration tests (real scenarios)
Use Cases:
vs. Alternatives:
- <echo> - stores in variables (not console)
- <log> - writes to console (perfect for monitoring)
WebHarvest now features a modern, web-based IDE that runs as a standalone Java application with embedded HTTP server!
Key Features:
- java -jar webharvest-ide-2.2.0.jar
Quick Start:
cd webharvest-ide
java -jar target/webharvest-ide-2.2.0.jar
# Opens automatically at http://localhost:8080
Problem: <def> elements were executed in nested scopes, so the variables they created were cleaned up once the configuration completed, making them inaccessible to the IDE and other tools.
Solution: ConfigPlugin.executeBody() now executes children directly in ROOT scope instead of creating nested scopes.
Impact:
ScraperExecutionEndEvent now includes the execution context, allowing tools to extract variables before scope cleanup.
WebHarvest is an open-source web data extraction tool written in Java. It provides a powerful and flexible framework for collecting web pages and extracting useful data from them using XML-based configuration files.
- -Dwebharvest.plugin.packages=... and -Dwebharvest.schema.validation=false
- <connection id="myDB" .../>
- <database connection="myDB" sql="..."/>
WebHarvest IDE is a modern, web-based development environment for creating and testing scraping configurations. It runs as a standalone Java application with an embedded Jetty server.
- configResult)
# From project root
cd webharvest-ide
java -jar target/webharvest-ide-2.2.0.jar
# IDE opens automatically at http://localhost:8080
# If port 8080 is busy, it will try ports 8081-8179
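The port fallback described in the comment above can be sketched as a simple probe loop. This is an assumption about the general technique, not the IDE's actual startup code; PortFallbackDemo and findFreePort are hypothetical names.

```java
import java.io.IOException;
import java.net.ServerSocket;

class PortFallbackDemo {
    // Returns the first port in [first, last] that can be bound, or -1 if none can.
    static int findFreePort(int first, int last) {
        for (int port = first; port <= last; port++) {
            try (ServerSocket probe = new ServerSocket(port)) {
                return port; // binding succeeded; the probe socket is closed on return
            } catch (IOException busy) {
                // Port is in use or not permitted; try the next one.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int port = findFreePort(8080, 8179);
        System.out.println(port == -1
                ? "No free port between 8080 and 8179"
                : "Would start on http://localhost:" + port);
    }
}
```

Probing and then binding later leaves a small window in which another process can take the port, so a real server would usually bind its own listener directly inside the loop.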
┌─────────────────────────────────────────┐
│ Browser (http://localhost:8080) │
│ ├─ Monaco Editor (XML editing) │
│ ├─ REST API (config validation) │
│ └─ WebSocket (live execution updates) │
└─────────────────────────────────────────┘
↓ HTTP/WebSocket
┌─────────────────────────────────────────┐
│ Embedded Jetty Server (Java) │
│ ├─ ConfigServlet (/api/config) │
│ ├─ ExecutionServlet (/api/execute) │
│ ├─ ExecutionWebSocketServlet (/ws) │
│ └─ StaticFileHandler (/webapp) │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ WebHarvest Core Engine │
│ └─ ExecutionManager (thread pool) │
└─────────────────────────────────────────┘
- docs/web/ styling with WebHarvest branding
Download the latest release from GitHub Releases or build from source:
git clone https://github.com/webharvest/webharvest.git
cd webharvest
mvn clean install
<dependency>
<groupId>org.webharvest</groupId>
<artifactId>webharvest-core</artifactId>
<version>2.2.0</version>
</dependency>
dependencies {
implementation 'org.webharvest:webharvest-core:2.2.0'
// Optional: External plugins
implementation 'org.webharvest:webharvest-database:2.2.0'
implementation 'org.webharvest:webharvest-mail:2.2.0'
implementation 'org.webharvest:webharvest-ftp:2.2.0'
}
Create a simple configuration file example.xml:
<?xml version="1.0" encoding="UTF-8"?>
<config>
<http url="https://example.com">
<xpath expression="//title" var="title"/>
<xpath expression="//p" var="paragraphs"/>
</http>
<file path="output.xml">
<template>
<results>
<title>${title}</title>
<content>${paragraphs}</content>
</results>
</template>
</file>
</config>
Run the scraper:
java -jar webharvest-cli.jar -c example.xml
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.ScraperContext;
import org.webharvest.definition.Config;
// Load configuration
Config config = Config.fromFile("example.xml");
// Create scraper context
ScraperContext context = new ScraperContext();
// Execute scraping
Scraper scraper = new Scraper(config);
scraper.execute(context);
// Get results
String result = context.getVar("title").toString();
WebHarvest follows a modular, plugin-based architecture:
webharvest/
├── webharvest-core/ # Core framework and plugins
├── webharvest-cli/ # Command-line interface
├── webharvest-ide/ # IDE integration
└── docs/ # Documentation
WebHarvest provides a rich set of built-in plugins:
WebHarvest uses XML Schema for configuration validation:
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://org.webharvest/schema/2.1/core">
<!-- Plugin configurations -->
</config>
Each plugin supports specific attributes and body content:
<http url="https://example.com" method="GET" timeout="30000">
<http-header name="User-Agent" value="WebHarvest/2.1.0"/>
<http-param name="format" value="json"/>
<xpath expression="//title" var="title"/>
</http>
Variables can be referenced using ${variableName} syntax:
<template>
<result>
<title>${title}</title>
<timestamp>${currentTime}</timestamp>
</result>
</template>
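To make the ${variableName} mechanism concrete, here is a deliberately simplified sketch of name-based substitution. It is not WebHarvest's templater, which evaluates full script expressions such as ${counter + 1}; VariableSubstitutionDemo and render are hypothetical, and unknown variables are assumed to render as empty strings.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableSubstitutionDemo {
    // Matches ${name} where name is a plain identifier.
    private static final Pattern REFERENCE =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    // Replaces every ${name} with its value; unknown names become "".
    static String render(String template, Map<String, String> variables) {
        Matcher matcher = REFERENCE.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = variables.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in values from being interpreted.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "<result><title>${title}</title><timestamp>${currentTime}</timestamp></result>";
        System.out.println(render(template, Map.of("title", "Example Domain")));
        // <result><title>Example Domain</title><timestamp></timestamp></result>
    }
}
```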
WebHarvest has comprehensive test coverage across all modules:
Infrastructure and utility tests
webharvest-database: Integration tests for database plugins
# Run all tests
mvn test
# Run tests for specific module
cd webharvest-core
mvn test
# Run with coverage report
mvn clean test jacoco:report
# Report: target/site/jacoco/index.html
The IDE has professional integration tests that verify:
cd webharvest-ide
mvn test
# Tests: ExecutionManagerWebSocketTest
Test Results:
✓ testSimpleDefVariables - Variable extraction
✓ testXPathPluginVariables - XPath processing
✓ testMultiVariablePipeline_CRITICAL - Full pipeline
✓ testLogStreaming - Log delivery
<config>
<http url="https://news.ycombinator.com">
<xpath expression="//a[@class='storylink']" var="links"/>
<xpath expression="//span[@class='score']" var="scores"/>
</http>
<file path="hackernews.xml">
<template>
<news>
<#foreach link in links>
<item>
<title>${link.text}</title>
<url>${link.@href}</url>
<score>${scores[link.index].text}</score>
</item>
</#foreach>
</news>
</template>
</file>
</config>
<config>
<database connection="jdbc:mysql://localhost:3306/test"
jdbcclass="com.mysql.jdbc.Driver"
username="user" password="pass">
SELECT * FROM users WHERE active = 1
</database>
<http url="https://api.example.com/users" method="POST">
<template>
<users>
<#foreach user in databaseResult>
<user>
<id>${user.id}</id>
<name>${user.name}</name>
<email>${user.email}</email>
</user>
</#foreach>
</users>
</template>
</http>
</config>
<config>
<script language="javascript">
var data = [];
for (var i = 0; i < 10; i++) {
data.push({
id: i,
value: Math.random() * 100
});
}
context.setVar("generatedData", data);
</script>
<file path="output.json">
<template>${generatedData}</template>
</file>
</config>
AbstractPlugin
Base class for all plugins providing common functionality:
public abstract class AbstractPlugin {
    protected abstract Variable doExecute(DynamicScopeContext context)
            throws PluginException, InterruptedException;

    protected abstract void doValidateConfiguration(PluginConfiguration config)
            throws PluginValidationException;
}
DynamicScopeContext
Runtime context for variable management and execution state:
public class DynamicScopeContext {
    public void setVar(String name, Variable value);
    public Variable getVar(String name);
    public String getCharset();
}
PluginConfiguration
Configuration management for plugins:
public interface PluginConfiguration {
    String getProperty(String name);
    String getProperty(String name, String defaultValue);
    void setProperty(String name, String value);
}
To create a custom plugin:
- AbstractPlugin
- @CorePlugin annotation
@CorePlugin(elementName = "myplugin")
public class MyPlugin extends AbstractPlugin {
    @Override
    protected Variable doExecute(DynamicScopeContext context)
            throws PluginException, InterruptedException {
        // Plugin implementation
        return new NodeVariable("result");
    }

    @Override
    protected void doValidateConfiguration(PluginConfiguration config)
            throws PluginValidationException {
        // Configuration validation
    }
}
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the BSD License.
Copyright Notice:
See licences/webharvest_licence.txt for the full license text and NOTICE for detailed attribution information.
We thank all contributors and users who help improve WebHarvest!
Note: This project has undergone substantial modernization and refactoring in 2024-2025, including migration to Java 11+, complete architecture redesign, web-based IDE implementation, and numerous enhancements. The modern plugin architecture, event-driven system, and cloud-ready infrastructure were designed starting from 2012. See NOTICE for detailed contribution history.
WebHarvest - Making web data extraction simple and powerful! 🕷️✨