Hi Ulf,
Is there any easy way to change the Lucene Analyzer used by JTrac to allow underscores in words?
The standard analyzer will word-break at underscores, but what if you wish to keep words containing underscores intact?
Thanks,
Jim
When I create a ticket with "under_score" in it, I can find it by both searching for "score" and "under_score" (which makes sense, as both indexing and searching should be using the same analyzer).
What behavior would you like to see: not finding it when searching for "score"? Or something else?
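To make the behaviour described above concrete, here is a minimal sketch in plain Java of why both searches hit. It is not JTrac or Lucene code: the class `UnderscoreSplitSketch` and its methods are hypothetical, the tokenizer is a simplified stand-in that breaks at every non-alphanumeric character, and "match" is reduced to "every query token occurs in the indexed text".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for an analyzer that breaks words at underscores.
// Hypothetical names; this does not use or reproduce Lucene's own classes.
class UnderscoreSplitSketch {

    // Lowercase the text and break it at every run of characters
    // that are neither letters nor digits (so '_' and '-' split words).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Assumption: a query matches when all of its tokens occur in the indexed text.
    // The same tokenizer is applied to both the indexed text and the query.
    static boolean matches(String indexedText, String query) {
        return tokenize(indexedText).containsAll(tokenize(query));
    }

    public static void main(String[] args) {
        String ticket = "Ticket mentioning under_score in the summary";
        System.out.println(tokenize("under_score"));        // [under, score]
        System.out.println(matches(ticket, "score"));       // true
        System.out.println(matches(ticket, "under_score")); // true
    }
}
```

Because the query goes through the same splitting as the ticket text, "under_score" is searched as "under" plus "score", and "score" alone is enough for a hit.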
And to answer the actual question - no, there is no easy way to change the analyzer through a config change. But I wouldn't mind added configurability for a good use case.
Thanks Ulf.
It's more about the false positives than about not eventually finding the hit you actually wanted.
With technical and IT text, it's not unusual to have compound words for object names and the like,
e.g. 'This_is_my_very_special_routine_name'. (A lot of older source code does not use bumpy or camel case for routine names, so if you are maintaining issues on that code base, hyphenated and underscored words are common.)
The only hit I would be interested in is one where the whole compound word/phrase matches; I'm not interested in hits on any of the component words by themselves. Also, although small words such as 'is' might be dropped as noise during indexing, they may form an important part of what you are really trying to find. E.g. we don't want to see 'This_was_my_very_special_routine_name' as a hit; it's an entirely different object.
Trying fancy searches such as weighting and proximity is tedious and error-prone, and if an important component word of the compound name is dropped as a noise word by the indexer, an exact match can't be guaranteed.
I think a simple whitespace analyzer would suffice for a lot of simple searches on the issues in my case.
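As a rough illustration of the whitespace-only behaviour asked for here, the following plain-Java sketch splits text on whitespace and nothing else, with no lowercasing and no noise-word removal. It is only a model of the idea, not JTrac's indexer: `WhitespaceSketch` and its methods are hypothetical names, and wiring Lucene's own `WhitespaceAnalyzer` into JTrac is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of whitespace-only tokenization; not Lucene or JTrac code.
class WhitespaceSketch {

    // Split on runs of whitespace only. Underscores, hyphens, case and
    // short words such as "is" are all left untouched.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Assumption: a query matches when all of its tokens occur in the indexed text.
    static boolean matches(String indexedText, String query) {
        return tokenize(indexedText).containsAll(tokenize(query));
    }

    public static void main(String[] args) {
        String ticket = "Crash in This_is_my_very_special_routine_name after upgrade";
        // The whole compound name is one token, so only the exact name hits.
        System.out.println(matches(ticket, "This_is_my_very_special_routine_name"));  // true
        System.out.println(matches(ticket, "This_was_my_very_special_routine_name")); // false
        System.out.println(matches(ticket, "routine"));                               // false
    }
}
```

The trade-off in this model is that matching becomes case-sensitive and punctuation stays attached to words, so "routine_name." with a trailing full stop would be a different token from "routine_name".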
I tried to figure out how to use the Spring Framework with its Lucene components so that I could build my own version of the indexer and search for JTrac, but I have no experience in that area, so I failed miserably.
Last edit: Jim Murray 2023-08-30
I see. I'm a bit swamped now, and about to go on vacation, but I'll look into what might be involved in adding a bit of configurability in a few weeks' time.