XeTeX - Unicode-based TeX / Bugs / #84 unicode-char-prep.pl wrongly parse UnicodeData.txt for hangul scripts.

Status: Beta

Brought to you by: jfkew, kberry, reutenauer

#84 unicode-char-prep.pl wrongly parse UnicodeData.txt for hangul scripts.

Milestone: v0.99992

Status: closed

Owner: nobody

Labels: None

Priority: 5

Updated: 2014-08-29

Created: 2013-08-29

Creator: Leo Liu

Private: No

In UnicodeData.txt, there are some lines like

AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;

to specify the range 0xAC00--0xD7A3. unicode-char-prep.pl does not parse such First/Last pairs properly, so almost all Hangul characters are not classified as letters in unicode-letters.tex.
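The range convention described above can be sketched as follows. This is a minimal illustration in Python, not the Perl code of unicode-char-prep.pl and not the patch attached later in this thread; the function name `iter_ranges` and the `SAMPLE` list are made up for the example, and the LATIN CAPITAL LETTER A line is added only as a contrast with an ordinary single-character entry.

```python
def iter_ranges(lines):
    """Yield (first, last, general_category) for each UnicodeData.txt entry.

    A "<Name, First>" / "<Name, Last>" pair is reported as one range;
    every other line is reported as a single-code-point range.
    """
    pending = None  # (first code point, range name) of an unmatched First line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)          # field 0: code point in hex
        name, category = fields[1], fields[2]  # field 2: general category
        if name.startswith("<") and name.endswith(", First>"):
            pending = (code, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>"):
            range_name = name[1:-len(", Last>")]
            if pending is None or pending[1] != range_name:
                raise ValueError(f"unmatched range end: {line}")
            yield pending[0], code, category
            pending = None
        else:
            yield code, code, category


# Hypothetical sample input: one ordinary entry plus the Hangul pair quoted above.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]

for first, last, category in iter_ranges(SAMPLE):
    print(f"{first:04X}..{last:04X} {category}")
```

A parser that ignores the First/Last markers would see only the two endpoints, which matches the symptom reported here.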

In fact, we don't need to load LineBreak.txt to determine whether a character is a letter. The \ID macro should not change the catcode, so the letter information should also be extracted from

3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

20000;<CJK Ideograph Extension B, First>;Lo;0;L;;;;;N;;;;;
2A6D6;<CJK Ideograph Extension B, Last>;Lo;0;L;;;;;N;;;;;
2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;

in UnicodeData.txt.
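To illustrate the point that the general category field alone is enough to decide letter status, here is a small Python sketch. It assumes the ranges have already been read as (first, last, category) triples; the names `RANGES`, `build_letter_table` and `is_letter` are hypothetical, and the Private Use range (category Co) is included only as a non-letter contrast. It is not how unicode-char-prep.pl is written.

```python
from bisect import bisect_right

# (first, last, general category) triples taken from the lines quoted above,
# plus the Private Use Area as an example of a range that is not a letter.
RANGES = [
    (0x3400, 0x4DB5, "Lo"),    # CJK Ideograph Extension A
    (0x4E00, 0x9FCC, "Lo"),    # CJK Ideograph
    (0xAC00, 0xD7A3, "Lo"),    # Hangul Syllable
    (0xE000, 0xF8FF, "Co"),    # Private Use (contrast, not a letter)
    (0x20000, 0x2A6D6, "Lo"),  # CJK Ideograph Extension B
    (0x2A700, 0x2B734, "Lo"),  # CJK Ideograph Extension C
    (0x2B740, 0x2B81D, "Lo"),  # CJK Ideograph Extension D
]


def build_letter_table(ranges):
    """Keep only ranges whose general category starts with 'L', sorted by start."""
    return sorted((first, last) for first, last, cat in ranges if cat.startswith("L"))


def is_letter(code_point, table):
    """Return True if code_point falls inside one of the letter ranges."""
    # Find the last range whose start is <= code_point.
    i = bisect_right(table, (code_point, float("inf"))) - 1
    return i >= 0 and table[i][0] <= code_point <= table[i][1]


LETTERS = build_letter_table(RANGES)
print(is_letter(0x4E2D, LETTERS))  # inside the CJK Ideograph range
print(is_letter(0xE000, LETTERS))  # Private Use, category Co
```

Storing ranges rather than individual code points keeps the table small; whether the generated TeX file should list ranges or individual characters is a separate design decision for the script.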

The problem affects the use of our xeCJK package. I hope it will be fixed soon.

Discussion

Khaled Hosny - 2013-08-30

I don’t speak Perl, unfortunately (that script was written by Jonathan Kew of course), so patches are highly appreciated.

rink - 2013-11-29

Does something like this work?

It generates 80+k new entries from the ranges.

xetex-handle-unidata-ranges.patch
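As a rough cross-check of the "80+k" figure, the sizes of the ranges quoted in the original report can be added up. This Python sketch only does that arithmetic; it makes no claim about which ranges the attached patch actually expands, and the name `REPORTED_RANGES` is made up for the example.

```python
# Ranges quoted in the report, as inclusive (first, last) code point pairs.
REPORTED_RANGES = {
    "Hangul Syllable": (0xAC00, 0xD7A3),
    "CJK Ideograph Extension A": (0x3400, 0x4DB5),
    "CJK Ideograph": (0x4E00, 0x9FCC),
    "CJK Ideograph Extension B": (0x20000, 0x2A6D6),
    "CJK Ideograph Extension C": (0x2A700, 0x2B734),
    "CJK Ideograph Extension D": (0x2B740, 0x2B81D),
}

# Both endpoints are included, hence the +1.
sizes = {name: last - first + 1 for name, (first, last) in REPORTED_RANGES.items()}
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name}: {size}")
print(f"total: {total}")  # 85777
```

The total of 85777 code points is in line with the 80+k new entries mentioned above.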

Khaled Hosny - 2013-12-06

Leo Liu, does this patch fix the issue for you?

Leo Liu - 2013-12-06

Yes, rink's patch works. Thanks.

And I think the line

print " \\global\\catcode\\n=11" if m/ID/;

can be safely removed.

Khaled Hosny - 2014-07-25

status: open --> closed

Group: Future --> v0.99992

Khaled Hosny - 2014-07-25

Thanks, I applied the patch and will update the unicode-letters.tex file soon.
