Topics - gibbon

Development / Broken character encoding
« on: December 16, 2011, 05:58:54 pm »
Hello,
thanks for the great app and especially its Unicode support.
I've started writing an import script for a Japanese site (EUC-JP encoding, CP 20932) and I've noticed something very strange.

The raw page saved automatically to the file 'page.html' is OK. But when I process the page with the script (even when I output the HTML string right at the start of the script), it is broken in some places. For example:
Code:
<a href="/digital/videoa/-/detail/=/cid=41djk012/">猥らなほどに悩ましい 古都ひかる</a></p>
becomes
Code:
<a href="/digital/videoa/-/detail/=/cid=41djk012/">猥らなほどに悩ましい 古都ひか・E/a></p>
which, besides changing the text, destroys the whole HTML structure.
Other examples, out of many more:
Code:
~ -> ?
奥さん! ->  ・E気鵝・
女 2 -> ・E2

For many hours I've been trying different code pages for the script (20932, autodetect, UTF-8), but garbled text or errors like these are the only results.
Has anyone had a similar experience? Could there be a bug in the script parser? Is there any workaround?
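
One way to check whether this kind of damage comes from a codec mismatch rather than from the page itself is to decode the saved bytes outside the app. The Python sketch below is only an illustration under assumptions: it does not use the app's script engine, the helper names `decode_strict` and `first_difference` are hypothetical, and the Shift_JIS decode is just one example of a wrong codec, not a diagnosis of what the app actually does.

```python
# Hypothetical stand-alone check; not part of the app or its script engine.
ORIGINAL = (
    '<a href="/digital/videoa/-/detail/=/cid=41djk012/">'
    "猥らなほどに悩ましい 古都ひかる</a></p>"
)


def decode_strict(raw: bytes, encoding: str = "euc_jp") -> str:
    """Decode raw page bytes; raises UnicodeDecodeError on any invalid sequence."""
    return raw.decode(encoding, errors="strict")


def first_difference(expected: str, actual: str) -> int:
    """Return the index of the first differing character, or -1 if identical."""
    for i, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return i
    if len(expected) == len(actual):
        return -1
    # One string is a prefix of the other: they differ where the shorter ends.
    return min(len(expected), len(actual))


# Stand-in for the bytes stored in page.html (assumed here to be EUC-JP).
raw = ORIGINAL.encode("euc_jp")

# Decoding with the matching codec restores the text exactly.
good = decode_strict(raw)

# Decoding the same bytes with a mismatched codec garbles the Japanese text.
bad = raw.decode("shift_jis", errors="replace")

print("matching codec, first difference at:", first_difference(ORIGINAL, good))
print("wrong codec, first difference at:", first_difference(ORIGINAL, bad))
print(bad)
```

If the bytes read directly from 'page.html' decode cleanly with the strict EUC-JP decoder, the file itself is intact and the damage is introduced somewhere between the file and the string the script receives.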

Thank you very much in advance.
