Hi,
We just noticed that when exporting to a UTF-8 text file, FineReader Engine adds a BOM (Byte Order Mark) at the beginning of the file:
page.Export(tempTxtFile.getAbsolutePath(), FileExportFormatEnum.FEF_TextUnicodeDefaults, exportParams);
This BOM (bytes EF BB BF) marks the text as UTF-8 encoded.
But with UTF-8 it is optional and not recommended (ref. Unicode Standard 5.0). It is especially a problem in Java, which assumes that UTF-8 files don't have a BOM: the UTF-8 decoder does not strip it, so when reading the file the BOM ends up as a stray character at the start of the text (often displayed as ?), which is really annoying.
More info here: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
Currently we have a workaround (http://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker), but it would be nice to consider removing the BOM in the future or making it optional ;)
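For anyone hitting the same issue, here is a minimal sketch of the kind of workaround we mean. It is not taken from FineReader Engine or copied from the linked answer. The class and method names (BomSkippingReader, open, readAll) are made up for illustration. It relies only on the fact that Java's UTF-8 decoder turns the bytes EF BB BF into the single character U+FEFF, which can then be dropped if it is the first character read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (name is ours): reads UTF-8 text and drops a leading BOM.
final class BomSkippingReader {

    private static final int BOM = 0xFEFF; // what EF BB BF decodes to in UTF-8

    // Wraps the stream in a UTF-8 reader and discards the first character
    // only if it is the BOM; any other first character is pushed back.
    static Reader open(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != BOM) {
            reader.unread(first);
        }
        return reader;
    }

    // Convenience method: reads the whole stream into a String without the BOM.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = open(in)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

In our case it would be used on the exported file, e.g. BomSkippingReader.readAll(new FileInputStream(tempTxtFile)). Files without a BOM pass through unchanged, so the same code should keep working if the BOM becomes optional later.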
Comments
1 comment
Sorry for the delayed response.
We have passed your suggestion to our analysts and created a request to make the BOM optional. Unfortunately, we do not yet have information on when this feature will be available, but we hope it will be implemented in a future version.