Revisiting garbled file names in ZIP archives, with a fix

#1

Post by zrqlx126 » 2020-10-28 16:46

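I previously posted about this on the forum (viewtopic.php?f=35&t=491109). That approach worked around the problem by passing an extra option to unzip at launch and by patching the file-roller source so that it no longer calls p7zip to extract zip files. Because I did not understand the root cause of the garbled names at the time, the result was far from ideal.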

This time I happened upon https://github.com/unxed/oemcp, which offers a much better solution. It explains the root cause of the garbled names and proposes a fix:
Windows store file names in .zip archives using so called OEM code page. That's why you sometimes see wrong characters when trying to open .zip file. This is well-known issue plaguing open source community, see this issue for example: https://github.com/mate-desktop/engrampa/issues/5
The steps below were carried out on Ubuntu 20.10:

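# Refresh the package index and download the sources
# (apt source needs deb-src entries enabled in the APT sources)
sudo apt update
apt source unzip p7zip
# p7zip version: 16.02+dfsg-8
# unzip version: 6.0-25ubuntu1
# Install the build dependencies
sudo apt build-dep p7zip unzip
# Patch unzip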
cd unzip-6.0
cat > debian/patches/25-unzip_oemcpauto_unix.c.patch << 'EOF'
Index: unzip-6.0/unix/unix.c
===================================================================
--- a/unix/unix.c	2020-10-28 15:38:39.000000000 +0800
+++ b/unix/unix.c	2020-10-28 15:48:44.382126431 +0800
@@ -1879,13 +1879,16 @@
 #endif /* QLZIP */
 
 
+/*
 typedef struct {
     char *local_charset;
     char *archive_charset;
 } CHARSET_MAP;
+*/
 
 /* A mapping of local <-> archive charsets used by default to convert filenames
  * of DOS/Windows Zip archives. Currently very basic. */
+/*
 static CHARSET_MAP dos_charset_map[] = {
     { "ANSI_X3.4-1968", "CP850" },
     { "ISO-8859-1", "CP850" },
@@ -1895,6 +1898,57 @@
     { "KOI8-U", "CP866" },
     { "ISO-8859-5", "CP866" }
 };
+*/
+
+char *lc_to_oem_cp(char *lc) {
+    static char *lc_to_cp_table[] = {
+    "af_ZA", "CP850", "ar_SA", "CP720", "ar_LB", "CP720", "ar_EG", "CP720",
+    "ar_DZ", "CP720", "ar_BH", "CP720", "ar_IQ", "CP720", "ar_JO", "CP720",
+    "ar_KW", "CP720", "ar_LY", "CP720", "ar_MA", "CP720", "ar_OM", "CP720",
+    "ar_QA", "CP720", "ar_SY", "CP720", "ar_TN", "CP720", "ar_AE", "CP720",
+    "ar_YE", "CP720","ast_ES", "CP850", "az_AZ", "CP866", "az_AZ", "CP857",
+    "be_BY", "CP866", "bg_BG", "CP866", "br_FR", "CP850", "ca_ES", "CP850",
+    "zh_CN", "CP936", "zh_TW", "CP950", "kw_GB", "CP850", "cs_CZ", "CP852",
+    "cy_GB", "CP850", "da_DK", "CP850", "de_AT", "CP850", "de_LI", "CP850",
+    "de_LU", "CP850", "de_CH", "CP850", "de_DE", "CP850", "el_GR", "CP737",
+    "en_AU", "CP850", "en_CA", "CP850", "en_GB", "CP850", "en_IE", "CP850",
+    "en_JM", "CP850", "en_BZ", "CP850", "en_PH", "CP437", "en_ZA", "CP437",
+    "en_TT", "CP850", "en_US", "CP437", "en_ZW", "CP437", "en_NZ", "CP850",
+    "es_PA", "CP850", "es_BO", "CP850", "es_CR", "CP850", "es_DO", "CP850",
+    "es_SV", "CP850", "es_EC", "CP850", "es_GT", "CP850", "es_HN", "CP850",
+    "es_NI", "CP850", "es_CL", "CP850", "es_MX", "CP850", "es_ES", "CP850",
+    "es_CO", "CP850", "es_ES", "CP850", "es_PE", "CP850", "es_AR", "CP850",
+    "es_PR", "CP850", "es_VE", "CP850", "es_UY", "CP850", "es_PY", "CP850",
+    "et_EE", "CP775", "eu_ES", "CP850", "fa_IR", "CP720", "fi_FI", "CP850",
+    "fo_FO", "CP850", "fr_FR", "CP850", "fr_BE", "CP850", "fr_CA", "CP850",
+    "fr_LU", "CP850", "fr_MC", "CP850", "fr_CH", "CP850", "ga_IE", "CP437",
+    "gd_GB", "CP850", "gv_IM", "CP850", "gl_ES", "CP850", "he_IL", "CP862",
+    "hr_HR", "CP852", "hu_HU", "CP852", "id_ID", "CP850", "is_IS", "CP850",
+    "it_IT", "CP850", "it_CH", "CP850", "iv_IV", "CP437", "ja_JP", "CP932",
+    "kk_KZ", "CP866", "ko_KR", "CP949", "ky_KG", "CP866", "lt_LT", "CP775",
+    "lv_LV", "CP775", "mk_MK", "CP866", "mn_MN", "CP866", "ms_BN", "CP850",
+    "ms_MY", "CP850", "nl_BE", "CP850", "nl_NL", "CP850", "nl_SR", "CP850",
+    "nn_NO", "CP850", "nb_NO", "CP850", "pl_PL", "CP852", "pt_BR", "CP850",
+    "pt_PT", "CP850", "rm_CH", "CP850", "ro_RO", "CP852", "ru_RU", "CP866",
+    "sk_SK", "CP852", "sl_SI", "CP852", "sq_AL", "CP852", "sr_RS", "CP855",
+    "sr_RS", "CP852", "sv_SE", "CP850", "sv_FI", "CP850", "sw_KE", "CP437",
+    "th_TH", "CP874", "tr_TR", "CP857", "tt_RU", "CP866", "uk_UA", "CP866",
+    "ur_PK", "CP720", "uz_UZ", "CP866", "uz_UZ", "CP857", "vi_VN", "CP1258",
+    "wa_BE", "CP850", "zh_HK", "CP950", "zh_SG", "CP936"};
+    int table_len = sizeof(lc_to_cp_table) / sizeof(char *);
+    int lc_len, i;
+
+    if (lc && lc[0]) {
+        // Compare up to the dot, if it exists, e.g. en_US.UTF-8
+        for (lc_len = 0; lc[lc_len] != '.' && lc[lc_len] != '\0'; ++lc_len)
+            ;
+        for (i = 0; i < table_len; i += 2)
+            if (strncmp(lc, lc_to_cp_table[i], lc_len) == 0)
+                return lc_to_cp_table[i + 1];
+    }
+
+    return "CP437";
+}
 
 char OEM_CP[MAX_CP_NAME] = "";
 char ISO_CP[MAX_CP_NAME] = "";
@@ -1903,10 +1957,20 @@
  * ISO_CP is left alone for now. */
 void init_conversion_charsets()
 {
+    char *oemcp;
+    oemcp = getenv("OEMCP");
+    if (!oemcp) {
+        oemcp = lc_to_oem_cp(setlocale(LC_CTYPE, ""));
+    }
+    strncpy(OEM_CP, oemcp, sizeof(OEM_CP) - 1); /* bounded copy; the final byte stays NUL */
+
+    /*
     const char *local_charset;
     int i;
+    */
 
     /* Make a guess only if OEM_CP not already set. */ 
+    /*
     if(*OEM_CP == '\0') {
     	local_charset = nl_langinfo(CODESET);
     	for(i = 0; i < sizeof(dos_charset_map)/sizeof(CHARSET_MAP); i++)
@@ -1916,6 +1980,7 @@
     			break;
     		}
     }
+    */
 }
 
 /* Convert a string from one encoding to the current locale using iconv().
EOF
echo "25-unzip_oemcpauto_unix.c.patch" >> debian/patches/series
tar Jcvf ../unzip_*.debian.tar.xz debian/
# Build unzip
dpkg-buildpackage
# Patch p7zip
cd ../p7zip-16.02+dfsg
cat > debian/patches/16-oemcp_ZipItem.cpp.patch << 'EOF'
Index: p7zip-16.02+dfsg/CPP/7zip/Archive/Zip/ZipItem.cpp
===================================================================
--- a/CPP/7zip/Archive/Zip/ZipItem.cpp	2020-10-28 15:38:39.000000000 +0800
+++ b/CPP/7zip/Archive/Zip/ZipItem.cpp	2020-10-28 15:48:44.382126431 +0800
@@ -1,5 +1,10 @@
 // Archive/ZipItem.cpp
 
+#ifndef _WIN32
+#include <iconv.h>
+#include <locale.h>
+#endif
+
 #include "StdAfx.h"
 
 #include "../../../../C/CpuArch.h"
@@ -244,6 +249,86 @@
     #endif
   }
   
+  #ifndef _WIN32
+  // Convert OEM char set to UTF-8 if needed
+  // Use system locale to select code page
+
+  Byte hostOS = GetHostOS();
+  if (!isUtf8 && ((hostOS == NFileHeader::NHostOS::kFAT) || (hostOS == NFileHeader::NHostOS::kNTFS))) {
+
+    const char *oemcp;
+    oemcp = getenv("OEMCP");
+    if (!oemcp) {
+      oemcp = "CP437\0"; // CP name is 6 chars max
+
+      const char *lc_to_cp_table[] = {
+      "af_ZA", "CP850", "ar_SA", "CP720", "ar_LB", "CP720", "ar_EG", "CP720",
+      "ar_DZ", "CP720", "ar_BH", "CP720", "ar_IQ", "CP720", "ar_JO", "CP720",
+      "ar_KW", "CP720", "ar_LY", "CP720", "ar_MA", "CP720", "ar_OM", "CP720",
+      "ar_QA", "CP720", "ar_SY", "CP720", "ar_TN", "CP720", "ar_AE", "CP720",
+      "ar_YE", "CP720","ast_ES", "CP850", "az_AZ", "CP866", "az_AZ", "CP857",
+      "be_BY", "CP866", "bg_BG", "CP866", "br_FR", "CP850", "ca_ES", "CP850",
+      "zh_CN", "CP936", "zh_TW", "CP950", "kw_GB", "CP850", "cs_CZ", "CP852",
+      "cy_GB", "CP850", "da_DK", "CP850", "de_AT", "CP850", "de_LI", "CP850",
+      "de_LU", "CP850", "de_CH", "CP850", "de_DE", "CP850", "el_GR", "CP737",
+      "en_AU", "CP850", "en_CA", "CP850", "en_GB", "CP850", "en_IE", "CP850",
+      "en_JM", "CP850", "en_BZ", "CP850", "en_PH", "CP437", "en_ZA", "CP437",
+      "en_TT", "CP850", "en_US", "CP437", "en_ZW", "CP437", "en_NZ", "CP850",
+      "es_PA", "CP850", "es_BO", "CP850", "es_CR", "CP850", "es_DO", "CP850",
+      "es_SV", "CP850", "es_EC", "CP850", "es_GT", "CP850", "es_HN", "CP850",
+      "es_NI", "CP850", "es_CL", "CP850", "es_MX", "CP850", "es_ES", "CP850",
+      "es_CO", "CP850", "es_ES", "CP850", "es_PE", "CP850", "es_AR", "CP850",
+      "es_PR", "CP850", "es_VE", "CP850", "es_UY", "CP850", "es_PY", "CP850",
+      "et_EE", "CP775", "eu_ES", "CP850", "fa_IR", "CP720", "fi_FI", "CP850",
+      "fo_FO", "CP850", "fr_FR", "CP850", "fr_BE", "CP850", "fr_CA", "CP850",
+      "fr_LU", "CP850", "fr_MC", "CP850", "fr_CH", "CP850", "ga_IE", "CP437",
+      "gd_GB", "CP850", "gv_IM", "CP850", "gl_ES", "CP850", "he_IL", "CP862",
+      "hr_HR", "CP852", "hu_HU", "CP852", "id_ID", "CP850", "is_IS", "CP850",
+      "it_IT", "CP850", "it_CH", "CP850", "iv_IV", "CP437", "ja_JP", "CP932",
+      "kk_KZ", "CP866", "ko_KR", "CP949", "ky_KG", "CP866", "lt_LT", "CP775",
+      "lv_LV", "CP775", "mk_MK", "CP866", "mn_MN", "CP866", "ms_BN", "CP850",
+      "ms_MY", "CP850", "nl_BE", "CP850", "nl_NL", "CP850", "nl_SR", "CP850",
+      "nn_NO", "CP850", "nb_NO", "CP850", "pl_PL", "CP852", "pt_BR", "CP850",
+      "pt_PT", "CP850", "rm_CH", "CP850", "ro_RO", "CP852", "ru_RU", "CP866",
+      "sk_SK", "CP852", "sl_SI", "CP852", "sq_AL", "CP852", "sr_RS", "CP855",
+      "sr_RS", "CP852", "sv_SE", "CP850", "sv_FI", "CP850", "sw_KE", "CP437",
+      "th_TH", "CP874", "tr_TR", "CP857", "tt_RU", "CP866", "uk_UA", "CP866",
+      "ur_PK", "CP720", "uz_UZ", "CP866", "uz_UZ", "CP857", "vi_VN", "CP1258",
+      "wa_BE", "CP850", "zh_HK", "CP950", "zh_SG", "CP936"};
+      int table_len = sizeof(lc_to_cp_table) / sizeof(char *);
+      int lc_len, i;
+
+      char *lc = setlocale(LC_CTYPE, "");
+
+      if (lc && lc[0]) {
+          // Compare up to the dot, if it exists, e.g. en_US.UTF-8
+          for (lc_len = 0; lc[lc_len] != '.' && lc[lc_len] != '\0'; ++lc_len)
+              ;
+          for (i = 0; i < table_len; i += 2)
+              if (strncmp(lc, lc_to_cp_table[i], lc_len) == 0)
+                  oemcp = lc_to_cp_table[i + 1];
+      }
+    }
+
+    iconv_t cd;
+    if ((cd = iconv_open("UTF-8", oemcp)) != (iconv_t)-1) {
+
+      AString s_utf8;
+      const char* src = s.Ptr();
+      size_t slen = s.Len();
+      size_t dlen = slen * 4;
+      const char* dest = s_utf8.GetBuf_SetEnd(dlen + 1); // (source length * 4) + null termination
+
+      size_t done = iconv(cd, (char**)&src, &slen, (char**)&dest, &dlen);
+      *(char *)dest = '\0'; // iconv() left dest pointing just past the converted bytes
+
+      iconv_close(cd);
+
+      if (ConvertUTF8ToUnicode(s_utf8, res) || ignore_Utf8_Errors)
+        return;
+    }    
+  }
+  #endif
   
   if (isUtf8)
     if (ConvertUTF8ToUnicode(s, res) || ignore_Utf8_Errors)
EOF
echo "16-oemcp_ZipItem.cpp.patch" >> debian/patches/series
tar Jcvf ../p7zip_*.debian.tar.xz debian/
# Build p7zip
dpkg-buildpackage
# Install
sudo dpkg -i ../unzip_6.0-25ubuntu1_amd64.deb
sudo dpkg -i ../p7zip_16.02+dfsg-8_amd64.deb
sudo dpkg -i ../p7zip-full_16.02+dfsg-8_amd64.deb
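
That completes the procedure. In my tests afterwards, file names were no longer garbled.
The ready-made patch files are attached below (remove the .txt suffix after downloading). I recommend patching only p7zip.
16-oemcp_ZipItem.cpp.patch.txt (4.52 KiB)
25-unzip_oemcpauto_unix.c.patch.txt (4.39 KiB)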
Last edited by zrqlx126 on 2020-10-30 9:24; edited 3 times in total.

#2

Post by zzugyl » 2020-10-29 9:43

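Archives compressed with 7z on Windows show garbled names under Ubuntu, whereas archives compressed with Bandizip do not.
Your patched packages tested fine on my Ubuntu 20.04.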

#3

Post by 百草谷居士 » 2020-10-29 10:20

This is over my head; it looks very advanced.

Could such a good fix be submitted upstream? To Ubuntu? Debian? Linux/GNU?

#4

Post by astolia » 2020-10-29 18:58

zrqlx126 wrote (2020-10-28 16:46): It explains the root cause of the garbled names and proposes a fix.
Windows store file names in .zip archives using so called OEM code page. That's why you sometimes see wrong characters when trying to open .zip file. This is well-known issue plaguing open source community, see this issue for example: https://github.com/mate-desktop/engrampa/issues/5
That is not the complete root cause, and the fix is not perfect either.

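The first ZIP format specification was published in 1989 and placed no requirement on how file names are encoded. Each program was therefore free to encode names however it liked without violating the specification. Most programs simply use the operating system's default encoding; on Windows that is the OEM code page mentioned above.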

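Not until 2007 did the ZIP specification say anything definite about file name encoding: it added a flag bit which, when set, indicates that the file name is encoded in UTF-8. It did not say how to interpret the name when the flag is not set, so even programs updated to the latest specification kept their old behaviour for archives without the flag.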
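
To see whether a given archive sets this flag, one can inspect the general purpose bit flags of each entry. The following is a minimal Python sketch, not part of the patch. It assumes the flag is bit 11 (mask 0x800) of the general purpose flags, and the helper names `make_demo_zip` and `utf8_flags` are made up for illustration.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is UTF-8


def make_demo_zip():
    """Build a small in-memory archive with one ASCII and one non-ASCII name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("readme.txt", b"ascii name")
        zf.writestr("中文.txt", b"non-ascii name")
    return buf.getvalue()


def utf8_flags(zip_bytes):
    """Map each entry name to True if its UTF-8 flag bit is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}
```

Python's own `zipfile` writer sets the flag only for names that are not pure ASCII. An archive whose non-ASCII names report `False` here is one where the reading program has to guess the encoding.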

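What this patch does is guess the likely file name encoding from the user's current LC_CTYPE locale. In a Simplified Chinese environment (zh_CN.*), it tries to convert the names from CP936, the code page used by Simplified Chinese Windows, to UTF-8.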
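
The guessing logic can be sketched in a few lines of Python. This is only an illustration of the idea, not the patch itself: the table is a short excerpt of the one in the patch, the function name `guess_oem_cp` and its `environ` parameter are hypothetical, and it uses an exact dictionary match where the patch compares prefixes with `strncmp`.

```python
import os

# Excerpt of the patch's locale -> OEM code page table (not the full list).
LOCALE_TO_OEM_CP = {
    "zh_CN": "CP936", "zh_TW": "CP950", "ja_JP": "CP932",
    "ko_KR": "CP949", "ru_RU": "CP866", "de_DE": "CP850",
    "en_US": "CP437",
}


def guess_oem_cp(locale_name, environ=None):
    """Return the code page to assume for unflagged zip file names."""
    environ = os.environ if environ is None else environ
    # An explicit OEMCP setting wins over any guess, as in the patch.
    if "OEMCP" in environ:
        return environ["OEMCP"]
    # Compare only the part before the dot, e.g. "zh_CN" from "zh_CN.UTF-8".
    language_territory = (locale_name or "").split(".", 1)[0]
    return LOCALE_TO_OEM_CP.get(language_territory, "CP437")
```

In a real program the locale string could come from `locale.setlocale(locale.LC_CTYPE, "")`, which mirrors the `setlocale` call in the patch. Unknown locales fall back to CP437, as the patch does.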

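The problems with this approach are easy to see. First, the user's locale does not necessarily use UTF-8. That is probably rare today, but ten or more years ago plenty of tutorials told people to switch their locale to GB2312/GBK/GB18030 to stop Windows files from showing up garbled. For anyone who still does this, a patch that unconditionally converts to UTF-8 will garble file names that previously displayed correctly. Second, the language setting of the user's system need not match the language and region of the system on which the ZIP file was created. Simplified Chinese users may well need to open zip files made by users of other languages.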
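
Both failure modes come down to bytes written in one encoding being read in another. A small Python sketch can simulate that; the function name `simulate` is hypothetical, and `errors="replace"` is used only so that the demonstration never raises.

```python
def simulate(name, written_as, read_as):
    """Encode a name as the writer would, then decode it as the reader would."""
    raw = name.encode(written_as)
    return raw.decode(read_as, errors="replace")


# Problem 1: the patch emits UTF-8, but the terminal locale expects GBK.
utf8_on_gbk_locale = simulate("中文", "utf-8", "gbk")

# Problem 2: a Japanese archive (CP932) opened on a zh_CN system (CP936 guess).
japanese_on_zh_cn = simulate("テスト", "cp932", "cp936")

# Matching encodings on both sides round-trip cleanly.
matching = simulate("中文", "cp936", "cp936")
```

Only the last call returns the original name. The first two return text that differs from what was written, which is the garbling described above.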

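The patch lets you specify the encoding by hand through the OEMCP environment variable, which is one way of dealing with the problems above. For the second problem, however, you may have to set OEMCP repeatedly by trial and error, which is very awkward when working through graphical front ends such as file-roller or engrampa. It would be nice if, on top of this automatic guess, the application's interface offered a way to pick the encoding manually.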
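
One way a front end could support manual selection is to decode the raw name under several candidate code pages and let the user pick the one that reads correctly. The sketch below is an assumption about how that might look, not something file-roller or engrampa provide; `candidate_decodings` and its default candidate list are made up for illustration.

```python
def candidate_decodings(raw_name,
                        encodings=("cp936", "cp950", "cp932",
                                   "cp949", "cp866", "cp437")):
    """Return {encoding: decoded name} for every candidate that decodes cleanly."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw_name.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this code page; skip it.
            continue
    return results


# Raw bytes of a name as a Simplified Chinese Windows archiver would store it.
choices = candidate_decodings("中文".encode("cp936"))
```

Several candidates usually decode without error, so a clean decode alone does not identify the right code page. The user still has to choose, but from a list rather than by re-running the tool with a different OEMCP each time.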

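Of course, if every archiver kept up with the times and set the flag when compressing, and every user ran current software, none of this mess would exist.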

#5

Post by 百草谷居士 » 2020-10-29 21:39

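As I recall mentioning before, I have almost never run into garbled names with PeaZip.
I have not tested this specifically; anyone interested is welcome to try.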

#6

Post by zrqlx126 » 2020-10-30 7:51

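Agreed (in reply to astolia's post #4). This approach still picks the character encoding from the system's locale environment when extracting a zip archive, so it is not fundamentally different from the method in my earlier post. The same Chinese zip archive created on Windows will still come out garbled in an English locale, even with the patches above applied. unzip at least has an option for choosing the character encoding, whereas p7zip has none at all. To be on the safe side, I therefore recommend patching only p7zip.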
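
For archives that no locale guess gets right, the encoding can also be chosen explicitly outside of either tool. (On Debian and Ubuntu builds of unzip the option mentioned above is, to my knowledge, `-O CHARSET`.) The Python sketch below shows the same idea with the standard `zipfile` module. It relies on `zipfile` decoding unflagged names as CP437, so re-encoding as CP437 recovers the raw bytes; `list_names` and `fallback_encoding` are hypothetical names.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is UTF-8


def list_names(zip_bytes, fallback_encoding):
    """List entry names, decoding unflagged names with the given encoding."""
    names = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            name = info.filename
            if not info.flag_bits & UTF8_FLAG:
                # zipfile read the raw bytes as CP437; undo that step and
                # decode again with the encoding the caller asked for.
                name = name.encode("cp437").decode(fallback_encoding)
            names.append(name)
    return names
```

Called as `list_names(data, "cp936")` on an archive made by a Simplified Chinese Windows tool, this should give readable names regardless of the locale it runs in. Entries that set the UTF-8 flag are left alone.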