Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of character vectors in R.

The text file I read the table from is encoded in UTF-8 (saved via Notepad); I also tried UTF-8 without a BOM.

I want to read a table from this text file, convert it to a data.table, set a key, and use binary search. When I try to do so, the following appears:

Warning message:
In `[.data.table`(poli.dt, "żżonymi", mult = "first") :
A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support
mixed encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and
others not. But if either latin1 or UTF-8 is used exclusively, and all
unknown encodings are ascii, then the result should be ok. In future
we will check for you and avoid this warning if everything is ok. The
tricky part is doing this without impacting performance for ascii-only
cases.

and the binary search does not work.

I realized that the key column of my data.table contains both "unknown" and "UTF-8" encoding marks:

> table(Encoding(poli.dt$word))
unknown   UTF-8 
2061312 2739122

I tried to convert this column (before creating the data.table object) using:

> Encoding(word) <- "UTF-8"
> word <- enc2utf8(word)

but with no effect.

I also tried several different ways of reading the file into R (setting all the relevant parameters, e.g. encoding = "UTF-8"):

> data.table::fread
> utils::read.table
> base::scan
> colbycol::cbc.read.table

but with no effect.

==================================================

My R.version:

> R.version
           _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          0.3                         
year           2014                        
month          03                          
day            06                          
svn rev        65126                       
language       R                           
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy

My session info:

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250                LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2 colbycol_0.8     filehash_2.2-2   rJava_0.9-6     

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.1    reshape2_1.2.2 stringr_0.6.2  tools_3.0.3

The Encoding function returns "unknown" if a string carries a "native encoding" mark (CP-1250 in your case) or if it is pure ASCII.
To distinguish between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)
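
If stringi is not available, the same distinction can be approximated in base R by looking at the raw bytes. The following is only a sketch: the three sample strings and all variable names are made up for illustration and are not taken from the question's data.

```r
# Three illustrative strings (hypothetical sample data).
ascii_str  <- "abc"                    # pure ASCII
utf8_str   <- "\u017c\u00f3\u0142w"    # non-ASCII, marked UTF-8 by the \u escapes
native_str <- rawToChar(as.raw(0xbf))  # a single non-ASCII byte with no encoding mark

x <- c(ascii_str, utf8_str, native_str)

# Encoding() alone cannot tell ascii_str and native_str apart.
enc_marks <- Encoding(x)

# TRUE for every string that contains at least one byte above 0x7F.
has_non_ascii <- vapply(
  x,
  function(s) any(as.integer(charToRaw(s)) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# Strings that are unmarked AND non-ASCII are the ones worth a closer look.
ambiguous <- enc_marks == "unknown" & has_non_ascii
```

Here only the third element ends up flagged as ambiguous; pure-ASCII strings reported as "unknown" are not a concern by themselves.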

To check whether every string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word))

If this is not TRUE, then your file is definitely not in UTF-8.
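
A comparable check exists in base R as validUTF8(), assuming a version of R that ships it (it is not available in the R 3.0.3 shown in the question). A minimal sketch with made-up sample strings:

```r
# Hypothetical sample data: one valid UTF-8 string and one invalid byte sequence.
good <- "\u017c\u00f3\u0142w"
bad  <- rawToChar(as.raw(c(0x7a, 0xbf)))  # "z" followed by a lone continuation byte

valid <- validUTF8(c(good, bad))

# all(valid) plays the role of all(stri_enc_isutf8(...)) in this sketch.
all_valid <- all(valid)
```

The second string fails the check because 0xBF cannot appear in UTF-8 without a preceding lead byte, which is typical of text that was really saved in a single-byte code page.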

I suspect that you have not forced UTF-8 mode in the data-reading function (inspect the contents of poli.dt$word to verify this). If my guess is right, try:

read.csv2(file("filename", encoding="UTF-8"))

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings
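
One detail that may explain why Encoding<- and enc2utf8 appeared to have "no effect": R never attaches an encoding mark to a pure-ASCII string, so a column can still report both "unknown" and "UTF-8" after a successful conversion. A small base R sketch, with illustrative data that is not from the question:

```r
# Hypothetical sample: one ASCII string and one non-ASCII string marked UTF-8.
x <- c("abc", "\u017c\u00f3\u0142w")

# Converting to UTF-8 leaves the ASCII element unmarked.
converted <- enc2utf8(x)
marks_after_convert <- Encoding(converted)

# Declaring the encoding explicitly does not mark the ASCII element either.
forced <- x
Encoding(forced) <- "UTF-8"
marks_after_force <- Encoding(forced)
```

In both cases the first element stays "unknown" and the second is "UTF-8". A mix like this is the harmless situation the data.table warning describes (UTF-8 used exclusively, all unknown strings ASCII); whether that is what the question's column actually contains can only be confirmed with the checks above.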

If data.table still complains about "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"