Detecting whether a byte stream is UTF-8 encoded
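A few days ago I happened to see a forum post asking: "How can I automatically tell whether a Chinese parameter in a URL is encoded in GB2312 or UTF-8?"
I also read wcwtitxu's algorithm, which detects UTF-8 with an impressively elaborate regular expression.
A regular expression built from that many alternation branches, however, does not perform well in practice.
I had a similar requirement on an earlier project, so here is the approach I used, together with the tidied-up source code, for reference.
First, the principle: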
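The UTF-8 encoding rules are shown in the following table:

Code point range       Byte 1     Byte 2     Byte 3     Byte 4
U+0000 - U+007F        0xxxxxxx
U+0080 - U+07FF        110xxxxx   10xxxxxx
U+0800 - U+FFFF        1110xxxx   10xxxxxx   10xxxxxx
U+10000 - U+10FFFF     11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

It looks complicated, but it boils down to this:
ASCII characters (U+0000 - U+007F) are not transformed; each is stored as a single byte.
All other characters are encoded as follows:
The first byte starts with 11, and the number of leading 1 bits equals the total number of bytes in the sequence.
Every following byte starts with 10.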
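As a rough illustration of these rules, the sketch below classifies a single byte by its leading bits. The class and method names (Utf8LeadByteSketch, SequenceLength, IsContinuation) are hypothetical and do not come from the original code; it assumes the modern four-byte limit on UTF-8 sequences.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class Utf8LeadByteSketch
{
    // Returns the total number of bytes in the sequence that starts with
    // leadByte, or 0 if the byte cannot start a sequence.
    public static int SequenceLength(byte leadByte)
    {
        if ((leadByte & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((leadByte & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((leadByte & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((leadByte & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0; // 10xxxxxx (continuation byte) or an invalid lead byte
    }

    // True if the byte has the 10xxxxxx pattern of a following byte.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }
}
```

For example, the character '中' (U+4E2D) is encoded as E4 B8 AD: SequenceLength(0xE4) reports a three-byte sequence, and IsContinuation is true for both 0xB8 and 0xAD.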
Based on these rules, here is my C# code:
/// <summary>
/// Determines whether the given <paramref name="inputStream"/> contains UTF-8 encoded bytes.
/// </summary>
/// <param name="inputStream">
/// The input stream.
/// </param>
/// <returns>
/// <see langword="true"/> if the given byte stream is UTF-8 encoded; otherwise, <see langword="false"/>.
/// </returns>
/// <remarks>
/// Input that consists only of ASCII chars is not regarded as UTF-8.
/// </remarks>
public static bool IsTextUTF8(ref byte[] inputStream)
{
    int encodingBytesCount = 0;
    bool allTextsAreASCIIChars = true;

    for (int i = 0; i < inputStream.Length; i++)
    {
        byte current = inputStream[i];

        if ((current & 0x80) == 0x80)
        {
            allTextsAreASCIIChars = false;
        }

        // First byte
        if (encodingBytesCount == 0)
        {
            if ((current & 0x80) == 0)
            {
                // ASCII chars, from 0x00-0x7F
                continue;
            }

            if ((current & 0xC0) == 0xC0)
            {
                encodingBytesCount = 1;
                current <<= 2;

                // More than two bytes are used to encode a Unicode char.
                // Calculate the real length.
                while ((current & 0x80) == 0x80)
                {
                    current <<= 1;
                    encodingBytesCount++;
                }
            }
            else
            {
                // Invalid bit structure for the UTF-8 encoding rule.
                return false;
            }
        }
        else
        {
            // Following bytes must start with 10.
            if ((current & 0xC0) == 0x80)
            {
                encodingBytesCount--;
            }
            else
            {
                // Invalid bit structure for the UTF-8 encoding rule.
                return false;
            }
        }
    }

    if (encodingBytesCount != 0)
    {
        // Invalid bit structure for the UTF-8 encoding rule.
        // Wrong count of following bytes.
        return false;
    }

    // Although UTF-8 can encode ASCII chars, an input stream whose contents
    // are all ASCII is regarded as being in the default encoding.
    return !allTextsAreASCIIChars;
}
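Note that the scan above checks only the lead-byte and following-byte bit patterns, so it would accept some sequences that a strict decoder rejects, for example the overlong form C0 80. As a possible cross-check, and not part of the original approach, the sketch below asks the framework's own decoder to validate the bytes. The class and method names (StrictUtf8Check, IsValidUtf8) are hypothetical; UTF8Encoding with throwOnInvalidBytes set to true is the standard .NET API.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StrictUtf8Check
{
    // The second argument (throwOnInvalidBytes) makes the decoder throw
    // instead of silently substituting U+FFFD for malformed input.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    public static bool IsValidUtf8(byte[] data)
    {
        try
        {
            Strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // Malformed, truncated, or overlong sequence.
            return false;
        }
    }
}
```

Unlike IsTextUTF8, this sketch returns true for pure ASCII input, because ASCII is valid UTF-8. It also allocates a string for the decoded text, so the hand-written scan is likely the cheaper choice on a hot path.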
The unit test code follows:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using Microsoft.VisualStudio.TestTools.UnitTesting;

/// <summary>
/// This is a test class for EncodingHelperTest and is intended
/// to contain all EncodingHelperTest Unit Tests
/// </summary>
[TestClass()]
public class EncodingHelperTest
{
    /// <summary>
    /// Normal test for this method.
    /// </summary>
    [TestMethod()]
    public void IsTextUTF8Test()
    {
        for (int i = 0; i < 1000; i++)
        {
            List<char> chars = new List<char>();
            chars.Add('中');
            List<UnicodeCategory> temp = new List<UnicodeCategory>();
            Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));

            for (int j = 0; j < 255; j++)
            {
                char ch = (char)rd.Next(0xFFFF);
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);

                if (uc == UnicodeCategory.Surrogate ||   // A single surrogate cannot be encoded correctly.
                    uc == UnicodeCategory.PrivateUse ||  // Private use blocks should be excluded.
                    uc == UnicodeCategory.OtherNotAssigned)
                {
                    j--;
                }
                else
                {
                    chars.Add(ch);
                    temp.Add(uc);
                }
            }

            String str = new String(chars.ToArray());
            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool expected = true;
            bool actual;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, String.Format("UTF8_Assert Fails at:{0}", str));

            // Note: on .NET Core and .NET 5+, code page 932 is only available
            // after registering CodePagesEncodingProvider.
            inputStream = Encoding.GetEncoding(932).GetBytes(str);
            expected = false;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, String.Format("ShiftJIS_Assert Fails at:{0}", str));
        }
    }

    /// <summary>
    /// Check with all ASCII chars.
    /// </summary>
    [TestMethod]
    public void IsTextUTF8Test_AllASCII()
    {
        String str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";
        byte[] inputStream = Encoding.UTF8.GetBytes(str);
        bool expected = false;
        bool actual;
        actual = EncodingHelper.IsTextUTF8(ref inputStream);
        Assert.AreEqual(expected, actual, str);
    }
}
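Also:
If you need to decide whether a file is UTF-8 encoded, you do not necessarily have to use this method, because files saved as UTF-8 often begin with a BOM (the three bytes EF BB BF) that marks the file as UTF-8. The BOM is optional, though, so a file without one may still be UTF-8.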
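A minimal sketch of such a BOM check follows. The UTF-8 BOM is the three-byte sequence EF BB BF, which Encoding.UTF8.GetPreamble() returns in .NET. The class and method names (Utf8BomSketch, StartsWithUtf8Bom) are hypothetical, and the caller is assumed to have already read the first bytes of the file into an array.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8BomSketch
{
    // Returns true if data begins with the UTF-8 byte order mark (EF BB BF).
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data == null || data.Length < bom.Length)
        {
            return false;
        }

        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i])
            {
                return false;
            }
        }

        return true;
    }
}
```

A positive result is a strong hint that the file is UTF-8. A negative result proves nothing, so a content scan such as IsTextUTF8 is still needed for files without a BOM.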
Reference:
Wikipedia: http://en.wikipedia.org/wiki/UTF-8