Detecting Whether a Byte Stream Is UTF-8 Encoded
Published 2022-05-30 on 大佬教程 (code.js-code.com)

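A few days ago I came across a forum post asking "how to automatically determine whether the Chinese parameters in a URL are encoded as GB2312 or UTF-8".

I also read wcwtitxu's algorithm, which detects UTF-8 encoding with a very impressive regular expression.

However, a regular expression built from that many alternations does not perform well.

I happened to have a similar requirement in an earlier project, so I am posting the approach and the cleaned-up source code here for reference.

First, the principle: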

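The UTF-8 encoding rules are shown in the table below.

[Table: UTF8 Encoding Rule]

It looks complicated, but it can be summarized as follows:

ASCII characters (U+0000 - U+007F) are stored unchanged, one byte each.

All other code points are encoded as multi-byte sequences: the first byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0 bit, and every following byte starts with the bits 10.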
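As a rough illustration of these bit patterns, the sketch below classifies a single byte by its leading bits. The class and method names (`Utf8LeadByte`, `SequenceLength`, `IsContinuation`) are hypothetical and not part of the original code. It assumes the four-byte limit of RFC 3629 and looks only at the bit pattern, so it does not exclude values such as 0xC0 or 0xF5 that the standard forbids for other reasons.

```csharp
using System;

public static class Utf8LeadByte
{
    // Returns the total length (1 to 4) of the sequence that a byte with this
    // bit pattern would start, or 0 if the pattern cannot start a sequence.
    public static int SequenceLength(byte leadByte)
    {
        if ((leadByte & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((leadByte & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((leadByte & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((leadByte & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0; // 10xxxxxx (continuation byte) or 11111xxx (unused)
    }

    // Continuation bytes always have the form 10xxxxxx.
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }
}
```

For example, the character '中' (U+4E2D) is encoded as the bytes E4 B8 AD: `SequenceLength(0xE4)` reports a three-byte sequence, and both B8 and AD satisfy `IsContinuation`.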

Based on these rules, my C# code is as follows:

        /// <summary>
        ///   Determines whether the given <paramref name="inputStream"/> contains UTF-8 encoded bytes.
        /// </summary>
        /// <param name="inputStream">
        ///   The input bytes.
        /// </param>
        /// <returns>
        ///   <see langword="true"/> if the given byte stream is UTF-8 encoded; otherwise, <see langword="false"/>.
        /// </returns>
        /// <remarks>
        ///   Input that consists only of ASCII characters is reported as not UTF-8.
        ///   Only the bit structure is checked; overlong forms and lead bytes with
        ///   more than four leading 1 bits are not rejected.
        /// </remarks>
        public static bool IsTextUTF8(ref byte[] inputStream)
        {
            // Number of continuation bytes still expected for the current character.
            int encodingBytesCount = 0;
            bool allTextsAreASCIIChars = true;

            for (int i = 0; i < inputStream.Length; i++)
            {
                byte current = inputStream[i];

                if ((current & 0x80) == 0x80)
                {
                    allTextsAreASCIIChars = false;
                }

                // First byte of a character
                if (encodingBytesCount == 0)
                {
                    if ((current & 0x80) == 0)
                    {
                        // ASCII chars, from 0x00-0x7F
                        continue;
                    }

                    if ((current & 0xC0) == 0xC0)
                    {
                        // Lead byte starts with 11: at least one continuation byte follows.
                        encodingBytesCount = 1;
                        current <<= 2;

                        // More than two bytes are used to encode this Unicode char.
                        // Each further leading 1 bit adds one continuation byte.
                        while ((current & 0x80) == 0x80)
                        {
                            current <<= 1;
                            encodingBytesCount++;
                        }
                    }
                    else
                    {
                        // Invalid bit structure for the UTF-8 encoding rule.
                        return false;
                    }
                }
                else
                {
                    // Following bytes must start with 10.
                    if ((current & 0xC0) == 0x80)
                    {
                        encodingBytesCount--;
                    }
                    else
                    {
                        // Invalid bit structure for the UTF-8 encoding rule.
                        return false;
                    }
                }
            }

            if (encodingBytesCount != 0)
            {
                // Invalid bit structure for the UTF-8 encoding rule:
                // the last character is missing continuation bytes.
                return false;
            }

            // Although UTF-8 can encode ASCII chars, an input stream whose contents
            // are all ASCII is treated as being in the default encoding.
            return !allTextsAreASCIIChars;
        }
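As a point of comparison rather than a replacement for the hand-written scan above, the framework's own decoder can be asked to fail on malformed input. The sketch below uses a different technique: it builds a `System.Text.UTF8Encoding` with `throwOnInvalidBytes` set to `true` and treats a `DecoderFallbackException` as "not UTF-8". The class and method names (`StrictUtf8Check`, `IsWellFormedUtf8`) are hypothetical. Unlike `IsTextUTF8`, this check returns `true` for pure ASCII input, and it allocates a string, so it is likely slower on large buffers.

```csharp
using System;
using System.Text;

public static class StrictUtf8Check
{
    // throwOnInvalidBytes: true makes decoding throw on malformed input
    // instead of silently substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public static bool IsWellFormedUtf8(byte[] bytes)
    {
        try
        {
            // The decoded string is discarded; only success or failure matters.
            Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A call such as `StrictUtf8Check.IsWellFormedUtf8(new byte[] { 0xE4, 0xB8, 0xAD })` accepts the encoding of '中', while a truncated sequence such as `{ 0xE4, 0xB8 }` is rejected.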

 

 

Here is the unit test code as well:

 

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    /// <summary>
    ///   This is a test class for EncodingHelper and is intended
    ///   to contain all EncodingHelper unit tests.
    /// </summary>
    [TestClass()]
    public class EncodingHelperTest
    {
        /// <summary>
        ///   Normal test for this method.
        /// </summary>
        [TestMethod()]
        public void IsTextUTF8Test()
        {
            for (int i = 0; i < 1000; i++)
            {
                List<char> chars = new List<char>();
                // Guarantee at least one non-ASCII character in every string.
                chars.Add('中');

                List<UnicodeCategory> temp = new List<UnicodeCategory>();
                Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));

                for (int j = 0; j < 255; j++)
                {
                    char ch = (char)rd.Next(0xFFFF);
                    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
                    if (uc == UnicodeCategory.Surrogate || // A single surrogate cannot be encoded correctly.
                        uc == UnicodeCategory.PrivateUse || // Private use blocks should be excluded.
                        uc == UnicodeCategory.OtherNotAssigned
                        )
                    {
                        // Rejected character: retry this slot.
                        j--;
                    }
                    else
                    {
                        chars.Add(ch);
                        temp.Add(uc);
                    }
                }

                String str = new String(chars.ToArray());

                byte[] inputStream = Encoding.UTF8.GetBytes(str);
                bool expected = true;
                bool actual;
                actual = EncodingHelper.IsTextUTF8(ref inputStream);
                Assert.AreEqual(expected, actual, String.Format("UTF8_Assert Fails at:{0}", str));

                // Code page 932 is Shift-JIS. On .NET Core and later it is only available
                // after registering CodePagesEncodingProvider.
                inputStream = Encoding.GetEncoding(932).GetBytes(str);
                expected = false;

                actual = EncodingHelper.IsTextUTF8(ref inputStream);
                Assert.AreEqual(expected, actual, String.Format("ShiftJIS_Assert Fails at:{0}", str));
            }
        }

        /// <summary>
        ///   Check with all-ASCII chars.
        /// </summary>
        [TestMethod]
        public void IsTextUTF8Test_AllASCII()
        {
            String str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";

            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool expected = false;
            bool actual;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, str);
        }
    }

 

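Also:

If you only need to tell whether a file is UTF-8 encoded, you do not necessarily have to use this method, because files saved as UTF-8 often begin with a three-byte BOM (EF BB BF) that marks the file as UTF-8. Files saved without a BOM still need a content check like the one above.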
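The BOM check can be sketched in a few lines. The UTF-8 BOM is the three bytes EF BB BF; the class and method names (`Utf8Bom`, `HasUtf8Bom`) are hypothetical, and the caller is assumed to have already read at least the first bytes of the file into an array.

```csharp
using System;

public static class Utf8Bom
{
    // Returns true if the data starts with the UTF-8 byte order mark EF BB BF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xEF
            && data[1] == 0xBB
            && data[2] == 0xBF;
    }
}
```

A missing BOM does not prove that a file is not UTF-8, so a `false` result here only means the content itself still has to be inspected.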

Wikipedia: http://en.wikipedia.org/wiki/UTF-8
