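Programming Notes. Published: 2022-05-30. Site: 大佬教程 code.js-code.com

This article, collected by 大佬教程, explains how to detect whether a byte stream is UTF-8 encoded. It is shared here for reference.

A few days ago I came across a forum post asking "how to automatically determine whether Chinese parameters in a URL are encoded in GB2312 or UTF-8".

I also read wcwtitxu's algorithm, which detects UTF-8 encoding with an impressively elaborate regular expression.

However, a regular expression built from that many alternation branches does not perform well.

I happened to have a similar requirement in a project, so I am posting the approach and the cleaned-up source code here for reference.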

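First, the principle:

The UTF-8 encoding rules are shown in the table below.

UTF8 Encoding Rule

    Code point range      Byte 1     Byte 2     Byte 3     Byte 4
    U+0000 - U+007F       0xxxxxxx
    U+0080 - U+07FF       110xxxxx   10xxxxxx
    U+0800 - U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
    U+10000 - U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

It looks complicated, but it boils down to this:

ASCII characters (U+0000 - U+007F) are not transformed; each one is stored as a single byte.

The remaining rules are:

The first byte of a multi-byte sequence starts with as many 1 bits as the sequence has bytes, followed by a 0 bit.

Every following byte starts with the bits 10.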
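To make the bit patterns concrete, here is a small sketch that prints the UTF-8 bytes of a string in binary. The class name `Utf8LayoutDemo` and the helper `ToBinary` are made up for this illustration and are not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8LayoutDemo
{
    // Hypothetical helper: returns the UTF-8 bytes of the text
    // as space-separated 8-bit binary strings.
    public static string ToBinary(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return string.Join(" ", bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
    }

    public static void Main()
    {
        // ASCII: a single byte whose top bit is 0.
        Console.WriteLine(ToBinary("A"));   // 01000001

        // U+4E2D needs three bytes: the first starts with 1110,
        // the two following bytes start with 10.
        Console.WriteLine(ToBinary("中"));  // 11100100 10111000 10101101
    }
}
```

The detection code only has to look at these leading bits: count the 1 bits of the first byte, then check that the right number of bytes starting with 10 follow.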

Based on these rules, here is my C# code:

        /// <summary>
        ///   Determines whether the given <paramref name="inputStream"/> is UTF8 encoding bytes.
        /// </summary>
        /// <param name="inputStream">
        ///    The input stream.
        /// </param>
        /// <returns>
        ///   <see langword="true"/> if the given byte stream is in UTF8 encoding; otherwise, <see langword="false"/>.
        /// </returns>
        /// <remarks>
        ///   Input that contains only ASCII chars is regarded as not UTF8 encoded.
        /// </remarks>
        public static bool IsTextUTF8(ref byte[] inputStream)
        {
            // Number of following bytes still expected for the current char.
            int encodingBytesCount = 0;
            bool allTextsAreASCIIChars = true;

            for (int i = 0; i < inputStream.Length; i++)
            {
                byte current = inputStream[i];

                if ((current & 0x80) == 0x80)
                {
                    allTextsAreASCIIChars = false;
                }

                // First byte
                if (encodingBytesCount == 0)
                {
                    if ((current & 0x80) == 0)
                    {
                        // ASCII chars, from 0x00-0x7F
                        continue;
                    }

                    if ((current & 0xC0) == 0xC0)
                    {
                        encodingBytesCount = 1;
                        current <<= 2;

                        // More than two bytes are used to encode a unicode char.
                        // Calculate the real length.
                        while ((current & 0x80) == 0x80)
                        {
                            current <<= 1;
                            encodingBytesCount++;
                        }

                        // RFC 3629 limits a sequence to four bytes,
                        // i.e. at most three following bytes.
                        if (encodingBytesCount > 3)
                        {
                            return false;
                        }
                    }
                    else
                    {
                        // Invalid bits structure for UTF8 encoding rule.
                        return false;
                    }
                }
                else
                {
                    // Following bytes, must start with 10.
                    if ((current & 0xC0) == 0x80)
                    {
                        encodingBytesCount--;
                    }
                    else
                    {
                        // Invalid bits structure for UTF8 encoding rule.
                        return false;
                    }
                }
            }

            if (encodingBytesCount != 0)
            {
                // Invalid bits structure for UTF8 encoding rule.
                // Wrong following bytes count.
                return false;
            }

            // Although UTF8 can encode ASCII chars, an input stream whose contents
            // are all ASCII is regarded as being in the default encoding.
            return !allTextsAreASCIIChars;
        }
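As a cross-check, the framework's own decoder can be asked to reject invalid input. The sketch below is not the author's method; it assumes a hypothetical helper named `IsValidUtf8` and relies on `UTF8Encoding` constructed with `throwOnInvalidBytes: true`. It is stricter than the bit-pattern check (it also rejects overlong forms), and unlike `IsTextUTF8` it reports pure ASCII input as valid.

```csharp
using System;
using System.Text;

public static class StrictUtf8Check
{
    // Hypothetical helper: true when the strict framework decoder accepts the bytes.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            // Decoding throws DecoderFallbackException on any invalid sequence.
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("中文");
        byte[] gb2312 = { 0xD6, 0xD0 };     // "中" as encoded in GB2312
        byte[] overlong = { 0xC0, 0x80 };   // overlong form of U+0000

        Console.WriteLine(IsValidUtf8(utf8));      // True
        Console.WriteLine(IsValidUtf8(gb2312));    // False
        Console.WriteLine(IsValidUtf8(overlong));  // False
    }
}
```

Decoding the whole buffer allocates a string, so for large inputs the hand-written loop is likely the cheaper of the two.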

Here is the unit test code as well:

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    /// <summary>
    ///This is a test class for EncodingHelperTest and is intended
    ///to contain all EncodingHelperTest Unit Tests
    ///</summary>
    [TestClass()]
    public class EncodingHelperTest
    {
        /// <summary>
        ///  normal test for this method.
        ///</summary>
        [TestMethod()]
        public void IsTextUTF8Test()
        {
            for (int i = 0; i < 1000; i++)
            {
                // Always start with a non-ASCII char so neither encoding is pure ASCII.
                List<char> chars = new List<char>();
                chars.Add('中');

                List<UnicodeCategory> temp = new List<UnicodeCategory>();
                Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));

                for (int j = 0; j < 255; j++)
                {
                    char ch = (char)rd.Next(0xFFFF);
                    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
                    if (uc == UnicodeCategory.Surrogate || // A single surrogate cannot be encoded correctly.
                        uc == UnicodeCategory.PrivateUse || // Private use blocks should be excluded.
                        uc == UnicodeCategory.OtherNotAssigned
                        )
                    {
                        j--;
                    }
                    else
                    {
                        chars.Add(ch);
                        temp.Add(uc);
                    }
                }

                String str = new String(chars.ToArray());

                byte[] inputStream = Encoding.UTF8.GetBytes(str);
                bool expected = true;
                bool actual;
                actual = EncodingHelper.IsTextUTF8(ref inputStream);
                Assert.AreEqual(expected, actual, String.Format("UTF8_Assert Fails at:{0}", str));

                // Code page 932 is Shift-JIS.
                inputStream = Encoding.GetEncoding(932).GetBytes(str);
                expected = false;

                actual = EncodingHelper.IsTextUTF8(ref inputStream);
                Assert.AreEqual(expected, actual, String.Format("ShiftJIS_Assert Fails at:{0}", str));
            }
        }

        /// <summary>
        ///   check with All ASCII chars
        /// </summary>
        [TestMethod]
        public void IsTextUTF8Test_AllASCII()
        {
            String str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";

            byte[] inputStream = Encoding.UTF8.GetBytes(str);
            bool expected = false;
            bool actual;
            actual = EncodingHelper.IsTextUTF8(ref inputStream);
            Assert.AreEqual(expected, actual, String.Format("AllASCII_Assert Fails at:{0}", str));
        }
    }

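Also:

If you only need to determine whether a file is UTF-8 encoded, you do not necessarily have to use this method: a file saved as UTF-8 often begins with a three-byte BOM (EF BB BF) that marks the file as UTF-8. The BOM is optional, though, so a file without one may still be UTF-8.

Reference:

Wikipedia: http://en.wikipedia.org/wiki/UTF-8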
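The BOM test can be sketched as follows. The class `BomCheck` and the helper `StartsWithUtf8Bom` are hypothetical names for this illustration; the BOM bytes themselves come from `Encoding.UTF8.GetPreamble()`.

```csharp
using System;
using System.Text;

public static class BomCheck
{
    // Hypothetical helper: true when the buffer starts with the UTF-8 BOM (EF BB BF).
    public static bool StartsWithUtf8Bom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (bytes == null || bytes.Length < bom.Length)
        {
            return false;
        }

        for (int i = 0; i < bom.Length; i++)
        {
            if (bytes[i] != bom[i])
            {
                return false;
            }
        }

        return true;
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
        byte[] withoutBom = { 0x41, 0x42, 0x43 };

        Console.WriteLine(StartsWithUtf8Bom(withBom));     // True
        Console.WriteLine(StartsWithUtf8Bom(withoutBom));  // False
    }
}
```

In practice the first few bytes of a file could be read and passed to this helper. A result of false only means there is no BOM, so a content check such as IsTextUTF8 is still needed in that case.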
