Вопрос: Как получить согласованное байтовое представление строк в C # без ручного указания кодировки?


Как конвертировать stringк byte[]в .NET (C #) без указания конкретной кодировки вручную?

Я собираюсь зашифровать строку. Я могу зашифровать его без преобразования, но мне все равно хотелось бы знать, почему здесь начинается кодирование.

Кроме того, почему кодирование должно приниматься во внимание? Не могу ли я просто получить, в каких байтах хранится строка? Почему существует зависимость от кодировок символов?


1896


источник


Ответы:


В отличие от ответов здесь, вам НЕ нужно беспокоиться о кодировании если байты не нужно интерпретировать!

Как вы уже упоминали, ваша цель - просто «получить, какие байты хранится в строке» ,
(И, конечно, чтобы иметь возможность перестроить строку из байтов.)

Для этих целей я честно не понять, почему люди продолжают говорить вам, что вам нужны кодировки. Вы, конечно, НЕ должны беспокоиться о кодировках для этого.

Просто сделайте это вместо этого:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

Пока ваша программа (или другие программы) не пытается интерпретировать байты как-то, что вы, очевидно, не упоминали, что собираетесь намереваться, тогда есть ничего неправильно с этим подходом! Беспокойство по поводу кодировок просто делает вашу жизнь более сложной без какой-либо реальной причины.

Дополнительное преимущество такого подхода:

Не имеет значения, содержит ли строка недопустимые символы, потому что вы все равно можете получить данные и восстановить исходную строку в любом случае!

Он будет кодироваться и декодироваться одинаково, потому что вы просто глядя на байты ,

Однако, если вы использовали конкретную кодировку, это дало бы вам проблемы с кодированием / расшифровкой недопустимых символов.


1716



Это зависит от кодировки вашей строки ( ASCII , UTF-8, , ...).

Например:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

Небольшая выборка, почему кодирование имеет значение:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'

ASCII просто не оборудован для обработки специальных символов.

Внутри .NET Framework использует UTF-16 для представления строк, поэтому, если вы просто хотите получить точные байты, которые использует .NET, используйте System.Text.Encoding.Unicode.GetBytes (...),

Видеть Кодировка символов в .NET Framework (MSDN) для получения дополнительной информации.


1048



Принятый ответ очень, очень сложный. Используйте включенные классы .NET для этого:

const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);

Не изобретайте велосипед, если вам не нужно ...


241



BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());

103



Вы должны учитывать кодировку, потому что 1 символ может быть представлен 1 или больше байты (до 6), а разные кодировки будут обрабатывать эти байты по-разному.

У Джоэля есть проводка по этому поводу:

Абсолютный минимум Каждый разработчик программного обеспечения Абсолютно, положительно должен знать о юникоде и наборах символов (никаких оправданий!)


78



This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the later first.

Common Need

Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte you still have a character set and encoding. For most usages, you'd know which character set and encoding you need and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.

// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")

The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution or skipping. The default policy is to substitute a '?'.

// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100")); 
                                                      // -> "You win ?100"

Clearly, conversions are not necessarily lossless!

Note: For System.String the source character set is Unicode.

The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.

That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what an encoding is.

Specific Need

Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"

He doesn't want any conversion.

From the C# spec:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:

Encoding.Unicode.GetBytes(".NET String to byte array")

But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:

".NET String to byte array".ToCharArray()

That doesn't get us the desired datatype but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And, it too explicitly uses encoding-specific code: the datatype System.Char.

The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:

[For] an expression of type string, ... the initializer computes the address of the first character in the string.

To do so, the compiler writes code skip over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.

// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
    if (s == null) return null;
    var codeunitCount = s.Length;
    /* We know that String is a sequence of UTF-16 codeunits 
       and such codeunits are 2 bytes */
    var byteCount = codeunitCount * 2; 
    var bytes = new byte[byteCount];
    fixed(void* pRaw = s)
    {
        Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
    }
    return bytes;
}

As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.


74



Just to demonstrate that Mehrdrad's sound answer works, his approach can even persist the unpaired surrogate characters(of which many had leveled against my answer, but of which everyone are equally guilty of, e.g. System.Text.Encoding.UTF8.GetBytes, System.Text.Encoding.Unicode.GetBytes; those encoding methods can't persist the high surrogate characters d800 for example, and those just merely replace high surrogate characters with value fffd ) :

using System;

class Program
{     
    static void Main(string[] args)
    {
        string t = "爱虫";            
        string s = "Test\ud800Test"; 

        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);

        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }    
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }        
}

Output:

T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74

Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes, they will merely replace high surrogate characters with value fffd

Every time there's a movement in this question, I'm still thinking of a serializer(be it from Microsoft or from 3rd party component) that can persist strings even it contains unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when every now and then there's somebody commenting on my answer that it's flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.

Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter

谢谢!


34



Try this, a lot less code:

System.Text.Encoding.UTF8.GetBytes("TEST String");

34



The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?

The answer is in two parts.

First of all, the bytes used internally by the string class don't matter, and whenever you assume they do you're likely introducing a bug.

If your program is entirely within the .Net world then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to worry about transmitting the data. You don't worry about the actual bytes any more: the Serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end, even if it's the same encoding used internally by .Net.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you it's just not important compared to making sure that your output is understood at the other end, and to guarantee that you must be explicit with your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding, and get that performance savings.

Which brings me to the second part... picking the Unicode encoding is telling .Net to use the underlying bytes. You do need to pick this encoding, because when some new-fangled Unicode-Plus comes out the .Net runtime needs to be free to use this newer, better encoding model without breaking your program. But, for the moment (and forseeable future), just choosing the Unicode encoding gives you what you want.

It's also important to understand your string has to be re-written to wire, and that involves at least some translation of the bit-pattern even when you use a matching encoding. The computer needs to account for things like Big vs Little Endian, network byte order, packetization, session information, etc.


34



Well, I've read all answers and they were about using encoding or one about serialization that drops unpaired surrogates.

It's bad when the string, for example, comes from SQL Server where it was built from a byte array storing, for example, a password hash. If we drop anything from it, it'll store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer drops an exception on any unpaired surrogate it finds).

So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, and it has bug in it and is only one way, so I've fixed the bug and written back procedure. Here you are, future googlers:

public static byte[] StringToBytes(string str)
{
    byte[] data = new byte[str.Length * 2];
    for (int i = 0; i < str.Length; ++i)
    {
        char ch = str[i];
        data[i * 2] = (byte)(ch & 0xFF);
        data[i * 2 + 1] = (byte)((ch & 0xFF00) >> 8);
    }

    return data;
}

public static string StringFromBytes(byte[] arr)
{
    char[] ch = new char[arr.Length / 2];
    for (int i = 0; i < ch.Length; ++i)
    {
        ch[i] = (char)((int)arr[i * 2] + (((int)arr[i * 2 + 1]) << 8));
    }
    return new String(ch);
}

22



Also please explain why encoding should be taken into consideration. Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?!!!

Because there is no such thing as "the bytes of the string".

A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters to bytes. How do you do that? Here's where encodings come to the scene.

An encoding is nothing but a convention to translate logical characters to physical bytes. The simplest and best known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, being any of the Unicode flavours the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)


18



C# to convert a string to a byte array:

public static byte[] StrToByteArray(string str)
{
   System.Text.UTF8Encoding  encoding=new System.Text.UTF8Encoding();
   return encoding.GetBytes(str);
}

18