Interesting. Unfortunate that the deprecation notice doesn't include much rationale. I found at least one mail thread about it[1], which seems to confirm that the main thought was that semantic information about text should be handled at a higher layer (e.g. XML). I can understand that argument for a general purpose tagging mechanism, but language and glyphs are strongly semantically linked.
(Somewhat ironically, the previous thread on that mailing list is about the struggles of case folding in a general fashion across multiple language scripts[2])
Edit: I also found [3], which offers the following:
----
- Most of the data sources used to assemble the documents on the Web will not contain these characters; producers, in the process of assembling or serializing the data, will need to introspect and insert the characters as needed—changing the data from the original source. Consumers must then deserialize and introspect the information using an identical agreement. The consumer has no way of knowing if the characters found in the data were inserted by the producer (and should be removed) or if the characters were part of the source data. Overzealous producers might introduce additional and unnecessary characters, for example adding an additional layer of bidi control codes to a string that would not otherwise require it. Equally, an overzealous consumer might remove characters that are needed by or intended for downstream processes.
- Another challenge is that many applications that use these data formats have limitations on content, such as length limits or character set restrictions. Inserting additional characters into the data may violate these externally applied requirements, and interfere with processing. In the worst case, portions (or all of) the data value itself might be rejected, corrupted, or lost as a result.
- Inserting additional characters changes the identity of the string. This may have important consequences in certain contexts.
- Inserting and removing characters from the string is not a common operation for most data serialization libraries. Any processing that adds language or direction controls would need to introspect the string to see if these are already present or might need to do other processing to insert or modify the contents of the string as part of serializing the data.
----
Other than #3 (the one about string identity), I find these wholly unpersuasive. And even #3 isn't a particularly strong reason, considering that programmatic processors already have to deal with unstable string identity anyway due to case folding.
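To make the parallel concrete, here's a sketch in Python (my own illustration, not from the cited documents): inserting bidi control characters does change a string's identity under naive comparison, but consumers already have to normalize identity differences for case, and a comparable stripping step would handle the control characters.

```python
import re

# Wrapping a string in LRI...PDI isolate controls changes its identity,
# even though the visible text is unchanged.
plain = "hello"
wrapped = "\u2066" + plain + "\u2069"  # LRI ... PDI

assert plain != wrapped           # naive equality breaks
assert len(plain) != len(wrapped)

# Consumers already cope with an analogous identity problem via case folding:
assert "Straße".casefold() == "STRASSE".casefold()

# A hypothetical normalization step could strip bidi controls before comparing,
# just as casefold() normalizes case:
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069\u200e\u200f]")

def strip_bidi(s: str) -> str:
    """Remove explicit bidi formatting characters (illustrative helper)."""
    return BIDI_CONTROLS.sub("", s)

assert strip_bidi(plain) == strip_bidi(wrapped)
```

The regex covers the embedding/override controls (U+202A–U+202E), the isolates (U+2066–U+2069), and the LRM/RLM marks (U+200E–U+200F); whether stripping is appropriate is of course exactly the producer/consumer ambiguity that point #1 above complains about.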
[1] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0039....
[2] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0038....
[3] https://www.w3.org/TR/string-meta/