Equivalence of Unicode strings is strange (2016)

Lobsters Hottest News

Summary

Unicode string equivalence is complex, especially with collations, leading to surprising results like deletion of control characters and non-deterministic grouping. The author discusses challenges in implementing proper Unicode support in database systems.

<p><a href="https://lobste.rs/s/awaoc2/equivalence_unicode_strings_is_strange">Comments</a></p>
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:54 AM

# Equivalence of Unicode strings is strange Source: [http://databasearchitects.blogspot.com/2016/08/equivalence-of-unicode-strings-is.html](http://databasearchitects.blogspot.com/2016/08/equivalence-of-unicode-strings-is.html) Originally[HyPer](http://wwww.hyper-db.com/)had a very simple model for strings: We made sure that all strings are valid[UTF\-8](https://en.wikipedia.org/wiki/UTF-8), but otherwise did not really care about the intrinsics of Unicode\. And that is actually a quite sane model[for most applications](http://utf8everywhere.org/)\. Usually we do not care about the precise string structure anyway, and in the few places that we do \(e\.g\.,[strpos](https://www.postgresql.org/docs/9.4/static/functions-string.html)and[substr](https://www.postgresql.org/docs/9.4/static/functions-string.html)\), we add some extra logic to handle UTF\-8 correctly\. That was all good until users started complaining about sort orders\. When sorting UTF\-8 naively, we sort by code point value, which is often ok but not always what users want\. So we added support for[collations:](https://www.postgresql.org/docs/9.1/static/collation.html) ``` select * from foo order by x collate "de_DE"; ``` And that is where life starts getting interesting\. The collate statement can either be given explicitly at any place in the query, as shown above, or added in a*create table*statement, giving a default collation for a column\. So basically every value that is ordered or compared within a SQL statement can have an associated collation\. Initially I naively though that this would just affect the comparison operation, like calling[strcoll](http://en.cppreference.com/w/cpp/string/byte/strcoll)instead of[strcmp](http://en.cppreference.com/w/cpp/string/byte/strcmp), but life is unfortunately much more complex\. First, the[Unicode collation algorithm](http://www.unicode.org/reports/tr10/)is quite involved, translating the input sequence into a sequence of 3 \(or 4\) level of weights before comparisons, and second, some insane database systems default to[case\-insensitive collations](https://msdn.microsoft.com/en-us/library/ms143726.aspx), and some users actually want that behavior\. Why is case\-insensitivity insane? Because it leads to strange semantics\. First of all, the whole Unicode weighting mechanism is strange\. Some of the strangeness can be hidden by using[ICU](http://site.icu-project.org/), but you still get strange results\. Consider for example the following two strings \(\\u is a Unicode escape\): abc abc\\u007F Clearly they are different, right? Well, if you ask the[ICU Collation Demo](http://demo.icu-project.org/icu-bin/collation.html)they are not, they are considered equal\. The reason for that is the entry in the[DUCET table](http://www.unicode.org/Public/UCA/latest/allkeys.txt), which for 007F says ``` 007F ; [.0000.0000.0000] # DELETE (in ISO 6429) ``` So this character must be ignored for comparisons\. And there are other code points that are relevant for plain comparisons, but not for accent\-insensitive collations, etc\. A lot of fun, in particular if you want to implement hash\-based algorithms\. But technical problems aside, is that really what users want? Do users expect these two strings to be equal, just as they apparently expect 'abc' and 'äbc' to compare to equal in accent\-insensitive collation? And what about queries? Consider for example ``` select x,count(*) from (values('abc'),('ABC'),('äbc')) s(x) group by x collate "de_DE_ci_ai"; ``` which perform the group by in a case\-insensitive, accent\-insensitive manner, which means that all three strings compare as equal\. What is the result of that? abc 3? ABC 3? äbc 3? All three would be valid answers, as the three are "equal" according to the chosen collation\. And the result might even be non\-deterministic, if the query is parallelized across threads and the first value wins\. Do users really want that? Well, apparently they do, at least I was told that, and some systems even default to case\-insensitive behavior\. But I find that strange, the semantic of these queries can be quite puzzling\.

Similar Articles

Unicode composition for filenames

Lobsters Hottest

This document discusses the challenges of Unicode filename composition (NFC vs NFD) across different operating systems in the context of the Subversion version control system, and proposes solutions for handling these differences.

My Favorite Bugs: Invalid Surrogate Pairs

Hacker News Top

A blog post recounting a bug where inserting adjacent multi-byte emoji caused a splice in a CRDT library, splitting a surrogate pair and silently breaking collaborative editor synchronization.

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

arXiv cs.CL

This paper investigates the relationship between training scale and UTF-8 generation reliability in byte-level language models, finding that UTF-8 validity convergence lags behind perplexity by roughly a factor of two. The authors introduce evaluation protocols to isolate structural validity and show that reliable UTF-8 generation is a distinct capability requiring separate evaluation.