Hacker News

One thing I don't like about Rust is how taking a slice of a string can cause a runtime panic if the start or end of the slice ends up intersecting a multi-byte UTF-8 char.

I would prefer it if this feature didn't exist at all rather than cause runtime panics.

https://play.rust-lang.org/?gist=e02ce5e9aacfee3a2b4917d5624...



It's not a problem in practice, because you'd use something like the `.char_indices()` iterator, or a result from a substring search, etc., to get correct offsets in the first place.

It's not useful to blindly read at random offsets in UTF-8 strings. If it didn't panic, you'd get garbage. If offsets were automatically moved to skip over garbage, you wouldn't know what you're getting, and your overall algorithm would likely end up with nonsense (duplicated or skipped chars).

For algorithms that don't care about characters or UTF-8 validity, there's zero-cost `.as_bytes()`.
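A minimal sketch of that first point, using only the standard library: offsets produced by `.char_indices()` always fall on character boundaries, so slicing with them cannot panic.

```rust
fn main() {
    let s = "ab早c";
    // char_indices() yields (byte_offset, char) pairs; every offset
    // it produces is a valid slicing boundary.
    for (i, c) in s.char_indices() {
        println!("byte {} -> {:?}", i, c); // 0 'a', 1 'b', 2 '早', 5 'c'
    }
    let (offset, _) = s.char_indices().nth(2).unwrap();
    assert_eq!(&s[offset..], "早c"); // offset 2 is a boundary: no panic
}
```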


Couldn't syntax like `a_string[..3]` be made to result in compilation errors in Rust? Since that'd almost always be a bug? (right?)

And in the rare cases when it's not a bug, one can just use `as_bytes`, which would be good to do in any case, to indicate to other humans that this is not a bug.

B.t.w. I love the error message `[..3]` generates: "thread 'main' panicked at 'byte index 3 is not a char boundary; it is inside '早' (bytes 2..5) of `ab早`'" — I've never seen such easy to understand error messages in any language (except for in a few cases in Scala).


We could have never implemented Index for String, sure. We have though, so removing it would be a breaking change.


Ok. (Maybe a compile-time warning that doesn't break the build?)


That could be done, if it was agreed that this is a mis-feature. I don't think there's agreement on that, though.


What does zero-cost mean in this context? It must cost something to run, no? Or is it basically a compiler hint instructing the next function to treat the data as pure bytes?


In this particular context, you can think of going from a `&str` to a `&[u8]` via `string.as_bytes()` as a safe cast. The in-memory representation remains the same, and the function call will almost certainly be inlined because its implementation is trivial.


It is a common pattern in Rust to use [] for operations that are expected to succeed (and will panic otherwise), and a method returning Option or Result for operations that can fail.

e.g. my_hashmap["foo"] will panic at runtime if the key "foo" is not present, or return the associated value if it is. But my_hashmap.get("foo") will return None if "foo" is not present and Some(value) if it is.
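The same example as a small runnable sketch (standard library only):

```rust
use std::collections::HashMap;

fn main() {
    let mut m = HashMap::new();
    m.insert("foo", 1);

    assert_eq!(m["foo"], 1);            // [] panics if the key is absent
    assert_eq!(m.get("foo"), Some(&1)); // .get() returns an Option instead
    assert_eq!(m.get("bar"), None);
}
```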


What's the point of the [] version then? It seems inherently more dangerous, and Rust emphasizes safety. I know it wants to be pragmatic as well as safe, but this seems like a strange default.


There are a few things that come into play here:

First of all, panics are perfectly safe. None of this has to do with safety guarantees.

Second, the [] syntax is controlled by the Index trait, which returns an &T, not an Option<&T>. It does this due to Rust's error handling philosophy. There are two kinds of errors: recoverable and unrecoverable. When something shouldn't fail unless there's a bug, you shouldn't be using Option/Result; you should panic. When something may normally fail, and you want to be able to handle that explicitly, you should use Option/Result.

If [] always returned an Option, you'd be seeing tons and tons and tons of unwraps. It's not the right default here. However, that's why the .get method also exists: If you do think that this may fail, but not due to a bug, then you should use .get instead, which does give you an option.
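On strings, the two spellings sit side by side; a minimal stdlib-only sketch:

```rust
fn main() {
    let s = "ab早";
    // Fallible spelling: returns None instead of panicking on a bad boundary.
    assert_eq!(s.get(..3), None);       // byte 3 is inside '早'
    assert_eq!(s.get(..2), Some("ab")); // byte 2 is a char boundary
    // Infallible spelling, for when a bad index would mean a bug:
    // &s[..3] would panic with "byte index 3 is not a char boundary".
}
```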

TL;DR: everything is tradeoffs, and we picked a specific set of them, and that's how they all play out together.

Personal commentary: this is the kind of thing that's largely concerning until you actually use the language more, IMHO. Dealing with Options all the time here would feel really bad. Consider the other sub-thread about floats; it often feels like boilerplate for no good reason. That would introduce this for every single time you want to index something, which is a very common operation.


Does Rust support a monadic coding style (like Haskell "do" blocks or F# computation expressions)? That would allow you to work with Options without having to explicitly unwrap them.


Yes, there are a bunch of methods that let you do this, though with a bit more syntax than do notation; for example, and_then is pretty much bind.


Not generic monads, but it does have the `?` operator for Option (similar to Haskell Maybe) and Result (similar to Haskell Either) which would support a similar syntax to using `do` with the Maybe monad
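A sketch of what that looks like in practice (the `parse_pair` helper is hypothetical, for illustration only):

```rust
// `?` on an Option returns None early, much like `do` in the Maybe monad.
fn parse_pair(s: &str) -> Option<(i32, i32)> {
    let (a, b) = s.split_once(',')?; // None if there is no comma
    Some((a.trim().parse().ok()?, b.trim().parse().ok()?))
}

fn main() {
    assert_eq!(parse_pair("3, 4"), Some((3, 4)));
    assert_eq!(parse_pair("oops"), None);
}
```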


Scala programmers would recognize this as the difference between () and .get(). I hope Rust had copied Scala's syntax: it's much cleaner, rather than trying to be nice to the established systems languages (C/C++).

This would also free up [] to be used for generics and avoid syntactic warts like `::<>` parsing.


We did have [] for generics, but we changed it back.

It doesn't remove those warts, it moves them.


Scala and C++ syntax are rather similar, no?


Python does something similar with [] vs .get()


Taken from C++ I guess, which does the same thing.


TIL! I'm still learning Rust so it's good to learn this now! Thanks!



This seems specious to me. The only way to get an invalid index in a string in any language is that you either have an array index arithmetic error or you are blindly operating on a string you haven't validated.

If you want all the data after a : character, you slice on the index of the :. The character after it is going to be the beginning of a UTF-8 character.

You do not under any circumstances guess that the colon is at position 6 in the string. That's not safe. Why are you going cowboy in a language that is so obsessed with safety?
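Concretely, the safe pattern looks something like this (a minimal sketch):

```rust
fn main() {
    let line = "météo:23°C";
    // find() returns a byte offset that is guaranteed to lie on a char
    // boundary, so slicing with it never panics, even in non-ASCII text.
    if let Some(pos) = line.find(':') {
        assert_eq!(&line[pos + 1..], "23°C");
    }
}
```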


I just realized that I have a bug in my GPS driver. It operates on ASCII data, so the [] operator is safe, BUT the data can be corrupted (low chance, but non-zero), and corruption can form a valid multibyte character, so my code will panic on it while trying to parse and validate an NMEA message.


Panicking on parsing corrupted data seems like a feature to me...

It's like the default rule in a lexer, if it ever gets to it then it's an unrecognized character and lexing stops so error handling can proceed.

--edit--

Which I now realize was probably your point.


Truncating a string to fit in a fixed-size storage field is probably the most common reason to split at a particular byte position. If you’re throwing data away anyway, you probably don’t care too much about the little bit of corruption.

Granted, this is certainly incorrect but has little to do with safety, especially if the downstream code has to revalidate everything anyway.


String slicing using byte indices has to exist in some form, since it is the only thing that is efficient (O(1)). But, I guess it could have used syntax other than somestring[...].


It could slice on bytes and return a slice of bytes since the String type is a wrapper over Vec<u8>.


That means one loses all the conveniences and guarantees of the string types and, in many cases, forces an immediate revalidation of the byte slice as UTF-8 to get back to a &str, which is O(n). Furthermore, this is also rather clunky.

I suppose one could have it return StrWithInvalidSurrounds, where just the first (at most) 3 and last (at most) 3 bytes might be invalid, which would then allow for O(1) revalidation to a &str, and even other operations like continuing to slice... But this is even more clunky for actual use!

I think a moderately less clunky API might have been to not use integers for byte indexing, but instead some ByteIndex wrapper type that string operations return, meaning one can't just write `s[..5]` in an attempt to get the first 5 characters of the string.

(Also, there's str::get that returns an Option: https://doc.rust-lang.org/std/primitive.str.html#method.get )


If you want to just slice on bytes without any String semantics, why not use Vec<u8> then? String implies that it is, well, a string.


Does this bug exist because it would be too expensive to check every string before slicing? (Being Rust-ignorant), can you not type a binary as UTF-8? Are there 2 versions of string functions, fast ones that assume ASCII and slow ones that assume UTF-8?


Every string is checked. But UTF-8 is a multi-byte encoding, and slicing works on bytes, so if you slice in the middle of a multi-byte character, you may get nonsense. The error happens because of this checking, not in spite of it.

String always assumes full UTF-8. You could make an AsciiString type if you wanted, but it's not provided by the standard library.


The obvious follow up question would be: so why is slicing a string a byte-wise operation and not a character-wise operation? If a string is an array of characters, why does it let me refer to individual bytes without explicitly casting it to a byte array? How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.


> How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.

It's exactly the opposite of what you expect. Getting the nth codepoint is often (not always) semantically incorrect since a codepoint isn't necessarily one character. Multiple codepoints might combine to form one character. (In Unicode, these are called grapheme clusters.)

Byte offsets are used a ton because you might often have the index to a position in the string from some routine, like, say, a search[1].

I've been working on text related things in both Rust and Go for several years. Both languages got this part of their strings exactly right given that their representation in memory is always a sequence of bytes.

[1] - https://doc.rust-lang.org/std/primitive.str.html#method.find


I still think that using the common [] operator for this is a mistake. Strings shouldn't offer [] at all, and instead should provide methods like codepoints(), bytes(), grapheme_clusters() etc for indexing, slicing, and iterating.

The reason being that the behavior of [] for string varies widely in different languages, and so this is something that's best made explicit, both to force the author of the code to consider whether their assumptions are valid and reasonable for what they're trying to do, and to give additional context to anyone else reading the code.

As it is, I suspect a common class of bugs for Rust will be with people assuming that [] slices codepoints, because it seems to work that way for ASCII.


I'm quite thankful that Rust has succinct notation for slicing strings. Do note that `string[n]` is not supported, so you'll stumble over an inconsistency in your mental model quite quickly if you think slicing is by codepoint.


The lack of direct indexing is a good point. But strings aren't sliced on byte boundaries all that often either - it's far more common to use higher-level APIs like split(), that deal with offsets under the hood, so that sugar mostly ends up being used in the implementation of such APIs. And, really, would something like s.slice_u8(x, y) be that unwieldy over s[x..y]?


How often do you actually want the nth character as opposed to the nth grapheme?

There is pretty much no case where indexing by character actually makes sense because it is almost always incorrect and it is always inefficient.

Indexing by byte is rarely useful, but it does have some usefulness since it can be used correctly and efficiently: you can find the next or previous character boundary by searching at most four bytes for one that is not a continuation byte (i.e. whose top two bits are not 10). If you want to do something like get a &str that would fit in an n-byte buffer, then byte indices will let you do that efficiently and correctly.
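That buffer-fitting case can be sketched with the stdlib's `is_char_boundary` (the `truncate_to` helper is hypothetical):

```rust
// Truncate a &str to at most `max` bytes without splitting a character,
// by backing up to the nearest char boundary.
fn truncate_to(s: &str, max: usize) -> &str {
    if s.len() <= max {
        return s;
    }
    let mut end = max;
    while !s.is_char_boundary(end) {
        end -= 1; // at most 3 steps: UTF-8 chars are at most 4 bytes
    }
    &s[..end]
}

fn main() {
    assert_eq!(truncate_to("ab早", 3), "ab"); // byte 3 is inside '早'
    assert_eq!(truncate_to("ab早", 5), "ab早");
}
```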


As stated below, indexing is an O(1) operation, and that would be an O(n) operation.

> If a string is an array of characters

It is not, it is an array (technically vector) of bytes.


Who cares if it's O(1) if it causes a panic? What good is high performance if it doesn't complete or isn't safe?

At the very least, shouldn't there be an O(n) method to do character-wise slicing?


Panics are safe. You expect the “I don’t have a bug” case to be fast.

You can, but it depends on what you mean by “character”, as that’s not a concept in Unicode. Every kind of thing you could mean has a method, specific to it, since they’re different things.

(char in Rust is a Unicode scalar value, and you can collect into a Vec<char> and then slice it, as an example of one of those things. And that’s still O(1) at the cost of using up to four times the memory.)
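For instance (a stdlib-only sketch):

```rust
fn main() {
    let s = "ab早c";
    // Collecting into Vec<char> buys O(1) indexing by scalar value,
    // at the cost of up to four bytes per char.
    let chars: Vec<char> = s.chars().collect();
    assert_eq!(chars[2], '早');
    let first_three: String = chars[..3].iter().collect();
    assert_eq!(first_three, "ab早");
}
```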


Why not this?

  fn main() {
      let a = "ab早".as_bytes();
      let a = &a[..3];
      println!("{:?}", a); // prints the first three bytes: [97, 98, 230]
  }


It's not clear to me what you're suggesting; is it that String shouldn't have supported indexing in the first place? That code does work, but you have a &[u8] not a &str.


But neither is AsciiString. It has an as_str method, but it's still a kludge.

This example was basically a suggestion to throw0u1t: if they want to cut in the middle of a UTF-8 sequence for whatever reason, they can [edit:] do it without extra crates.

What I don't understand is why slices are indexed in bytes and not in objects. If String has the ability to check that we're cutting in the middle of a character sequence, why doesn't it provide the ability to take 3 fully formed characters?


I think the rust designers want to keep the implicit contract that indexing into a string is fast and O(1).

If you want to find the one millionth codepoint of a UTF-8-encoded string, you have to visit more or less every byte of the string (1).

If, on the other hand, you want to find the codepoint that covers the millionth byte, you have to read at most four bytes. Read the millionth byte, and there are three cases:

- it's a full codepoint. If so, you're done.

- it is the first byte of a multi-byte codepoint. If so, read forwards in the string for up to 3 continuation bytes.

- it is a continuation byte. If so, search backwards in the string for the first byte, then, if necessary, read forwards to find more continuation bytes.

So, that is O(1).

(1) you can skip continuation bytes, but these are typically rare.
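The constant-time lookup described above can be sketched with the stdlib's `is_char_boundary` (the `codepoint_start` helper is hypothetical):

```rust
// Find the byte offset where the codepoint covering byte `i` starts, by
// scanning backwards past continuation bytes (pattern 10xxxxxx).
// At most 3 steps, since a codepoint is at most 4 bytes: O(1).
fn codepoint_start(s: &str, mut i: usize) -> usize {
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

fn main() {
    let s = "ab早c"; // '早' occupies bytes 2..5
    assert_eq!(codepoint_start(s, 3), 2); // byte 3 is a continuation byte
    assert_eq!(codepoint_start(s, 5), 5); // byte 5 starts 'c'
}
```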


> What I don't understand is why slices are indexed in bytes and not in objects.

Slicing is an O(1) operation, and that would be an O(n) operation.


It does: `s.chars().take(3)`. It just does it with iterators rather than with indexes because that better communicates the performance characteristics.
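For example:

```rust
fn main() {
    let s = "ab早c";
    // Take the first three codepoints; the cost is proportional to the
    // length of the prefix, and the iterator chain makes that visible.
    let prefix: String = s.chars().take(3).collect();
    assert_eq!(prefix, "ab早");
}
```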


I think he's suggesting that slicing on strings should be by character, and if you want to slice on bytes, you should explicitly ask to treat the string as a byte array. It makes more sense semantically, and it's safe.


Slicing on characters is a linear time operation and indexing is meant to be cheap.


That seems like taking it too far. It's like using pointer arithmetic to index a linked list on the assumption that the nodes happen to be allocated contiguously in memory. I mean, I guess the thinking is, indexing a Unicode string isn't cheap, but indexing strings used to be cheap once upon a time, when strings were encoded in fixed one-byte-per-character representations, so let's pretend that's still the case and panic if it doesn't work out.... That's weirdly antithetical to Rust's purported focus on safety.

Also, you can get the same performance from an operation that returns a byte array instead of a string. If that kind of performance is what you want, then a Unicode string is simply not the right type to use.


Indexing a Unicode string is cheap... if you have a byte index. If you want to count out some fixed number of codepoints, then of course you've just moved the cost to calculating the corresponding byte index. But counting codepoints is almost always the wrong thing to do anyway [1]. In practice, it's more common to obtain indices by inspecting the string itself, e.g. searching for a substring or regex match. In that case, it's faster for the search to just return a byte index; there's no benefit to having it return a codepoint index, and then having to do an O(n) lookup when you try to use the index. And byte indices obtained that way will always be valid character boundaries, so you can use [] without worrying about panics.

You suggest just using a byte array instead, but then you'd lose the guarantee that what you're working with is valid Unicode. Contrary to your assertion, it is useful to have a type that provides that guarantee, yet which can still be operated on efficiently.

[1] https://manishearth.github.io/blog/2017/01/14/stop-ascribing...


Safety is about memory safety. Immediately exiting your program is about as memory safe as it gets.


Panics are not unsafe. Panic exists in Rust because they are safe. If you don't want a panic on index, just don't index.

Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway, because it is an abstraction of text that doesn't provide support to the notion that a "character" is more fundamental than a word or paragraph, etc. Rust's string slicing exists solely to make ASCII text easy to handle. If your text is not ASCII, then you shouldn't be slicing it at all. Thus the panic.


> Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway

If that's true, isn't it the job of a type system to help avoid such nonsensical operations? If "slice" only makes sense for byte arrays and ASCII strings, it could be provided on those types without being defined on UTF-8 strings.

> Panics are not unsafe. Panic exists in Rust because they are safe.

That's "safe" by a very limited definition of safety. It's one step up from undefined behavior, granted, but it's not a very high standard. In practice, in most programs, you'd want to ensure that such a panic would never happen, and personally I think the language's unhelpfulness in that regard is a wart.


>If that's true, isn't it the job of a type system to help avoid such nonsensical operations?

It's not strictly true, because there are situations where you want to slice UTF-8. For instance, if you already know where the code point boundaries are for newlines. But if you know that, then you've run something like a regex with >O(1) behavior and you certainly wouldn't want string slicing to do redundant work.

>That's "safe" by a very limited definition of safety

That's the definition of safe that is used. Safety in the context of Rust means memory safety. (Division can panic, btw.) If you don't see why undefined behavior is so much worse than a panic, then do some research on it. If you want programs that never fail, you need a comprehensive plan that takes into account things like hardware failure. A programming language can't do that.


I think that's too extreme. There are many legitimate reasons to slice non-ASCII text - for example, to split it on newlines.


That's not trivial and different languages vary in how they handle new line characters even. https://stackoverflow.com/questions/44995851/how-do-i-check-...


You can still split non-ASCII text on ASCII newlines, and quite often that's exactly what needs to be done.


And usually, you don't want it to cost O(n) on top of whatever parser you ran to find those newlines.


Go indexes strings by byte, even though there's a builtin type called rune which represents a Unicode code point. This is yet another footgun. Is there a language that doesn't handle this poorly?

https://play.golang.org/p/CkBp0w8T621


In Rust, you're supposed to use `unicode-segmentation`[1] if you need to split on logical characters (grapheme clusters in the Unicode standard). Otherwise, the iterator `.bytes()` emits raw bytes, and `.chars()` emits Unicode code points.

Basically, string indexing is a lot harder than it seems at first glance, depending on what you want.


One nitpick: `.chars`[1] gives you an iterator[2] of `char`s[3], each of which is a four-byte value holding one Unicode scalar value. This means that `"asdf".chars().collect::<Vec<char>>()` will have a different size to `"asdf"` and `"asdf".chars().as_str()`. `.chars()` will never give you an incomplete codepoint, but it will give you incomplete characters, as you could have many c̶̼̟̏ó̷̘̉n̴̖̞̏̇t̸̡̃ĭ̸̻̬n̴̯͉̂͑ṵ̴̑a̷̛̫̳ẗ̸͕́i̷̱̫̓̋ǫ̸̑ǹ̶̼̅s̸̩̾̌ to represent what visually are a single char.

[1]: https://doc.rust-lang.org/std/string/struct.String.html#meth...

[2]: https://doc.rust-lang.org/std/str/struct.Chars.html

[3]: https://doc.rust-lang.org/std/primitive.char.html
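A stdlib-only illustration of the gap between codepoints and what a reader sees as one character ('é' built from 'e' plus a combining accent):

```rust
fn main() {
    let s = "e\u{0301}"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    assert_eq!(s.chars().count(), 2); // two codepoints...
    assert_eq!(s.len(), 3);           // ...three bytes...
    // ...but one grapheme cluster, i.e. one user-perceived character.
}
```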


> visually are a single char

IIRC that's what grapheme clusters are for.


UTF-8 is at odds with efficient array indexing. I like Python's approach where bytes and strings are distinct types, though I have no idea what it is doing under the hood.


Modern Python uses whatever representation is sufficient to ensure one unit per codepoint for a given string (which it can do on creation, since strings are immutable). So you get Latin-1, UCS-2 (no surrogate pairs), or UCS-4.

This is great for high-level code, but painful to work with from native code, because it usually needs some specific encoding to call into other libraries, and it's usually UTF-8 - so you need to re-encode all the time.


I actually had to work with Python strings at the C level recently, and their approach is pretty clever. IIRC, the runtime can take any common form of Unicode, and will store it. When you access that string, the accessor requests a specific encoding, and the runtime will convert if need be, and then store it in the string object.

So it handles the (very) common case of needing the same encoding multiple times (e.g. for all file paths on Windows), while not introducing too much overhead in memory or speed.

I could be mistaken on exact details though, especially since I recall there being multiple implementations even within py3.x.


Any idea how it handles indexing? Does it convert everything to 32 bit chars and ignore graphemes?


Go allows slicing UTF8 strings just fine: https://play.golang.org/p/eUQ5L58KwZy



