Hacker News

One thing I don't like about Rust is how taking a slice of a string can cause a runtime panic if the start or end of the slice ends up intersecting a multi-byte UTF-8 char.

I would prefer it if this feature didn't exist at all rather than cause runtime panics.

https://play.rust-lang.org/?gist=e02ce5e9aacfee3a2b4917d5624...



It's not a problem in practice, because you'd use something like the `.char_indices()` iterator, or a result from a substring search, etc., to get correct offsets in the first place.

It's not useful to blindly read at random offsets in UTF-8 strings. If it didn't panic, you'd get garbage. If offsets were automatically moved to skip over garbage, you wouldn't know what you're getting, and your overall algorithm would likely end up with nonsense (duplicated or skipped chars).

For algorithms that don't care about characters or UTF-8 validity, there's zero-cost `.as_bytes()`.
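A minimal sketch of that first point, using only the standard library: offsets produced by `.char_indices()` always fall on character boundaries, so slicing with them cannot panic.

```rust
fn main() {
    let s = "ab早c";
    // char_indices() yields (byte_offset, char) pairs; every offset
    // it produces is a valid slicing boundary.
    for (i, c) in s.char_indices() {
        println!("byte {} -> {:?}", i, c); // 0 'a', 1 'b', 2 '早', 5 'c'
    }
    let (offset, _) = s.char_indices().nth(2).unwrap();
    assert_eq!(&s[offset..], "早c"); // offset 2 is a boundary: no panic
}
```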


Couldn't syntax like `a_string[..3]` be made to result in compilation errors in Rust? Since that'd almost always be a bug? (right?)

And in the rare cases when it's not a bug, one can just use `as_bytes`, which would be good to do in any case, to indicate to other humans that this is not a bug.

B.t.w. I love the error message `[..3]` generates: "thread 'main' panicked at 'byte index 3 is not a char boundary; it is inside '早' (bytes 2..5) of `ab早`'" — I've never seen such easy to understand error messages in any language (except for in a few cases in Scala).


We could have never implemented Index for String, sure. We have though, so removing it would be a breaking change.


Ok. (Maybe a compile-time warning that doesn't break the build?)


That could be done, if it was agreed that this is a mis-feature. I don't think there's agreement on that, though.


What does zero-cost mean in this context? It must cost something to run, no? Or is it basically a compiler hint instructing the next function to treat the data as pure bytes?


In this particular context, you can think of going from a `&str` to a `&[u8]` via `string.as_bytes()` as a safe cast. The in-memory representation remains the same, and the function call will almost certainly be inlined because its implementation is trivial.


It is a common pattern in Rust to use [] for operations that are expected to succeed (and will panic otherwise), and a method returning Option or Result for operations that can fail.

e.g. my_hashmap["foo"] will panic at runtime if the key "foo" is not present, or return the associated value if it is. But my_hashmap.get("foo") will return None if "foo" is not present and Some(value) if it is.
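The same example as a small runnable sketch (standard library only):

```rust
use std::collections::HashMap;

fn main() {
    let mut m = HashMap::new();
    m.insert("foo", 1);

    assert_eq!(m["foo"], 1);            // [] panics if the key is absent
    assert_eq!(m.get("foo"), Some(&1)); // .get() returns an Option instead
    assert_eq!(m.get("bar"), None);
}
```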


What's the point of the [] version then? It seems inherently more dangerous, and Rust emphasizes safety. I know it wants to be pragmatic as well as safe, but this seems like a strange default.


There are a few things that come into play here:

First of all, panics are perfectly safe. None of this has to do with safety guarantees.

Second, the [] syntax is controlled by the Index trait, which returns an &T, not an Option<&T>. It does this due to Rust's error handling philosophy. There are two kinds of errors: recoverable and unrecoverable. When something shouldn't fail unless there's a bug, you shouldn't be using Option/Result; you should panic. When something may normally fail, and you want to be able to handle that explicitly, you should use Option/Result.

If [] always returned an Option, you'd be seeing tons and tons and tons of unwraps. It's not the right default here. However, that's why the .get method also exists: If you do think that this may fail, but not due to a bug, then you should use .get instead, which does give you an option.
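On strings, the two spellings sit side by side; a minimal stdlib-only sketch:

```rust
fn main() {
    let s = "ab早";
    // Fallible spelling: returns None instead of panicking on a bad boundary.
    assert_eq!(s.get(..3), None);       // byte 3 is inside '早'
    assert_eq!(s.get(..2), Some("ab")); // byte 2 is a char boundary
    // Infallible spelling, for when a bad index would mean a bug:
    // &s[..3] would panic with "byte index 3 is not a char boundary".
}
```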

TL;DR: everything is tradeoffs, and we picked a specific set of them, and that's how they all play out together.

Personal commentary: this is the kind of thing that's largely concerning until you actually use the language more, IMHO. Dealing with Options all the time here would feel really bad. Consider the other sub-thread about floats; it often feels like boilerplate for no good reason. That would introduce this for every single time you want to index something, which is a very common operation.


Does Rust support a monadic coding style (like Haskell "do" blocks or F# computation expressions)? That would allow you to work with Options without having to explicitly unwrap them.


Yes, there are a bunch of methods that let you do this, though with a bit more syntax than do notation; for example, and_then is pretty much bind.


Not generic monads, but it does have the `?` operator for Option (similar to Haskell Maybe) and Result (similar to Haskell Either) which would support a similar syntax to using `do` with the Maybe monad
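A sketch of what that looks like in practice (the `parse_pair` helper is hypothetical, for illustration only):

```rust
// `?` on an Option returns None early, much like `do` in the Maybe monad.
fn parse_pair(s: &str) -> Option<(i32, i32)> {
    let (a, b) = s.split_once(',')?; // None if there is no comma
    Some((a.trim().parse().ok()?, b.trim().parse().ok()?))
}

fn main() {
    assert_eq!(parse_pair("3, 4"), Some((3, 4)));
    assert_eq!(parse_pair("oops"), None);
}
```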


Scala programmers would recognize this as the difference between () and .get(). I hope Rust had copied Scala's syntax: it's much cleaner, rather than trying to be nice to the established systems languages (C/C++).

This would also free up [] to be used for generics and avoid syntactic warts like `::<>` parsing.


We did have [] for generics, but we changed it back.

It doesn't remove those warts, it moves them.


Scala and C++ syntax are rather similar, no?


Python does something similar with [] vs .get()


Taken from C++ I guess, which does the same thing.


TIL! I'm still learning Rust so it's good to learn this now! Thanks!



This seems specious to me. The only way to get an invalid index in a string in any language is that you either have an array index arithmetic error or you are blindly operating on a string you haven't validated.

If you want all the data after a : character, you slice on the index of the :. The character after it is going to be the beginning of a UTF-8 character.

You do not under any circumstances guess that the colon is at position 6 in the string. That's not safe. Why are you going cowboy in a language that is so obsessed with safety?
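Concretely, the safe pattern looks something like this (a minimal sketch):

```rust
fn main() {
    let line = "météo:23°C";
    // find() returns a byte offset that is guaranteed to lie on a char
    // boundary, so slicing with it never panics, even in non-ASCII text.
    if let Some(pos) = line.find(':') {
        assert_eq!(&line[pos + 1..], "23°C");
    }
}
```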


I just realized that I have a bug in my GPS driver. It operates on ASCII data, so the [] operator is safe, BUT the data can be corrupted (low chance, but non-zero), and corruption can form a valid multibyte character, so my code will panic on it while trying to parse and validate an NMEA message.


Panicking on parsing corrupted data seems like a feature to me...

It's like the default rule in a lexer, if it ever gets to it then it's an unrecognized character and lexing stops so error handling can proceed.

--edit--

Which I now realize was probably your point.


Truncating a string to fit in a fixed-size storage field is probably the most common reason to split at a particular byte position. If you’re throwing data away anyway, you probably don’t care too much about the little bit of corruption.

Granted, this is certainly incorrect but has little to do with safety, especially if the downstream code has to revalidate everything anyway.


String slicing using byte indices has to exist in some form, since it is the only thing that is efficient (O(1)). But, I guess it could have used syntax other than somestring[...].


It could slice on bytes and return a slice of bytes since the String type is a wrapper over Vec<u8>.


That means one loses all the conveniences and guarantees of the string types and, in many cases, forces an immediate revalidation of the byte slice as UTF-8 to get back to a &str, which is O(n). Furthermore, this is also rather clunky.

I suppose one could have it return StrWithInvalidSurrounds, where just the first (at most) 3 and last (at most) 3 bytes might be invalid, which would then allow for O(1) revalidation to a &str, and even other operations like continuing to slice... But this is even more clunky for actual use!

I think a moderately less clunky API might have been to not use integers for byte indexing, but instead some ByteIndex wrapper type that string operations return, meaning one can't just write `s[..5]` in an attempt to get the first 5 characters of the string.

(Also, there's str::get that returns an Option: https://doc.rust-lang.org/std/primitive.str.html#method.get )


If you want to just slice on bytes without any String semantics, why not use Vec<u8> then? String implies that it is, well, a string.


Does this bug exist because it would be too expensive to check every string before slicing? (Being Rust-ignorant), can you not type a binary as UTF-8? Are there 2 versions of string functions, fast ones that assume ASCII and slow ones that assume UTF-8?


Every string is checked. But UTF-8 is a multi-byte encoding, and slicing works on bytes, so if you slice in the middle of a multi-byte character, you may get nonsense. The error happens because of this checking, not in spite of it.

String always assumes full UTF-8. You could make an AsciiString type if you wanted, but it's not provided by the standard library.


The obvious follow up question would be: so why is slicing a string a byte-wise operation and not a character-wise operation? If a string is an array of characters, why does it let me refer to individual bytes without explicitly casting it to a byte array? How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.


> How often comparatively do you want the nth byte compared to the nth character? I would suspect that's pretty rare.

It's exactly the opposite of what you expect. Getting the nth codepoint is often (not always) semantically incorrect since a codepoint isn't necessarily one character. Multiple codepoints might combine to form one character. (In Unicode, these are called grapheme clusters.)

Byte offsets are used a ton because you might often have the index to a position in the string from some routine, like, say, a search[1].

I've been working on text related things in both Rust and Go for several years. Both languages got this part of their strings exactly right given that their representation in memory is always a sequence of bytes.

[1] - https://doc.rust-lang.org/std/primitive.str.html#method.find


I still think that using the common [] operator for this is a mistake. Strings shouldn't offer [] at all, and instead should provide methods like codepoints(), bytes(), grapheme_clusters() etc for indexing, slicing, and iterating.

The reason being that the behavior of [] for string varies widely in different languages, and so this is something that's best made explicit, both to force the author of the code to consider whether their assumptions are valid and reasonable for what they're trying to do, and to give additional context to anyone else reading the code.

As it is, I suspect a common class of bugs for Rust will be with people assuming that [] slices codepoints, because it seems to work that way for ASCII.


I'm quite thankful that Rust has succinct notation for slicing strings. Do note that `string[n]` is not supported, so you'll stumble over an inconsistency in your mental model quite quickly if you think slicing is by codepoint.


The lack of direct indexing is a good point. But strings aren't sliced on byte boundaries all that often either - it's far more common to use higher-level APIs like split(), that deal with offsets under the hood, so that sugar mostly ends up being used in the implementation of such APIs. And, really, would something like s.slice_u8(x, y) be that unwieldy over s[x..y]?


How often do you actually want the nth character as opposed to the nth grapheme?

There is pretty much no case where indexing by character actually makes sense because it is almost always incorrect and it is always inefficient.

Indexing by byte is rarely useful, but it does have some usefulness since it can be used correctly and efficiently: you can find the next or previous character boundary by searching at most four bytes for one that is not a continuation byte (i.e. whose top two bits are not 10). If you want to do something like get a &str that would fit in an n-byte buffer, then byte indices will let you do that efficiently and correctly.
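That buffer-fitting case can be sketched with the stdlib's `is_char_boundary` (the `truncate_to` helper is hypothetical):

```rust
// Truncate a &str to at most `max` bytes without splitting a character,
// by backing up to the nearest char boundary.
fn truncate_to(s: &str, max: usize) -> &str {
    if s.len() <= max {
        return s;
    }
    let mut end = max;
    while !s.is_char_boundary(end) {
        end -= 1; // at most 3 steps: UTF-8 chars are at most 4 bytes
    }
    &s[..end]
}

fn main() {
    assert_eq!(truncate_to("ab早", 3), "ab"); // byte 3 is inside '早'
    assert_eq!(truncate_to("ab早", 5), "ab早");
}
```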


As stated below, indexing is an O(1) operation, and that would be an O(n) operation.

> If a string is an array of characters

It is not, it is an array (technically vector) of bytes.


Who cares if it's O(1) if it causes a panic? What good is high performance if it doesn't complete or isn't safe?

At the very least, shouldn't there be an O(n) method to do character-wise slicing?


Panics are safe. You expect the “I don’t have a bug” case to be fast.

You can, but it depends on what you mean by “character”, as that’s not a concept in Unicode. Every kind of thing you could mean has a method, specific to it, since they’re different things.

(char in Rust is a Unicode scalar value, and you can collect into a Vec<char> and then slice it, as an example of one of those things. And that’s still O(1) at the cost of using up to four times the memory.)
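For instance (a stdlib-only sketch):

```rust
fn main() {
    let s = "ab早c";
    // Collecting into Vec<char> buys O(1) indexing by scalar value,
    // at the cost of up to four bytes per char.
    let chars: Vec<char> = s.chars().collect();
    assert_eq!(chars[2], '早');
    let first_three: String = chars[..3].iter().collect();
    assert_eq!(first_three, "ab早");
}
```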


Why not this?

  fn main() {
      let a = "ab早".as_bytes();
      let a = &a[..3];
      println!("{:?}", a); // prints the first three bytes: [97, 98, 230]
  }


It's not clear to me what you're suggesting; is it that String shouldn't have supported indexing in the first place? That code does work, but you have a &[u8] not a &str.


But neither is AsciiString. It has an as_str method, but it's still a kludge.

This example was basically a suggestion to throw0u1t: if they want to cut in the middle of a UTF-8 sequence for whatever reason, they can [edit:] do it without extra crates.

What I don't understand is why slices are indexed in bytes and not in objects. If String has the ability to check that we're cutting in the middle of a character sequence, why doesn't it provide the ability to take 3 fully formed characters?


I think the rust designers want to keep the implicit contract that indexing into a string is fast and O(1).

If you want to find the one millionth codepoint of a UTF-8-encoded string, you have to visit more or less every byte of the string (1).

If, on the other hand, you want to find the codepoint that covers the millionth byte, you have to read at most four bytes. Read the millionth byte, and there are three cases:

- it's a full codepoint. If so, you're done.

- it is the first byte of a multi-byte codepoint. If so, read forwards in the string for up to 3 continuation bytes.

- it is a continuation byte. If so, search backwards in the string for the first byte, then, if necessary, read forwards to find more continuation bytes.

So, that is O(1).

(1) you can skip continuation bytes, but these are typically rare.
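The constant-time lookup described above can be sketched with the stdlib's `is_char_boundary` (the `codepoint_start` helper is hypothetical):

```rust
// Find the byte offset where the codepoint covering byte `i` starts, by
// scanning backwards past continuation bytes (pattern 10xxxxxx).
// At most 3 steps, since a codepoint is at most 4 bytes: O(1).
fn codepoint_start(s: &str, mut i: usize) -> usize {
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

fn main() {
    let s = "ab早c"; // '早' occupies bytes 2..5
    assert_eq!(codepoint_start(s, 3), 2); // byte 3 is a continuation byte
    assert_eq!(codepoint_start(s, 5), 5); // byte 5 starts 'c'
}
```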


> What I don't understand is why slices are indexed in bytes and not in objects.

Slicing is an O(1) operation, and that would be an O(n) operation.


It does: `s.chars().take(3)`. It just does it with iterators rather than with indexes because that better communicates the performance characteristics.
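For example:

```rust
fn main() {
    let s = "ab早c";
    // Take the first three codepoints; the cost is proportional to the
    // length of the prefix, and the iterator chain makes that visible.
    let prefix: String = s.chars().take(3).collect();
    assert_eq!(prefix, "ab早");
}
```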


I think he's suggesting that slicing on strings should be by character, and if you want to slice on bytes, you should explicitly ask to treat the string as a byte array. It makes more sense semantically, and it's safe.


Slicing on characters is a linear time operation and indexing is meant to be cheap.


That seems like taking it too far. It's like using pointer arithmetic to index a linked list on the assumption that the nodes happen to be allocated contiguously in memory. I mean, I guess the thinking is, indexing a Unicode string isn't cheap, but indexing strings used to be cheap once upon a time, when strings were encoded in fixed one-byte-per-character representations, so let's pretend that's still the case and panic if it doesn't work out.... That's weirdly antithetical to Rust's purported focus on safety.

Also, you can get the same performance from an operation that returns a byte array instead of a string. If that kind of performance is what you want, then a Unicode string is simply not the right type to use.


Indexing a Unicode string is cheap... if you have a byte index. If you want to count out some fixed number of codepoints, then of course you've just moved the cost to calculating the corresponding byte index. But counting codepoints is almost always the wrong thing to do anyway [1]. In practice, it's more common to obtain indices by inspecting the string itself, e.g. searching for a substring or regex match. In that case, it's faster for the search to just return a byte index; there's no benefit to having it return a codepoint index, and then having to do an O(n) lookup when you try to use the index. And byte indices obtained that way will always be valid character boundaries, so you can use [] without worrying about panics.

You suggest just using a byte array instead, but then you'd lose the guarantee that what you're working with is valid Unicode. Contrary to your assertion, it is useful to have a type that provides that guarantee, yet which can still be operated on efficiently.

[1] https://manishearth.github.io/blog/2017/01/14/stop-ascribing...


Safety is about memory safety. Immediately exiting your program is about as memory safe as it gets.


Panics are not unsafe. Panic exists in Rust because they are safe. If you don't want a panic on index, just don't index.

Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway, because it is an abstraction of text that doesn't provide support to the notion that a "character" is more fundamental than a word or paragraph, etc. Rust's string slicing exists solely to make ASCII text easy to handle. If your text is not ASCII, then you shouldn't be slicing it at all. Thus the panic.


> Indexing into a UTF-8 string doesn't serve any reasonable consistent purpose anyway

If that's true, isn't it the job of a type system to help avoid such nonsensical operations? If "slice" only makes sense for byte arrays and ASCII strings, it could be provided on those types without being defined on UTF-8 strings.

> Panics are not unsafe. Panic exists in Rust because they are safe.

That's "safe" by a very limited definition of safety. It's one step up from undefined behavior, granted, but it's not a very high standard. In practice, in most programs, you'd want to ensure that such a panic would never happen, and personally I think the language's unhelpfulness in that regard is a wart.


>If that's true, isn't it the job of a type system to help avoid such nonsensical operations?

It's not strictly true, because there are situations where you want to slice UTF-8. For instance, if you already know where the code point boundaries are for newlines. But if you know that, then you've run something like a regex with >O(1) behavior and you certainly wouldn't want string slicing to do redundant work.

>That's "safe" by a very limited definition of safety

That's the definition of safe that is used. Safety in the context of Rust means memory safety. (Division can panic, btw.) If you don't see why undefined behavior is so much worse than a panic, then do some research on it. If you want programs that never fail, you need a comprehensive plan that takes into account things like hardware failure. A programming language can't do that.


I think that's too extreme. There are many legitimate reasons to slice non-ASCII text - for example, to split it on newlines.


That's not trivial and different languages vary in how they handle new line characters even. https://stackoverflow.com/questions/44995851/how-do-i-check-...


You can still split non-ASCII text on ASCII newlines, and quite often that's exactly what needs to be done.


And usually, you don't want it to cost O(n) on top of whatever parser you ran to find those newlines.


Go indexes strings by byte, even though there's a builtin type called rune which represents a Unicode code point. This is yet another footgun. Is there a language that doesn't handle this poorly?

https://play.golang.org/p/CkBp0w8T621


In Rust, you're supposed to use `unicode-segmentation`[1] if you need to split on logical characters (grapheme clusters in the Unicode standard). Otherwise, the iterator `.bytes()` emits raw bytes, and `.chars()` emits Unicode code points.

Basically, string indexing is a lot harder than it seems at first glance, depending on what you want.


One nitpick: `.chars`[1] gives you an iterator[2] of `char`s[3], each of which is a four-byte value holding one Unicode scalar value. This means that `"asdf".chars().collect::<Vec<char>>()` will have a different size to `"asdf"` and `"asdf".chars().as_str()`. `.chars()` will never give you an incomplete codepoint, but it will give you incomplete characters, as you could have many c̶̼̟̏ó̷̘̉n̴̖̞̏̇t̸̡̃ĭ̸̻̬n̴̯͉̂͑ṵ̴̑a̷̛̫̳ẗ̸͕́i̷̱̫̓̋ǫ̸̑ǹ̶̼̅s̸̩̾̌ to represent what visually are a single char.

[1]: https://doc.rust-lang.org/std/string/struct.String.html#meth...

[2]: https://doc.rust-lang.org/std/str/struct.Chars.html

[3]: https://doc.rust-lang.org/std/primitive.char.html
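A stdlib-only illustration of the gap between codepoints and what a reader sees as one character ('é' built from 'e' plus a combining accent):

```rust
fn main() {
    let s = "e\u{0301}"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    assert_eq!(s.chars().count(), 2); // two codepoints...
    assert_eq!(s.len(), 3);           // ...three bytes...
    // ...but one grapheme cluster, i.e. one user-perceived character.
}
```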


> visually are a single char

IIRC that's what grapheme clusters are for.


UTF-8 is at odds with efficient array indexing. I like Python's approach where bytes and strings are distinct types, though I have no idea what it is doing under the hood.


Modern Python uses whatever representation is sufficient to ensure one unit per codepoint for a given string (which it can do on creation, since strings are immutable). So you get Latin-1, UCS-2 (no surrogate pairs), or UCS-4.

This is great for high-level code, but painful to work with from native code, because it usually needs some specific encoding to call into other libraries, and it's usually UTF-8 - so you need to re-encode all the time.


I actually had to work with Python strings at the C level recently, and their approach is pretty clever. IIRC, the runtime can take any common form of Unicode, and will store it. When you access that string, the accessor requests a specific encoding, and the runtime will convert if need be, and then store it in the string object.

So it handles the (very) common case of needing the same encoding multiple times (e.g. for all file paths on Windows), while not introducing too much overhead in memory or speed.

I could be mistaken on exact details though, especially since I recall there being multiple implementations even within py3.x.


Any idea how it handles indexing? Does it convert everything to 32 bit chars and ignore graphemes?


Go allows slicing UTF8 strings just fine: https://play.golang.org/p/eUQ5L58KwZy



