
Why is the PBF StringTable defined to use bytes?

The Protocol Buffers definition of StringTable declares its entries as repeated bytes:

message StringTable {
   repeated bytes s = 1;
}

To the best of my knowledge, this could just as easily be written as:

message StringTable {
   repeated string s = 1;
}

Protocol Buffers already defines string as UTF-8 text (or its 7-bit ASCII subset). As it stands, I can't find any definition of how strings should be encoded in PBF, and that is very bad.

pbf encoding

asked 26 Jul '13, 06:32

he_the_great

edited 26 Jul '13, 06:33

This is a very technical question and you are more likely to hear answers to this on the dev list (lists.openstreetmap.org/listinfo/dev).

(26 Jul '13, 08:49) Frederik Ramm ♦

One Answer:

From the Protobuf documentation, it looks like "bytes" and "string" are treated almost the same everywhere. Internally they seem to be the same (both are length-delimited fields on the wire). Differences might only show up when setting or getting the data, depending on the language used, because some languages have special types for UTF-8 strings. When using Protobuf from C++, there is no difference between these two types.
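To illustrate why the two types look the same on the wire, here is a minimal sketch that hand-encodes a single field the way the Protobuf encoding rules describe for length-delimited values (tag, varint length, payload). It does not use the protobuf library; the helper names are made up for this example, and field number 1 is taken from the StringTable definition in the question.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte
            return bytes(out)


def encode_length_delimited(field_number: int, payload: bytes) -> bytes:
    """Tag (wire type 2) + varint length + raw payload."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def encode_as_bytes_field(raw: bytes) -> bytes:
    # Hypothetical helper: what "repeated bytes s = 1" would emit per entry.
    return encode_length_delimited(1, raw)


def encode_as_string_field(text: str) -> bytes:
    # Hypothetical helper: what "repeated string s = 1" would emit per entry.
    # The only extra step is turning the text into UTF-8 first.
    return encode_length_delimited(1, text.encode("utf-8"))


value = "Köln"
as_bytes = encode_as_bytes_field(value.encode("utf-8"))
as_string = encode_as_string_field(value)
print(as_bytes == as_string)
```

So a writer that puts UTF-8 into the bytes field produces exactly the output it would produce with a string field; the declared type only changes what the generated accessors hand back.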

I don't know what the original reason was; maybe it was to optimize away any UTF-8 validity check that the library might do internally. But I don't really see any difficulties. All strings in OSM are always UTF-8, so these are, too. If that's not documented, it should be.
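In practice this means a reader has to do the decoding itself. A minimal sketch, assuming the StringTable entries have already been parsed into a list of Python bytes objects (the function name and the sample values are made up for illustration):

```python
from typing import List


def decode_string_table(entries: List[bytes]) -> List[str]:
    """Decode raw StringTable entries, assuming they are UTF-8.

    Strict decoding raises UnicodeDecodeError on invalid input, which is
    roughly the check a "string" field could have performed in the library.
    """
    return [entry.decode("utf-8") for entry in entries]


# Sample data only, not taken from a real PBF file.
raw_entries = [b"", b"highway", b"residential", "Straße".encode("utf-8")]
strings = decode_string_table(raw_entries)
print(strings)
```

Whether to fail hard on invalid bytes or to substitute replacement characters (errors="replace") is a choice for the consuming application.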

answered 28 Jul '13, 08:01

Jochen Topf