Here are some notes from the meeting yesterday. I'm typing them up while listening to a talk, so I hope they make sense.
The plan is to avoid conversions if we can.
We'll introduce a PDO::ATTR_CHARSET attribute with a handful of pre-defined (integer) constant values:
It would also be nice to be able to pass in an IANA charset encoding name and have that work, but we needn't have that ready in time for the preview. The user-facing interface for that would be to pass the IANA name string in instead of a PDO::CHARSET_XXX constant.
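To make the "constant or IANA name" idea a bit more concrete, here is a rough sketch in Python of how the attribute value might be normalised. The constant names and values are made up for illustration (the real PDO::CHARSET_XXX set is not pinned down here), and Python's codec registry is only standing in for a proper IANA name table.

```python
import codecs

# Hypothetical stand-ins for the PDO::CHARSET_XXX integer constants;
# the real names and values are not settled in these notes.
CHARSET_LATIN1 = 1
CHARSET_UTF8 = 2
CHARSET_UTF16 = 3

_CONSTANT_TO_NAME = {
    CHARSET_LATIN1: "iso-8859-1",
    CHARSET_UTF8: "utf-8",
    CHARSET_UTF16: "utf-16",
}


def resolve_charset(value):
    """Return a canonical codec name for an integer constant or a name string."""
    if isinstance(value, int):
        try:
            name = _CONSTANT_TO_NAME[value]
        except KeyError:
            raise ValueError(f"unknown charset constant: {value}") from None
    elif isinstance(value, str):
        name = value
    else:
        raise TypeError("charset must be an integer constant or a name string")
    # codecs.lookup raises LookupError for names it does not recognise,
    # and normalises aliases and case to a single canonical name.
    return codecs.lookup(name).name
```

With this shape, resolve_charset(CHARSET_UTF8) and resolve_charset("UTF-8") end up at the same canonical name, so the rest of the code only ever deals with one representation.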
The charset attribute will specify the default disposition for data going into and coming out of the database. It is the encoding that the script prefers to work with. This is stored in the dbh.
The driver will take this value and attempt to set the server connection to use that character set, so that the data returned already matches the expectation and no explicit conversion is needed.
When a dataset is fetched and the columns described, each column will have some encoding information; this has to be done at a per-column level because some databases have a notion of unicode-encoded fields while the rest are, say, latin-1. When PDO gets the columns, it matches up the per-column encoding with the PDO::ATTR_CHARSET and will convert the data to match PDO::ATTR_CHARSET as needed. Ideally no conversion will be required.
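The per-column matching could look something like the following Python sketch. The function names are hypothetical, and using None to mark a binary column that is never converted is my assumption for the example, not something decided above.

```python
import codecs


def convert_column(raw, column_charset, preferred_charset):
    """Return the raw column bytes re-encoded in the preferred charset.

    column_charset is None for binary columns (an assumption of this
    sketch); those are passed through untouched.
    """
    if column_charset is None:
        return raw
    # Compare canonical names so that aliases such as "latin1" and
    # "iso-8859-1" are treated as the same encoding.
    if codecs.lookup(column_charset).name == codecs.lookup(preferred_charset).name:
        return raw  # the ideal case: no conversion required
    return raw.decode(column_charset).encode(preferred_charset)


def convert_row(row, column_charsets, preferred_charset):
    """Apply convert_column to every value in a fetched row."""
    return [
        convert_column(value, charset, preferred_charset)
        for value, charset in zip(row, column_charsets)
    ]
```

For a row with one latin-1 column and one utf-8 column and a preferred charset of utf-8, only the first value is re-encoded; the second is handed back as-is.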
The doer and preparer driver methods need to be expanded to accept a parameter that specifies whether the input is 8bit (binary or the PDO::ATTR_CHARSET for non utf16) or utf16. This parameter needs to be propagated to the query parser, which needs to be expanded to understand utf16. To reduce maintenance (don't really want two almost identical copies of the re2c in there) the “easy” approach would be to make it understand utf16 only and convert input to utf16 for the parser run. I think the cost of conversion here will be too small to notice compared to the cost of doing the query and waiting for the results, but I'm happy to be proved wrong by benchmarks once we have the preview release.
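As a rough illustration of the "convert to utf16 for the parser run" approach, here is a toy sketch in Python. It is not the re2c scanner: it only counts ? placeholders outside single quotes. The function names, the is_utf16 flag, and the choice of little-endian UTF-16 without a byte order mark are all assumptions for the example, and the 8bit input is taken to be text in the PDO::ATTR_CHARSET rather than raw binary.

```python
import struct


def query_for_parser(query, is_utf16, preferred_charset):
    """Return the query as UTF-16LE bytes for a UTF-16-only parser."""
    if is_utf16:
        return query  # already in the form the parser understands
    # 8bit input: treat it as text in the preferred charset and widen it.
    return query.decode(preferred_charset).encode("utf-16-le")


def count_placeholders_utf16(query_utf16le):
    """Count '?' placeholders outside single-quoted literals.

    Works on 16-bit code units, so a byte that merely looks like '?'
    inside a wider character cannot be mistaken for a placeholder.
    """
    if len(query_utf16le) % 2:
        raise ValueError("UTF-16 input must have an even number of bytes")
    units = struct.unpack(f"<{len(query_utf16le) // 2}H", query_utf16le)
    in_quote = False
    count = 0
    for unit in units:
        if unit == 0x27:  # single quote toggles literal state
            in_quote = not in_quote
        elif unit == 0x3F and not in_quote:  # question mark
            count += 1
    return count
```

The conversion is a single pass over the query text, which is why I expect its cost to disappear next to the round trip to the server.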
When parameters are passed into the driver, the driver will be responsible for handling any conversions that might be required to pass that data to the server. String data will be assumed to be in the PDO::ATTR_CHARSET form, unless it is a unicode zval string being passed, in which case we know that it is utf16.
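The parameter rule can be sketched in Python along these lines. Utf16String is a hypothetical stand-in for a unicode zval string, bind_param and server_charset are names invented for the example, and little-endian code units are assumed.

```python
import codecs
from dataclasses import dataclass


@dataclass(frozen=True)
class Utf16String:
    """Hypothetical stand-in for a unicode zval string (UTF-16LE bytes)."""
    data: bytes


def bind_param(value, preferred_charset, server_charset):
    """Return the bytes the driver would send to the server for one parameter."""
    if isinstance(value, Utf16String):
        # A unicode string is known to be utf16, whatever the attribute says.
        text = value.data.decode("utf-16-le")
    elif isinstance(value, bytes):
        # Plain string data is assumed to be in the preferred charset.
        same = (codecs.lookup(preferred_charset).name
                == codecs.lookup(server_charset).name)
        if same:
            return value  # nothing to convert
        text = value.decode(preferred_charset)
    else:
        raise TypeError("expected bytes or Utf16String")
    return text.encode(server_charset)
```

Both bind_param(b"caf\xe9", "latin-1", "utf-8") and the same text passed as a Utf16String produce identical bytes for a utf-8 server connection, which is the property the driver needs.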
Let's think on this for a couple of days before we specify exactly where these bits fit in the structures and so on.
--Wez.