PHP RFC: New Curl URL API

Introduction

Since version 7.62.0, libcurl features a new URL API that can be used to parse and generate URLs using libcurl’s own parser. One goal of this API is to close a vulnerable gap in applications, where a separate URL parser library interprets a URL one way and libcurl another. Such mismatches can lead, and sometimes have led, to security problems.

Proposal

There are many different ways this API could be exposed to userland, and recent discussions showed that there is not yet a consensus on how it should be done. Currently, all existing ext/curl functions are thin wrappers over libcurl. The proposal is to add the new Curl URL API while keeping it consistent with the other curl_ functions. This implementation is a simple one-to-one binding of the libcurl functions. The underlying CURLU handle will be exposed as an opaque CurlUrl object.

The implementation would add three new functions, curl_url(), curl_url_set() and curl_url_get(), that will be used to manipulate the CurlUrl object/handle.

All CURLUPART_* and CURLU_* constants will be exposed as global constants with the same name in userland.

One new Curl option will also be available: CURLOPT_CURLU. Curl will use the given object and will not change its contents.
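A minimal sketch of how the proposed option might be used, assuming a PHP build that provides the functions and constants described in this RFC. They are not assumed to exist otherwise, so the helper checks for them first. The helper name fetch_with_curl_url() is hypothetical.

```php
<?php
// Hypothetical helper: runs a transfer driven by a CurlUrl object.
// Returns null when the API proposed in this RFC is not available
// or when the transfer fails.
function fetch_with_curl_url(string $rawUrl): ?string
{
    if (!function_exists('curl_url') || !defined('CURLOPT_CURLU')) {
        return null; // proposed API not present in this build
    }

    $url = curl_url($rawUrl);               // parsed by libcurl's own parser

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_CURLU, $url);  // curl reads the object and leaves it unchanged
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

Because the same CurlUrl object is used for parsing and for the transfer, the application and libcurl cannot disagree about what the URL means.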

These classes and functions will only exist if the version of libcurl installed on the system is 7.62.0 or newer.
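Since availability depends on the libcurl version, portable code would likely need a feature check. The sketch below is one way to do that. Both helper names are hypothetical, and the version check reports the libcurl version in use at run time, which may differ from the version ext/curl was compiled against.

```php
<?php
// Hypothetical helper: true when the functions and class proposed in this RFC exist.
function curl_url_api_available(): bool
{
    return function_exists('curl_url')
        && function_exists('curl_url_set')
        && function_exists('curl_url_get')
        && class_exists('CurlUrl', false);
}

// Hypothetical helper: compares the run-time libcurl version with a minimum.
function libcurl_at_least(string $minimum): bool
{
    if (!function_exists('curl_version')) {
        return false; // ext/curl is not loaded at all
    }

    $info = curl_version();
    if (!is_array($info) || !isset($info['version'])) {
        return false;
    }

    return version_compare($info['version'], $minimum, '>=');
}
```

A caller could fall back to another URL parser when curl_url_api_available() returns false, and use libcurl_at_least('7.78.0') before relying on a version-specific constant such as CURLU_ALLOW_SPACE.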

const CURLUPART_FRAGMENT = UNKNOWN;
const CURLUPART_HOST = UNKNOWN;
const CURLUPART_OPTIONS = UNKNOWN;
const CURLUPART_PASSWORD = UNKNOWN;
const CURLUPART_PATH = UNKNOWN;
const CURLUPART_PORT = UNKNOWN;
const CURLUPART_QUERY = UNKNOWN;
const CURLUPART_SCHEME = UNKNOWN;
const CURLUPART_URL = UNKNOWN;
const CURLUPART_USER = UNKNOWN;
 
const CURLU_APPENDQUERY = UNKNOWN;
const CURLU_DEFAULT_PORT = UNKNOWN;
const CURLU_DEFAULT_SCHEME = UNKNOWN;
const CURLU_DISALLOW_USER = UNKNOWN;
const CURLU_GUESS_SCHEME = UNKNOWN;
const CURLU_NO_DEFAULT_PORT = UNKNOWN;
const CURLU_NON_SUPPORT_SCHEME = UNKNOWN;
const CURLU_PATH_AS_IS = UNKNOWN;
const CURLU_URLDECODE = UNKNOWN;
const CURLU_URLENCODE = UNKNOWN;
 
/* libcurl >= 7.65.0 */
const CURLUPART_ZONEID = UNKNOWN;
 
/* libcurl >= 7.67.0 */
const CURLU_NO_AUTHORITY = UNKNOWN;
 
/* libcurl >= 7.78.0 */
const CURLU_ALLOW_SPACE = UNKNOWN;
 
function curl_url(?string $url = null): CurlUrl {}
function curl_url_set(CurlUrl $url, int $part, ?string $content, int $flags = 0): void {}
function curl_url_get(CurlUrl $url, int $part, int $flags = 0): ?string {}
 
final class CurlUrl {
    public function __clone() {}
}

curl_url(?string $url = null): CurlUrl

Create a new CurlUrl object. If $url is set, the object will be initialized using this URL. Otherwise, all the parts will be set to null.

Any error reported by libcurl will be thrown as a CurlUrlException.

curl_url_set(CurlUrl $url, int $part, ?string $content, int $flags = 0): void

Update individual pieces of the URL. The $part argument identifies the particular URL part to set or change (CURLUPART_*). Setting a part to a null value will effectively remove that part's contents from the CurlUrl object.

The $flags argument is a bitmask with individual features.

Any error reported by libcurl will be thrown as a CurlUrlException.

| Supported flags | Description |
|---|---|
| CURLU_NON_SUPPORT_SCHEME | If set, allows this function to set a non-supported scheme. |
| CURLU_URLENCODE | If set, URL encodes the part. |
| CURLU_DEFAULT_SCHEME | If set, allows the URL to be set without a scheme, in which case the scheme will be set to the default: HTTPS. Overrides the CURLU_GUESS_SCHEME option if both are set. |
| CURLU_GUESS_SCHEME | If set, allows the URL to be set without a scheme and it instead “guesses” which scheme was intended based on the host name. If the outermost sub-domain name matches DICT, FTP, IMAP, LDAP, POP3 or SMTP then that scheme will be used; otherwise it picks HTTP. Conflicts with the CURLU_DEFAULT_SCHEME option which takes precedence if both are set. |
| CURLU_NO_AUTHORITY | If set, skips authority checks. The RFC allows individual schemes to omit the host part (normally the only mandatory part of the authority), but libcurl cannot know whether this is permitted for custom schemes. Specifying the flag permits empty authority sections, similar to how file scheme is handled. |
| CURLU_PATH_AS_IS | If set, makes libcurl skip the normalization of the path. That is the procedure where curl otherwise removes sequences of dot-slash and dot-dot etc. |
| CURLU_ALLOW_SPACE | If set, the URL parser allows space (ASCII 32) where possible. The URL syntax normally does not allow spaces anywhere, but they should be encoded as %20 or '+'. When spaces are allowed, they are still not allowed in the scheme. When space is used and allowed in a URL, it will be stored as-is unless CURLU_URLENCODE is also set. |
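The following sketch shows how curl_url_set() and its flags might be combined, assuming a PHP build that provides the API proposed in this RFC. The helper name build_example_url() and the host and path values are hypothetical.

```php
<?php
// Hypothetical helper: builds a URL with the proposed functions.
// Returns null when the API proposed in this RFC is not available.
function build_example_url(): ?string
{
    if (!function_exists('curl_url')) {
        return null;
    }

    $url = curl_url(); // empty object, all parts are null

    // No scheme in the input: CURLU_DEFAULT_SCHEME lets libcurl apply its default.
    curl_url_set($url, CURLUPART_URL, 'example.com/docs', CURLU_DEFAULT_SCHEME);

    // Replace the path and ask libcurl to URL encode it.
    curl_url_set($url, CURLUPART_PATH, '/my files/report.txt', CURLU_URLENCODE);

    // Setting a part to null removes it from the object.
    curl_url_set($url, CURLUPART_FRAGMENT, null);

    return curl_url_get($url, CURLUPART_URL);
}
```

Flags are plain integers, so several of them can be combined with the bitwise OR operator, for example CURLU_DEFAULT_SCHEME | CURLU_PATH_AS_IS.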

curl_url_get(CurlUrl $url, int $part, int $flags = 0): ?string

This function lets the user extract individual pieces from the $url object. If the requested part is not set, this function will return null. Any other error reported by libcurl will be thrown as a CurlUrlException.

The $part argument identifies the particular URL part to extract.

The $flags argument is a bitmask with individual features.

| Supported flags | Description |
|---|---|
| CURLU_DEFAULT_PORT | If the object has no port stored, this option will make the function return the default port for the used scheme. |
| CURLU_DEFAULT_SCHEME | If the object has no scheme stored, this option will make the function return the default scheme instead of null. |
| CURLU_NO_DEFAULT_PORT | Instructs the function to not return a port number if it matches the default port for the scheme. |
| CURLU_URLDECODE | If set, the function will URL decode the contents before returning them. If there are any byte values lower than 32 in the decoded string, the get operation will return an error instead. |
| CURLU_URLENCODE | If set, the function will URL encode the host name part. If not set (default), libcurl returns the URL with the host name “raw” to support IDN names to appear as-is. IDN host names are typically using non-ASCII bytes that otherwise will be percent-encoded. Note that even when not asking for URL encoding, the '%' (byte 37) will be URL encoded to make sure the host name remains valid. |
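As an illustration of curl_url_get(), the sketch below splits a URL into its parts, assuming a PHP build that provides the API proposed in this RFC. The helper name describe_url() is hypothetical.

```php
<?php
// Hypothetical helper: returns the main parts of a URL as an array.
// Returns null when the API proposed in this RFC is not available.
function describe_url(string $rawUrl): ?array
{
    if (!function_exists('curl_url')) {
        return null;
    }

    $url = curl_url($rawUrl);

    return [
        'scheme' => curl_url_get($url, CURLUPART_SCHEME),
        'host'   => curl_url_get($url, CURLUPART_HOST),
        // Falls back to the scheme's default port when none is stored.
        'port'   => curl_url_get($url, CURLUPART_PORT, CURLU_DEFAULT_PORT),
        // CURLU_URLDECODE asks libcurl to percent-decode the returned part.
        'path'   => curl_url_get($url, CURLUPART_PATH, CURLU_URLDECODE),
        // Parts that are not set come back as null.
        'query'  => curl_url_get($url, CURLUPART_QUERY),
    ];
}
```

Every value in the returned array is either a string or null, matching the ?string return type of curl_url_get().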

CurlUrlException

The CurlUrlException class represents an error raised by libcurl. The constants exposed in this class are all the codes that CurlUrlException::getCode() could return. These codes are internally mapped to the CURLUE_* error codes that libcurl could raise. The set of constants may vary depending on the version of libcurl ext/curl was compiled with.

If ext/curl was compiled with libcurl > 7.80 then CurlUrlException::getMessage() will return a user-friendly message that describes the problem (for example: Malformed input to a URL function).

/* libcurl >= 7.62.0 */
final class CurlUrlException extends Exception
{
    public const BAD_PORT_NUMBER = UNKNOWN;
    public const MALFORMED_INPUT = UNKNOWN;
    public const OUT_OF_MEMORY = UNKNOWN;
    public const UNSUPPORTED_SCHEME = UNKNOWN;
    public const URL_DECODING_FAILED = UNKNOWN;
    public const USER_NOT_ALLOWED = UNKNOWN;
 
    /* libcurl >= 7.81.0 */
    public const BAD_FILE_URL = UNKNOWN;
    public const BAD_FRAGMENT = UNKNOWN;
    public const BAD_HOSTNAME = UNKNOWN;
    public const BAD_IPV6 = UNKNOWN;
    public const BAD_LOGIN = UNKNOWN;
    public const BAD_PASSWORD = UNKNOWN;
    public const BAD_PATH = UNKNOWN;
    public const BAD_QUERY = UNKNOWN;
    public const BAD_SCHEME = UNKNOWN;
    public const BAD_SLASHES = UNKNOWN;
    public const BAD_USER = UNKNOWN;
}
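Error handling with the proposed exception might look like the sketch below, assuming a PHP build that provides the API proposed in this RFC. The helper name parse_or_report() and the returned messages are hypothetical.

```php
<?php
// Hypothetical helper: parses a URL and reports the outcome as a string.
function parse_or_report(string $rawUrl): string
{
    if (!function_exists('curl_url')) {
        return 'Curl URL API not available';
    }

    try {
        $url = curl_url($rawUrl);
        $host = curl_url_get($url, CURLUPART_HOST);

        return 'host: ' . ($host ?? '(none)');
    } catch (CurlUrlException $e) {
        // getCode() returns one of the class constants listed above.
        if ($e->getCode() === CurlUrlException::MALFORMED_INPUT) {
            return 'malformed URL';
        }

        return 'libcurl URL error code ' . $e->getCode();
    }
}
```

Comparing getCode() against the class constants avoids depending on getMessage(), whose text depends on the libcurl version.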

Why not an OO API?

With so little time left before the 8.2 feature freeze, it is safer to keep the whole curl extension API consistent than to rush a new OO design. The discussions showed that we are not close to a consensus on such a design.

Future Scope

A better OO API could be discussed and implemented in future PHP versions.

Backward Incompatible Changes

None, except that the new class and function names will be declared by PHP and will conflict with applications that declare the same class or function names in the global namespace.

Proposed PHP Version(s)

8.2

Vote

Voting opened on 2022-07-04 and closes on 2022-07-19

Add proposed new functional Curl URL API
Real name Yes No
asgrim (asgrim)  
bwoebi (bwoebi)  
cmb (cmb)  
crell (crell)  
cschneid (cschneid)  
dharman (dharman)  
evvc (evvc)  
galvao (galvao)  
imsop (imsop)  
kalle (kalle)  
kguest (kguest)  
levim (levim)  
lufei (lufei)  
marandall (marandall)  
mbeccati (mbeccati)  
nicolasgrekas (nicolasgrekas)  
ocramius (ocramius)  
pierrick (pierrick)  
santiagolizardo (santiagolizardo)  
sergey (sergey)  
svpernova09 (svpernova09)  
timwolla (timwolla)  
twosee (twosee)  
weierophinney (weierophinney)  
Final result: 10 Yes, 14 No
This poll has been closed.

Patches and Tests

Not yet available.

Implementation

N/A

rfc/curl-url-api.txt · Last modified: 2022/07/19 16:43 by pierrick