PHP RFC: Add validation functions to filter module

Version: 1.0
Date: 2016-08-03
Author: Yasuo Ohgaki yohgaki@ohgaki.net
Status: Declined
First Published at: http://wiki.php.net/rfc/add_validate_functions_to_filter

Introduction

Input data validation is the most important security measure in software security.

These recommends input validation by whitelist and accept only valid one.

We have filter module for this purpose, but it has problems

Functions are designed to “filter/convert value and accept” basically even with validation filter. (It accepts invalid also)
- Allow undefined input (empty element in input array) by default.
- They converts invalid values to NULL/FALSE, URL filter converts inputs to lower case. These behavior makes it difficult to identify input validation error. This is especially a problem with array value validation. e.g. “$orig_array == $check_array” cannot be done.
- Input validation must pass under normal condition. Input data validation error should result in exception error by default, but they are not.
String validation filter is missing even if string is the most dangerous input.
Multiple validation filters are not allowed. There are cases that we would like to use multiple filters, especially for strings. e.g. Check string length and encoding, then perform regex check.

Even if filter module is for better security, there are users misuse the module and result in bad consequences.

This RFC encourage users to proper secure coding by adding more suitable functions and filter for input validations.

Secure coding basics

A fundamental idea of secure coding is input and output control. Proposed new functions are supposed to use for input data validations, not for input error check in business logic.

Input data validation is better to think as input data assertion which should never fail under normal circumstances. Nature of input validation is differ from wrong input data handling from users which would happens normal conditions. User input mistakes, logically inconsistent data, e.g. date is past date for reservations, should not handled by input data validation part in general, but in business logic.

NOTE: Input data validation is runtime assertion should never fail under normal circumstances. It should not handle user input mistakes nor logical inconsistencies.

WARNING: Input and output handling is independent. Output code is responsible to make sure output is safe for external computer/software. Output data should be safe regardless of input validation. i.e. Programmer must escape/use secure API/validate (or Make sure 120% safety of the data) all output data always.

What programmers should do for input validation(assertion) is to accept only valid inputs. Followings are blacklist of inputs, but programmers must think of whitelist (what's to accept, reject anything else), NOT blacklist, when they write input validation code.

Broken char encoding (Accept only valid encoding)
NUL, etc control chars in string. (Accept only chars allowed)
Too long or too short string. e.g. JS validated values and values set by server programs like <select>/<input type=radio>/etc, 100 chars for username, 1000 chars for password, empty ID for a database record, etc. (Accept only strings within range)
Too large or too small numerics. i.e. Int/float/bool value (Accept only numeric within range)
Too many or too few inputs. (Accept only input that has expected number of inputs)
Broken number string for a database record ID. (Accept only valid string format)
Broken flags. i.e. Bool value (Accept only valid value for bool)
Newline chars in <input>, hash value, etc. (Accept only valid string format/chars)
Broken string/date format. e.g. JS validated phone number, list items such as country names, date string, etc. (Accept only valid string format)
and so on.

Not all of them can be validated at input validation. How/what input could be validated is depended on input source spec. For example, if you do client side validation in your system, you can validate strings strictly. e.g. Date string. If you don't do client side validation at all and using plain <input> for date, your validation code cannot do much. However, a string over 100 chars, string contains control char(s) or broken char encoding for date is good enough to be rejected as a invalid input.

Dividing input data validation and user input mistake handling in business logic makes software simpler and easier to maintain. Input data format is more stable than business logic by nature. e.g. Object interface is more stable than object implementation. Simplicity and maintainability is important for security also.

SUMMARY: Input data validation should accept only valid and possible inputs. If not, reject it and terminate program. There is no point to keep running program with invalid input data that cannot work correctly. Logic should take care of the rest that input validation cannot check.

The most important input validation is application level validation, but input validation is not limited to it.

Please refer to mentioned secure cording practices, Design by Contract(DbC) for more details. DbC requires proper runtime input validations. Proposed validation functions can be used for this purpose.

Proposal

Followings are filter module improvement proposals.

Add validation functions

Add filter_require_var_array()/filter_require_input_array()/filter_require_var()/filter_require_input()

array filter_require_var_array ( array $data , mixed $definition [, int $function_options ] )
mixed filter_require_var ( mixed $variable , int $filter [, mixed $options ] )
array filter_require_input_array ( int $type , mixed $definition [, int $function_options ] )
mixed filter_require_input ( int $type , string $variable_name , int $filter [, mixed $options ] )

They are almost the same as filter_var/input*() functions. Key differences compared to other filter_var/input*() functions are:

Raise FilterValidateException when they detect invalid input.
Requires to define filter. (Default must be set by user)
Conservative default. Empty element is not added by default. They do not trim spaces in int/float/bool like input data.
$function_options are FILTER_OPTS_ADD_EMPTY (Empty input to NULL) and/or FILTER_OPTS_DISABLE_EXCEPTION (Return FALSE rather than exception)

NOTE: Main motivation of adding these functions is “filter_var_array()/filter_input_array() is not suitable for strict input validation”. See Discussion section.

Add filter_check_definition() - Check definition array for filter_require_*_array()/filter_*_array()

bool filter_check_definition (array $definition_of_array_value_filter_and_validation)

Filter definition error is silently ignored for performance reason. i.e. It does not check possible typo and malformed elements. Definition error could be fatal bug. This function provides check feature finds typo, format error.

Limitations and Notes:

filter_check_definition() only checks format, not semantics. i.e. It does not check if options/flags are suitable for filter.
Callback filter can be used for validations, but it is user's responsibility to raise FilterValidateException when there is validation error.
filter_require_*() functions share filter_var/input*() function's validation filter. Therefore,
- filter_require_*() functions do not keep input data type. Data type is changed according filter used. i.e. INT/FLOAT/BOOL filters convert data type.
- FILTER_VALIDATE_INT/FILTER_VALIDATE_FLOAT/FILTER_VALIDATE_BOOLEAN validation DO NOT trim spaces and converts to int/float/bool type. Spaces raise exception.
- FILTER_VALIDATE_INT validation converts base 10, base 8(FILTER_FLAG_ALLOW_OCTAL), base 16(FILTER_FLAG_ALLOW_HEX) integer values to int type. In addition, it detects overflow, so be careful when your program must run nicely on both 32 and 64 bit architecture. NOTE: One must not use FILTER_VALIDATE_INT for database record ID validation. Use string validation filter and FILTER_VALIDATE_STRING_NUM.
- FILTER_VALIDATE_BOOLEAN validation converts 1/true/yes/on(case insensitive) to TRUE, 0/false/no/off(case insensitive) to FALSE.
- FILTER_VALIDATE_BOOLEAN does NOT allow empty to FALSE conversion. Use FILTER_FLAG_BOOL_ALLOW_EMPTY to achieve filter_var/input*() like behavior.
- Data type conversions is good for 'declare(strict_types=1)' ZendEngine switch, so it is retained.
Since Exception terminates execution where it is raised, return value from filter_require_*() function is not usable when validation exception is raised. See the example code in “Allow multiple filters for an input” section.

Allow multiple filters for an input

Example is easier to understand. New filter module allows multiple filters for both validation/sanitize filters.

<?php
// Following initialization is to illustrate exception handling.
$myinput = array(
    'some' => 'inputs like $_GET, $_POST, $_COOKIE, $_FILES',
    'or' => 'could be return value from previous validation',
);
 
// Reusable input element validation rule.
// NOTE: This has nonsense rule for usage illustration purpose.
$date_spec =
    array(
        // New filter module allows multiple filters and options as follows.
        // Array elements are evaluated in order. Non array spec is evaluated last.
        // Older implementation ignores this kind of spec silently.
        array( // This is evaluated first.
            'filter'    => FILTER_VALIDATE_STRING,
            'options'   => array('min_bytes' => 10, 'max_bytes' => 10, 'encoding' => FILTER_STRING_ENCODING_PASS)
        ),
        array(
            'filter' => FILTER_VALIDATE_REGEXP,
            'options' => array('regexp' => '/^[0-9]{4}-[0-9]{2}-[0-9]{2}$/')
        ),
        array(
            'filter' => FILTER_VALIDATE_CALLBAK,
            'options' => array('callback' => 'check_date_and_raise_exception_for_invalid()'),
        ),
        'filter' => FILTER_UNSAFE_RAW, // Evaluated last. Does nothing. It's here for an example.
    );
 
 
$definitions = array(
    'date'    => $date_spec,
    'time'    => $time_spec, // Other than 'date' spec, element spec definition is omitted in this example
    'id'      => $id_spec,
    'isbn'    => $isbn_spec,
    'phone'   => $phone_spec,
    'zipcode' => $zipcode_spec,
    'country' => $country_spec,
    'age'     => $age_spec,
    'first_name' => $name_spec,
    'last_name' => $name_spec,
    'address' => $address_spec,
    'filename'=> $filename_spec,
    // and so on
);
 
// Throws FilterValidateException for invalid inputs.
try {
    $myinputs = filter_require_var_array($data, $definitions);
    // NOTE: If you need returned array value, it MUST be inside try block
    //       or catch block MUST terminate execution. Otherwise, returned value
    //       may contain irrelevant values.
    var_dump($myinputs);
} catch (FilterValidateException $e) {
    var_dump($e->getMessage());
    die('Invalid input detected!'); // Should terminate execution when input validation fails
}
// If validation exception is raised and catch block didn't terminate script,
// $myinputs will have irrelevant value from previous initialization.
// WARNING: When validation exception is raised, program MUST NOT reach here.
// If you properly handle validation exceptions, i.e. terminate execution,
// then you can use $myinputs safely outside of try block.
var_dump($myinputs);

Add string validation filter

Add missing string validation filter (FILTER_VALIDATE_STRING). This filter has conservative default. i.e. Strict validation by default.

Features:

Validate string as UTF-8 by default. (Only UTF-8 is supported)
FILTER_STRING_ENCODING_PASS 'encoding' option to disable encoding check.
'min_bytes'/'max_bytes' options for string length. min_bytes default is 2, max_bytes default is 20.
'allowed_chars' option can specify allowed chars. (Only works for code value less than 127)
Single line is allowed by default.
FILTER_FLAG_STRING_ALLOW_NEWLINE flag to allow multi line (\r, \n) inputs.
FILTER_FLAG_STRING_ALLOW_TAB to allow TAB.
FILTER_FLAG_STRING_ALLOW_CNTRL to allow control chars.
FILTER_FLAG_STRING_ALPHA to allow only alphabet
FILTER_FLAG_STRING_NUM to allow only number(digit)
FILTER_FLAG_STRING_ALNUM to allow only alphanumeric

Limitations:

UTF-8 only.
Chars control is limited code less than 127. (Only ASCII chars)

Other changes in validation filter

NOTE: These changes are only applicable when new filter_require*() functions are used

FILTER_VALIDATE_INT/FILTER_VALIDATE_FLOAT/FILTER_VALIDATE_BOOLEAN filters do NOT trim spaces.
All validation filters raise FilterValidateException for validation errors.

Discussions

Why it should be in core?

There are users who misuse current filter module for “secure coding” input validations.

Input validation is the most important security feature. PHP should provide easy to use/reliable/fast input validation feature. We should encourage strict input validation that rejects invalid(attacker) inputs by having stricter input validation features rather than filter(convert) and accept.

This proposal reduces filter module misuse which is built always by default.

Why not compare filter_var_array() result?

Following code may seem to work, but it would not.

$ret = filter_var_array($arr, $validation_spec);
if ($ret != $arr) {
  die('Input does not validate');
}

One should never compare float equality. (Float string is converted to float type. Think of huge string value and result of float converted value comparison.)
They are filter(conversion) functions. e.g. URLs are converted to lowercase.
It allows empty input by default and add NULL element.
int/float/bool validation filters trim and convert type. (They cannot match by “==” comparison)

For these reasons, comparing original and return(filtered) value is not suitable for strict input validation.

Framework should do this task

There are several reasons doing this by PHP itself.

Current validation filters and filter functions are not suitable for input data validation. (Even misleading)
Encourage users to do secure coding by having proper feature. i.e. Validate and accept only valid inputs.
Simple apps should be able to be written by PHP's basic feature. i.e. Input data validation is mandatory for secure coding.
This RFC makes easy to introduce input data validation for any PHP apps. i.e. There are many framework less codes and/or apps built with micro/light framework w/o input validation feature.
It's fast. i.e. Simple array is used for validation spec definition = fast.
This kind of feature is required for DbC. i.e. https://wiki.php.net/rfc/introduce_design_by_contract

Frameworks may implement their own validators with more features, but PHP should have its own usable validator because this feature is mandatory.

PHP is a first choice for Web development because PHP can write simple web apps by simple codes. PHP should try to keep this aspect as much as possible, and try to provide mandatory and/or best practice features. Otherwise, PHP would not make much difference to other languages that require Web application frameworks even for a simple web apps.

Input validation and User input mistake handling difference

Although following reply is long, but it's worthwhile mentioning here.

Hi Stas,

On Mon, Aug 15, 2016 at 2:17 PM, Stanislav Malyshev <smalyshev@gmail.com> wrote:
>> It seems there is misunderstanding.
>> These new functions are intended for "secure coding input validation" that
>> should never fail. It means something unexpected in input data that
>> cannot/shouldn't keep program running. Why do you need to parse
>> message?
>
> I think the problem here is as follows: assume you accept use input. You
> want it to conform to some set of rules. If it does not, you may want to
> inform the user that the input is wrong, in an informative way. Now, if
> you say these functions "should never fail", it implies that before
> them, there would be other functions filtering user input (because user
> input could always violate whatever rules you'd have) - and then the
> question is, would you really want *two* sets of validators? You'd
> probably want one.
> Now, when you have one, you probably want it to validate the data and
> return some information that would be useful for informing the user what
> has gone wrong. That seems to be the issue here.
> I do think having strong input validation is a good thing. However, we'd
> also need to have them in a way that would make them useful in above
> scenario - otherwise people would avoid them because they fail "too
> hard" and the app does not retain enough control over the outcome.

I think this discussion relates to following questions.
I'll try to explain there.

>
>> There is misunderstanding on this.
>> As I wrote explicitly in the RFC, input validation and user input
>> mistakes must be handled differently.
>>
>> "The input validation (or think it as assertion or requirement) error"
>> that this RFC is dealing, is should never happen conditions (or think
>> it as contract should never fail).
>
> This is what I'm not sure I understand - when this approach would be
> used? I.e. if I get data from the user, I surely can not claim I can
> impose any conditions on the data that would never fail. Is it assumed
> I'd pre-filter the data before passing it to this filter?

How and what rules could be imposed to inputs varies depending on
what kind of data should be sent from outsides of a software including
human users.

Let's say your app validate user written/chosen "Date" on client side by
JavaScript. Then browser must send whatever "Date" format you impose
to client. It may be "YYYYMMDD", for example.

Then programer should not accept "Date" format other than "YYYYMMDD"
because other format is invalid. Accepting format other than "YYYYMMDD"
does only bad and increase risks of program malfunctioning. i.e. All kinds
of injections like JavaScript, SQL, Null char, Newline, etc.

The basic idea of secure coding input validation is to remove all unnecessary
security risks at "Input Validation".

Even when "Date" field is plain <input> that user can write any chars,
Null char, CR/LF, TAB or any CNTRL chars should not be in there. There will
be no users type in 100 chars for "Date" field unless they were trying to tamper
application.

"Input validation" should reject all of them and does not have to inform users
(attackers) to "there is invalid input". If you need to tell  legitimate users
"There is invalid input", then it should be treated by "Business logic", not by
"Input validation".

>
>> The point of having the input validation is accept only inputs that
>> program expects and can work correctly. Accepting unexpected
>> data that program cannot work correctly is pointless.
>
> Well, that depends on what you mean by "accepting". The program should
> exhibit sane behavior (i.e., useful error message, not whitescreen or
> something like that) on bad input. That behavior can be different -
> i.e., if you are given wrong password, you shouldn't be too helpful and
> say "this password is wrong, the right password is this: ...." (you'd
> laugh but there *was* a real application doing this, no, I have no idea
> what the developers were thinking :) but at least you could say
> "authentication details are wrong".

User authentication could do the similar to "Date" field for "User name"
and "Password".

"User name" and "Password" shouldn't have CNTRL chars or invalid char
encoding. Even when fields are plain <input>, there shouldn't be 500 chars
long inputs for them.

Anything else for "User name" and "Password" should be handled by
"Business logic". Logic part should display nice and proper error messages
like

 - User name is too long for 100 chars name.
 - Password is too long for 100 chars password.
 - User name and/or Password is wrong and failed to authenticate.

>> Don't misunderstood me. I'm not saying "You should reject user input
>> mistakes".
>> "User input mistakes" and "input validation error" is totally different
>> error.
>
> Here, again, I am not sure I understand the difference.

The reason why I propose to divide input error checks into "Input validation"
and "Business logic" is for simplicity and maintainability.

"Input validation" should be done not only for human entered inputs, but
also automatically generated inputs by system.

Generally speaking, developers should not accept request that has

Invalid browser headers:
 - Invalid REFERER contains Illegal/CTNRL chars and/or too many chars.
 - Invalid ACCEPT-CHARSET contains Illegal/CNTRL chars and/or too many chars.
 - Invalid ACCEPT-ENCODING contains Illegal/CNTRL chars and/or too many chars.
 - Invalid ACCEPT-LANGUAGE contains Illegal/CNTRL chars and/or too many chars.
 - and so on.

Invalid POST/GET request:
 - Lacks required field by your program. e.g. If you set CSRF token
for POST always, but it's missing.
 - Multi page form inputs and lack/have invalid data that should have
been validated previously. Note: there is design choice for this
where/how to deal with invalid inputs.
 - Program written data is invalid. e.g.
//php.net/show_bug.php?id=[string contains CNTRL chars and/or 100
chars or more]
 - $_POST/$_GET has more than 20 elements. Note: most apps/code would
not have this many elements.

Invalid COOKIE:
 - $_COOKIE has more than 20 elements. Note: normal apps would not
have this many cookies.
 - Lacks required field by your program.
 - Invalid chars. e.g. CNTRL chars.

All of these have history of abuse by attackers and programs should not
accept them. Please note that secure coding requires to output
securely. Input validation and output sanitization should be treated
as individual task. e.g. Escape all variables at "Output" code when
you output something to other software. Never assume, "This var is
validated at input, so it is safe without escaping."

It's developer's choice how to validate inputs, e.g. they don't use
"CONNECTION" HTTP header at all and don't care, but all of secure
coding related guides that I know of recommends/requires to validate
"all inputs".

Validating all inputs that are irrelevant to "Business logic" makes
programs complicated and hard to maintain. Broken char encoding, too
long/short, CNTRL chars for <form> inputs are better to handled by
"Input validation" because the same thing might be done by different
<form>s repeatedly.

There are many possibility for software design. This RFC is designed
to encourage to do certain validation. However, this RFC does not
impose developers to do certain validation, but provides tools that
are needed for validations.

I would not encourage users to disable exception from
filter_require_var()/filter_require_var_array(), but I've changed them
not to raise exception optionally as a last minute change. This allows
developers to use new validator for wider purposes.

Regards,

P.S. I'll extend vote period because there is ongoing discussion.

BTW, ISO 27000/ISMS requires/recommends proposed input validation.
Latest ISO 27000 mentioned as "adopt secure programming". Older
ISO 27000 explained how to validate inputs. New ISO 27000 removed
detailed input validation method explanation because secure programming
is widely adopted and standardized.

--
Yasuo Ohgaki
yohgaki@ohgaki.net

Backward Incompatible Changes

None. filter_var/input*() functions are not changed at all.

Proposed PHP Version(s)

7.1.0 or 7.2.0

RFC Impact

To SAPIs

None

To Existing Extensions

None

To Opcache

None

New Constants

String validation filter flags

FILTER_STRING_ENCODING_PASS - string validation filter encoding do not perform encoding check.
FILTER_STRING_ENCODING_UTF8 - string validation filter encoding (Default)
FILTER_FLAG_STRING_RAW - string validation filter flag for binary.
FILTER_FLAG_STRING_ALLOW_CNTRL - string validation filter flag allows all CNTRL chars
FILTER_FLAG_STRING_ALLOW_TAB - string validation filter flag allows TAB
FILTER_FLAG_STRING_ALLOW_NEWLINE - string validation filter flag allows newlines (\n,\r)
FILTER_FLAG_STRING_ALPHA - string validation filter flag allows alphabet only.
FILTER_FLAG_STRING_NUM - string validation filter flag allows digit only
FILTER_FLAG_STRING_ALNUM - string validation filter flag allows alphabet and digit only.

Bool validation filter flags (filter_var/input*() functions are not affected. It allow empty always)

FILTER_FLAG_BOOL_ALLOW_EMPTY - bool validation flag allows empty string to FALSE

Function behavior options

FILTER_OPTS_ADD_EMPTY - Add NULL for empty element when validation rule is defined for the element.
FILTER_OPTS_DISABLE_EXCEPTION - Disable exception and return FALSE when validation error detected.

php.ini Defaults

No changes

Open Issues

None

Unaffected PHP Functionality

Existing filter features are not changed at all.

Future Scope

Refactor code. Code is not refactored to minimize changes.
Add “optional filter” that allows optional input. Optional filter could be written by “callback filter” with this RFC.

Proposed Voting Choices

This project requires a 2/3 majority

Add validation functions to filter module
Real name	Yes	No
bwoebi (bwoebi)
colinodell (colinodell)
danack (danack)
derick (derick)
diegopires (diegopires)
guilhermeblanco (guilhermeblanco)
kguest (kguest)
levim (levim)
lstrojny (lstrojny)
marcio (marcio)
nikic (nikic)
ocramius (ocramius)
peehaa (peehaa)
santiagolizardo (santiagolizardo)
yohgaki (yohgaki)
Final result:	1	14
This poll has been closed.

Please choose targeted version for this RFC

Target version
Real name	7.1.0	7.2.0
colinodell (colinodell)
derick (derick)
kguest (kguest)
levim (levim)
ocramius (ocramius)
yohgaki (yohgaki)
Final result:	1	5
This poll has been closed.

Vote start 2016/08/15, ends ~~2016/08/22 23:59:59 UTC~~ 2016/08/29 23:59:59 UTC

Patches and Tests

https://github.com/php/php-src/pull/2048

Implementation

After the project is implemented, this section should contain

the version(s) it was merged to
a link to the git commit(s)
a link to the PHP manual entry for the feature

References

Links to external references, discussions or RFCs

DbC/Design by Contract

Rejected Features

Keep this updated with features that were discussed on the mail lists.

Table of Contents

PHP RFC: Add validation functions to filter module

Introduction

Secure coding basics

Proposal

Add validation functions

Allow multiple filters for an input

Add string validation filter

Other changes in validation filter

Discussions

Why it should be in core?

Why not compare filter_var_array() result?

Framework should do this task

Input validation and User input mistake handling difference

Backward Incompatible Changes

Proposed PHP Version(s)

RFC Impact

To SAPIs

To Existing Extensions

To Opcache

New Constants

php.ini Defaults

Open Issues

Unaffected PHP Functionality

Future Scope

Proposed Voting Choices

Patches and Tests

Implementation

References

Rejected Features