Table of Contents

PHP RFC: Add validation functions to filter module

Introduction

Input data validation is the most important security measure in software security.

These recommends input validation by whitelist and accept only valid one.

We have filter module for this purpose, but it has problems

Even if filter module is for better security, there are users misuse the module and result in bad consequences.

This RFC encourage users to proper secure coding by adding more suitable functions and filter for input validations.

Secure coding basics

A fundamental idea of secure coding is input and output control. Proposed new functions are supposed to use for input data validations, not for input error check in business logic.

Input data validation is better to think as input data assertion which should never fail under normal circumstances. Nature of input validation is differ from wrong input data handling from users which would happens normal conditions. User input mistakes, logically inconsistent data, e.g. date is past date for reservations, should not handled by input data validation part in general, but in business logic.

NOTE: Input data validation is runtime assertion should never fail under normal circumstances. It should not handle user input mistakes nor logical inconsistencies.

WARNING: Input and output handling is independent. Output code is responsible to make sure output is safe for external computer/software. Output data should be safe regardless of input validation. i.e. Programmer must escape/use secure API/validate (or Make sure 120% safety of the data) all output data always.

What programmers should do for input validation(assertion) is to accept only valid inputs. Followings are blacklist of inputs, but programmers must think of whitelist (what's to accept, reject anything else), NOT blacklist, when they write input validation code.

Not all of them can be validated at input validation. How/what input could be validated is depended on input source spec. For example, if you do client side validation in your system, you can validate strings strictly. e.g. Date string. If you don't do client side validation at all and using plain <input> for date, your validation code cannot do much. However, a string over 100 chars, string contains control char(s) or broken char encoding for date is good enough to be rejected as a invalid input.

Dividing input data validation and user input mistake handling in business logic makes software simpler and easier to maintain. Input data format is more stable than business logic by nature. e.g. Object interface is more stable than object implementation. Simplicity and maintainability is important for security also.

SUMMARY: Input data validation should accept only valid and possible inputs. If not, reject it and terminate program. There is no point to keep running program with invalid input data that cannot work correctly. Logic should take care of the rest that input validation cannot check.

The most important input validation is application level validation, but input validation is not limited to it.

Please refer to mentioned secure cording practices, Design by Contract(DbC) for more details. DbC requires proper runtime input validations. Proposed validation functions can be used for this purpose.

Proposal

Followings are filter module improvement proposals.

Add validation functions

array filter_require_var_array ( array $data , mixed $definition [, int $function_options ] )
mixed filter_require_var ( mixed $variable , int $filter [, mixed $options ] )
array filter_require_input_array ( int $type , mixed $definition [, int $function_options ] )
mixed filter_require_input ( int $type , string $variable_name , int $filter [, mixed $options ] )

They are almost the same as filter_var/input*() functions. Key differences compared to other filter_var/input*() functions are:

NOTE: Main motivation of adding these functions is “filter_var_array()/filter_input_array() is not suitable for strict input validation”. See Discussion section.

bool filter_check_definition (array $definition_of_array_value_filter_and_validation)

Filter definition error is silently ignored for performance reason. i.e. It does not check possible typo and malformed elements. Definition error could be fatal bug. This function provides check feature finds typo, format error.

Limitations and Notes:

Allow multiple filters for an input

Example is easier to understand. New filter module allows multiple filters for both validation/sanitize filters.

<?php
// Following initialization is to illustrate exception handling.
$myinput = array(
    'some' => 'inputs like $_GET, $_POST, $_COOKIE, $_FILES',
    'or' => 'could be return value from previous validation',
);
 
// Reusable input element validation rule.
// NOTE: This has nonsense rule for usage illustration purpose.
$date_spec =
    array(
        // New filter module allows multiple filters and options as follows.
        // Array elements are evaluated in order. Non array spec is evaluated last.
        // Older implementation ignores this kind of spec silently.
        array( // This is evaluated first.
            'filter'    => FILTER_VALIDATE_STRING,
            'options'   => array('min_bytes' => 10, 'max_bytes' => 10, 'encoding' => FILTER_STRING_ENCODING_PASS)
        ),
        array(
            'filter' => FILTER_VALIDATE_REGEXP,
            'options' => array('regexp' => '/^[0-9]{4}-[0-9]{2}-[0-9]{2}$/')
        ),
        array(
            'filter' => FILTER_VALIDATE_CALLBAK,
            'options' => array('callback' => 'check_date_and_raise_exception_for_invalid()'),
        ),
        'filter' => FILTER_UNSAFE_RAW, // Evaluated last. Does nothing. It's here for an example.
    );
 
 
$definitions = array(
    'date'    => $date_spec,
    'time'    => $time_spec, // Other than 'date' spec, element spec definition is omitted in this example
    'id'      => $id_spec,
    'isbn'    => $isbn_spec,
    'phone'   => $phone_spec,
    'zipcode' => $zipcode_spec,
    'country' => $country_spec,
    'age'     => $age_spec,
    'first_name' => $name_spec,
    'last_name' => $name_spec,
    'address' => $address_spec,
    'filename'=> $filename_spec,
    // and so on
);
 
// Throws FilterValidateException for invalid inputs.
try {
    $myinputs = filter_require_var_array($data, $definitions);
    // NOTE: If you need returned array value, it MUST be inside try block
    //       or catch block MUST terminate execution. Otherwise, returned value
    //       may contain irrelevant values.
    var_dump($myinputs);
} catch (FilterValidateException $e) {
    var_dump($e->getMessage());
    die('Invalid input detected!'); // Should terminate execution when input validation fails
}
// If validation exception is raised and catch block didn't terminate script,
// $myinputs will have irrelevant value from previous initialization.
// WARNING: When validation exception is raised, program MUST NOT reach here.
// If you properly handle validation exceptions, i.e. terminate execution,
// then you can use $myinputs safely outside of try block.
var_dump($myinputs);

Add string validation filter

Add missing string validation filter (FILTER_VALIDATE_STRING). This filter has conservative default. i.e. Strict validation by default.

Features:

Limitations:

Other changes in validation filter

NOTE: These changes are only applicable when new filter_require*() functions are used

Discussions

Why it should be in core?

There are users who misuse current filter module for “secure coding” input validations.

Input validation is the most important security feature. PHP should provide easy to use/reliable/fast input validation feature. We should encourage strict input validation that rejects invalid(attacker) inputs by having stricter input validation features rather than filter(convert) and accept.

This proposal reduces filter module misuse which is built always by default.

Why not compare filter_var_array() result?

Following code may seem to work, but it would not.

$ret = filter_var_array($arr, $validation_spec);
if ($ret != $arr) {
  die('Input does not validate');
}

For these reasons, comparing original and return(filtered) value is not suitable for strict input validation.

Framework should do this task

There are several reasons doing this by PHP itself.

  1. Current validation filters and filter functions are not suitable for input data validation. (Even misleading)
  2. Encourage users to do secure coding by having proper feature. i.e. Validate and accept only valid inputs.
  3. Simple apps should be able to be written by PHP's basic feature. i.e. Input data validation is mandatory for secure coding.
  4. This RFC makes easy to introduce input data validation for any PHP apps. i.e. There are many framework less codes and/or apps built with micro/light framework w/o input validation feature.
  5. It's fast. i.e. Simple array is used for validation spec definition = fast.
  6. This kind of feature is required for DbC. i.e. https://wiki.php.net/rfc/introduce_design_by_contract

Frameworks may implement their own validators with more features, but PHP should have its own usable validator because this feature is mandatory.

PHP is a first choice for Web development because PHP can write simple web apps by simple codes. PHP should try to keep this aspect as much as possible, and try to provide mandatory and/or best practice features. Otherwise, PHP would not make much difference to other languages that require Web application frameworks even for a simple web apps.

Input validation and User input mistake handling difference

Although following reply is long, but it's worthwhile mentioning here.

Hi Stas,

On Mon, Aug 15, 2016 at 2:17 PM, Stanislav Malyshev <smalyshev@gmail.com> wrote:
>> It seems there is misunderstanding.
>> These new functions are intended for "secure coding input validation" that
>> should never fail. It means something unexpected in input data that
>> cannot/shouldn't keep program running. Why do you need to parse
>> message?
>
> I think the problem here is as follows: assume you accept use input. You
> want it to conform to some set of rules. If it does not, you may want to
> inform the user that the input is wrong, in an informative way. Now, if
> you say these functions "should never fail", it implies that before
> them, there would be other functions filtering user input (because user
> input could always violate whatever rules you'd have) - and then the
> question is, would you really want *two* sets of validators? You'd
> probably want one.
> Now, when you have one, you probably want it to validate the data and
> return some information that would be useful for informing the user what
> has gone wrong. That seems to be the issue here.
> I do think having strong input validation is a good thing. However, we'd
> also need to have them in a way that would make them useful in above
> scenario - otherwise people would avoid them because they fail "too
> hard" and the app does not retain enough control over the outcome.

I think this discussion relates to following questions.
I'll try to explain there.

>
>> There is misunderstanding on this.
>> As I wrote explicitly in the RFC, input validation and user input
>> mistakes must be handled differently.
>>
>> "The input validation (or think it as assertion or requirement) error"
>> that this RFC is dealing, is should never happen conditions (or think
>> it as contract should never fail).
>
> This is what I'm not sure I understand - when this approach would be
> used? I.e. if I get data from the user, I surely can not claim I can
> impose any conditions on the data that would never fail. Is it assumed
> I'd pre-filter the data before passing it to this filter?

How and what rules could be imposed to inputs varies depending on
what kind of data should be sent from outsides of a software including
human users.

Let's say your app validate user written/chosen "Date" on client side by
JavaScript. Then browser must send whatever "Date" format you impose
to client. It may be "YYYYMMDD", for example.

Then programer should not accept "Date" format other than "YYYYMMDD"
because other format is invalid. Accepting format other than "YYYYMMDD"
does only bad and increase risks of program malfunctioning. i.e. All kinds
of injections like JavaScript, SQL, Null char, Newline, etc.

The basic idea of secure coding input validation is to remove all unnecessary
security risks at "Input Validation".

Even when "Date" field is plain <input> that user can write any chars,
Null char, CR/LF, TAB or any CNTRL chars should not be in there. There will
be no users type in 100 chars for "Date" field unless they were trying to tamper
application.

"Input validation" should reject all of them and does not have to inform users
(attackers) to "there is invalid input". If you need to tell  legitimate users
"There is invalid input", then it should be treated by "Business logic", not by
"Input validation".

>
>> The point of having the input validation is accept only inputs that
>> program expects and can work correctly. Accepting unexpected
>> data that program cannot work correctly is pointless.
>
> Well, that depends on what you mean by "accepting". The program should
> exhibit sane behavior (i.e., useful error message, not whitescreen or
> something like that) on bad input. That behavior can be different -
> i.e., if you are given wrong password, you shouldn't be too helpful and
> say "this password is wrong, the right password is this: ...." (you'd
> laugh but there *was* a real application doing this, no, I have no idea
> what the developers were thinking :) but at least you could say
> "authentication details are wrong".

User authentication could do the similar to "Date" field for "User name"
and "Password".

"User name" and "Password" shouldn't have CNTRL chars or invalid char
encoding. Even when fields are plain <input>, there shouldn't be 500 chars
long inputs for them.

Anything else for "User name" and "Password" should be handled by
"Business logic". Logic part should display nice and proper error messages
like

 - User name is too long for 100 chars name.
 - Password is too long for 100 chars password.
 - User name and/or Password is wrong and failed to authenticate.


>> Don't misunderstood me. I'm not saying "You should reject user input
>> mistakes".
>> "User input mistakes" and "input validation error" is totally different
>> error.
>
> Here, again, I am not sure I understand the difference.

The reason why I propose to divide input error checks into "Input validation"
and "Business logic" is for simplicity and maintainability.

"Input validation" should be done not only for human entered inputs, but
also automatically generated inputs by system.

Generally speaking, developers should not accept request that has

Invalid browser headers:
 - Invalid REFERER contains Illegal/CTNRL chars and/or too many chars.
 - Invalid ACCEPT-CHARSET contains Illegal/CNTRL chars and/or too many chars.
 - Invalid ACCEPT-ENCODING contains Illegal/CNTRL chars and/or too many chars.
 - Invalid ACCEPT-LANGUAGE contains Illegal/CNTRL chars and/or too many chars.
 - and so on.

Invalid POST/GET request:
 - Lacks required field by your program. e.g. If you set CSRF token
for POST always, but it's missing.
 - Multi page form inputs and lack/have invalid data that should have
been validated previously. Note: there is design choice for this
where/how to deal with invalid inputs.
 - Program written data is invalid. e.g.
//php.net/show_bug.php?id=[string contains CNTRL chars and/or 100
chars or more]
 - $_POST/$_GET has more than 20 elements. Note: most apps/code would
not have this many elements.

Invalid COOKIE:
 - $_COOKIE has more than 20 elements. Note: normal apps would not
have this many cookies.
 - Lacks required field by your program.
 - Invalid chars. e.g. CNTRL chars.

All of these have history of abuse by attackers and programs should not
accept them. Please note that secure coding requires to output
securely. Input validation and output sanitization should be treated
as individual task. e.g. Escape all variables at "Output" code when
you output something to other software. Never assume, "This var is
validated at input, so it is safe without escaping."

It's developer's choice how to validate inputs, e.g. they don't use
"CONNECTION" HTTP header at all and don't care, but all of secure
coding related guides that I know of recommends/requires to validate
"all inputs".

Validating all inputs that are irrelevant to "Business logic" makes
programs complicated and hard to maintain. Broken char encoding, too
long/short, CNTRL chars for <form> inputs are better to handled by
"Input validation" because the same thing might be done by different
<form>s repeatedly.

There are many possibility for software design. This RFC is designed
to encourage to do certain validation. However, this RFC does not
impose developers to do certain validation, but provides tools that
are needed for validations.

I would not encourage users to disable exception from
filter_require_var()/filter_require_var_array(), but I've changed them
not to raise exception optionally as a last minute change. This allows
developers to use new validator for wider purposes.

Regards,

P.S. I'll extend vote period because there is ongoing discussion.

BTW, ISO 27000/ISMS requires/recommends proposed input validation.
Latest ISO 27000 mentioned as "adopt secure programming". Older
ISO 27000 explained how to validate inputs. New ISO 27000 removed
detailed input validation method explanation because secure programming
is widely adopted and standardized.

--
Yasuo Ohgaki
yohgaki@ohgaki.net

Backward Incompatible Changes

None. filter_var/input*() functions are not changed at all.

Proposed PHP Version(s)

7.1.0 or 7.2.0

RFC Impact

To SAPIs

None

To Existing Extensions

None

To Opcache

None

New Constants

String validation filter flags

Bool validation filter flags (filter_var/input*() functions are not affected. It allow empty always)

Function behavior options

php.ini Defaults

No changes

Open Issues

None

Unaffected PHP Functionality

Existing filter features are not changed at all.

Future Scope

Proposed Voting Choices

This project requires a 2/3 majority

Add validation functions to filter module
Real name Yes No
bwoebi (bwoebi)  
colinodell (colinodell)  
danack (danack)  
derick (derick)  
diegopires (diegopires)  
guilhermeblanco (guilhermeblanco)  
kguest (kguest)  
levim (levim)  
lstrojny (lstrojny)  
marcio (marcio)  
nikic (nikic)  
ocramius (ocramius)  
peehaa (peehaa)  
santiagolizardo (santiagolizardo)  
yohgaki (yohgaki)  
Final result: 1 14
This poll has been closed.

Please choose targeted version for this RFC

Target version
Real name 7.1.0 7.2.0
colinodell (colinodell)  
derick (derick)  
kguest (kguest)  
levim (levim)  
ocramius (ocramius)  
yohgaki (yohgaki)  
Final result: 1 5
This poll has been closed.

Vote start 2016/08/15, ends 2016/08/22 23:59:59 UTC 2016/08/29 23:59:59 UTC

Patches and Tests

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

References

Links to external references, discussions or RFCs

Rejected Features

Keep this updated with features that were discussed on the mail lists.