Rules

data validation rules

This submodule contains the various rules that are used to validate the contents of a file, to determine if it meets the specifications of a layout.

The rules are grouped by type:
  • cell: the most basic rule type. A cell rule generally receives a single value, but can also compare that value to other values in the same row.

  • column: a more complex rule type which requires knowledge of the entire list of values present in a single column in order to determine validity.

  • row: rules that are applied to an entire row. These are not generally meant to be extended by users of this package.

  • header: rules that extend the concept of row rules, but are meant to apply specifically to the header row of the file.

  • file: rules that apply to the file, including whether the file exists, or matches a particular naming convention.

Out of the Box

This package contains a number of field types which are already configured with rules to support that particular type of data. For example, the Text field includes rules for maximum length, minimum length (optional), and nullability.

However, this may not be enough for your purposes. Perhaps you need to ensure that your text field only includes ASCII characters. Fortunately, a rule for this already exists in the package, and the Field classes contain a convenient parameter for applying additional rules to a field:

from rumydata.rules import cell
from rumydata.field import Text

my_field = Text(
    max_length=10, min_length=5, nullable=False, rules=[cell.AsciiChar()]
)

my_field.check_cell('ABCDE')

With this new rule applied, any data that is validated against this field will be checked for minimum and maximum length, nullability, and whether the characters are all ASCII. You can always test these fields using the check_cell or check_column methods, depending on the kind of rule that you’re trying to test.

Extension

In the example above, we added a check for ASCII characters only. But what if we need a rule that doesn’t exist in the package? Let’s say that we cannot allow any vowels - A, E, I, O, U - in the cell that we are checking. This package makes it easy to develop custom rules and apply them to your fields:

from rumydata import field
from rumydata import rules

vowel_rule = rules.cell.make_static_cell_rule(
    lambda x: all([c.lower() not in ['a', 'e', 'i', 'o', 'u'] for c in x]),
    "must not have any vowels"
)

my_field = field.Text(
    max_length=10, min_length=5, nullable=False,
    rules=[rules.cell.AsciiChar(), vowel_rule]
)

my_field.check_cell('ABCDE')

With our custom vowel_rule, we will now identify any cells that contain vowels and call this out during validation.
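To see which values the predicate itself accepts, the lambda above can be exercised as a plain function, independent of rumydata. This is a minimal sketch: has_no_vowels is a name chosen here for illustration, and how the package reports a failed rule is not shown.

```python
VOWELS = set("aeiou")


def has_no_vowels(value: str) -> bool:
    """Return True when the value contains no vowels (case-insensitive)."""
    return all(ch.lower() not in VOWELS for ch in value)


# 'BCDFG' has no vowels, so the predicate accepts it.
print(has_no_vowels("BCDFG"))  # True
# 'ABCDE' contains 'A' and 'E', so the predicate rejects it.
print(has_no_vowels("ABCDE"))  # False
```

Note that the sample value 'ABCDE' used in the example above contains vowels, so it is the kind of value the custom rule is meant to flag.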

Reference

Cell

cell validation rules

These rules make up the heart of what most users of the rumydata package will be interested in when attempting to extend the out-of-the-box behavior. These rules are generally applied to a single value, in the case of the Rule class, but can also be used to compare the value in a cell to another value in the same row, in the case of the ColumnComparisonRule class.

These rules are intended to be used by adding them directly to the rules argument in the constructor of the classes in the field submodule.

class rumydata.rules.cell.NotNull

Bases: Rule

Cell not null Rule

class rumydata.rules.cell.ExactChar(exact_length)

Bases: Rule

Cell exact character length Rule

class rumydata.rules.cell.MinChar(min_length)

Bases: Rule

Cell minimum character length Rule

class rumydata.rules.cell.MaxChar(max_length)

Bases: Rule

Cell maximum character length Rule

class rumydata.rules.cell.AsciiChar

Bases: Rule

Cell contains only ASCII character Rule

class rumydata.rules.cell.NonTrim

Bases: Rule

Cell does not have whitespace characters at beginning or end

class rumydata.rules.cell.Choice(choices: List[str], case_insensitive=False)

Bases: Rule

Cell choice Rule

class rumydata.rules.cell.MinDigit(min_length)

Bases: Rule

Cell minimum digit character Rule

Check that count of characters, after removing all non-digits, meets or exceeds the specified minimum. Used to evaluate length of significant digits in numeric strings that might contain formatting.

class rumydata.rules.cell.MaxDigit(max_length)

Bases: Rule

Cell maximum digit character Rule

Check that count of characters, after removing all non-digits, is less than or equal to the specified maximum. Used to evaluate length of significant digits in numeric strings that might contain formatting.
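The digit-counting idea behind MinDigit and MaxDigit can be sketched in plain Python. This is an illustration of the described behavior, not the package's implementation; count_digits, meets_min_digit, and meets_max_digit are hypothetical helper names.

```python
import re


def count_digits(value: str) -> int:
    """Count the characters left after removing every non-digit."""
    return len(re.sub(r"\D", "", value))


def meets_min_digit(value: str, min_length: int) -> bool:
    """True when the digit count meets or exceeds the minimum."""
    return count_digits(value) >= min_length


def meets_max_digit(value: str, max_length: int) -> bool:
    """True when the digit count is less than or equal to the maximum."""
    return count_digits(value) <= max_length


# Formatting characters such as ',' and '.' are ignored: only 123450 is counted.
print(count_digits("1,234.50"))        # 6
print(meets_min_digit("1,234.50", 6))  # True
print(meets_max_digit("1,234.50", 5))  # False
```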

class rumydata.rules.cell.OnlyNumbers

Bases: Rule

Cell only digit characters Rule

class rumydata.rules.cell.NoLeadingZero

Bases: Rule

Cell no leading zero digit Rule

Ensure that there is no leading zero after removing all non-digit characters. A lone zero (0) will not raise an error.
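Taking the description above literally, the check could look like the following sketch. The function name is hypothetical and the package's actual handling of edge cases (for example, empty cells) may differ.

```python
import re


def has_no_leading_zero(value: str) -> bool:
    """Strip non-digits, then reject a leading zero unless the value is a lone 0."""
    digits = re.sub(r"\D", "", value)
    if digits == "0":
        return True  # a lone zero is allowed
    return not digits.startswith("0")


print(has_no_leading_zero("0"))       # True
print(has_no_leading_zero("007"))     # False
print(has_no_leading_zero("$1,200"))  # True
```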

class rumydata.rules.cell.CanBeFloat

Bases: Rule

Cell can be float Rule

class rumydata.rules.cell.CanBeInteger

Bases: Rule

Cell can be integer Rule

class rumydata.rules.cell.NumericDecimals(max_decimals=2)

Bases: Rule

Cell has maximum decimals Rule

class rumydata.rules.cell.LengthComparison(comparison_value)

Bases: Rule

Base length comparison Rule

class rumydata.rules.cell.LengthGT(comparison_value)

Bases: LengthComparison

Length greater than comparison Rule

class rumydata.rules.cell.LengthGTE(comparison_value)

Bases: LengthComparison

Length greater than or equal to comparison Rule

class rumydata.rules.cell.LengthET(comparison_value)

Bases: LengthComparison

Length equal to comparison Rule

class rumydata.rules.cell.LengthLTE(comparison_value)

Bases: LengthComparison

Length less than or equal to comparison Rule

class rumydata.rules.cell.LengthLT(comparison_value)

Bases: LengthComparison

Length less than comparison Rule

class rumydata.rules.cell.NumericComparison(comparison_value)

Bases: Rule

Numeric comparison base Rule

Base float value comparison class. Requires that the value can be coerced to a float value.

class rumydata.rules.cell.NumericGT(comparison_value)

Bases: NumericComparison

Numeric greater than comparison Rule

class rumydata.rules.cell.NumericGTE(comparison_value)

Bases: NumericComparison

Numeric greater than or equal to comparison Rule

class rumydata.rules.cell.NumericET(comparison_value)

Bases: NumericComparison

Numeric equal to comparison Rule

class rumydata.rules.cell.NumericLTE(comparison_value)

Bases: NumericComparison

Numeric less than or equal to comparison Rule

class rumydata.rules.cell.NumericLT(comparison_value)

Bases: NumericComparison

Numeric less than comparison Rule
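The numeric comparison rules are described as requiring that the value can be coerced to a float. A minimal sketch of that idea, using a hypothetical numeric_compare helper and the standard operator module rather than the package's own classes:

```python
import operator


def numeric_compare(value, comparison_value, op) -> bool:
    """Coerce the cell value to float, then apply the comparison operator.

    Values that cannot be coerced are treated as failing the check here;
    that choice is an assumption of this sketch.
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return op(number, comparison_value)


print(numeric_compare("10.5", 10, operator.gt))  # True
print(numeric_compare("10", 10, operator.ge))    # True
print(numeric_compare("abc", 10, operator.gt))   # False
```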

class rumydata.rules.cell.DateRule(**kwargs)

Bases: Rule

Base date Rule

class rumydata.rules.cell.CanBeDateIso(**kwargs)

Bases: DateRule

Can be ISO-8601 date Rule

class rumydata.rules.cell.DateGT(comparison_value, date_format='%Y-%m-%d', **kwargs)

Bases: DateComparisonRule

Date greater than comparison Rule

class rumydata.rules.cell.DateGTE(comparison_value, date_format='%Y-%m-%d', **kwargs)

Bases: DateComparisonRule

Date greater than or equal to comparison Rule

class rumydata.rules.cell.DateET(comparison_value, date_format='%Y-%m-%d', **kwargs)

Bases: DateComparisonRule

Date equal to comparison Rule

class rumydata.rules.cell.DateLTE(comparison_value, date_format='%Y-%m-%d', **kwargs)

Bases: DateComparisonRule

Date less than or equal to comparison Rule

class rumydata.rules.cell.DateLT(comparison_value, date_format='%Y-%m-%d', **kwargs)

Bases: DateComparisonRule

Date less than comparison Rule
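The date comparison rules share a date_format parameter that defaults to '%Y-%m-%d'. The sketch below shows how such a comparison can be built on datetime.strptime. It assumes that the cell value and the comparison value are both strings in the same format, which the signatures above do not confirm; date_is_greater_than is a hypothetical name.

```python
from datetime import datetime


def date_is_greater_than(value: str, comparison_value: str,
                         date_format: str = "%Y-%m-%d") -> bool:
    """Parse both strings with date_format and compare the resulting dates."""
    try:
        cell_date = datetime.strptime(value, date_format)
    except ValueError:
        return False  # the cell value does not match the expected format
    return cell_date > datetime.strptime(comparison_value, date_format)


print(date_is_greater_than("2024-03-02", "2024-03-01"))  # True
print(date_is_greater_than("2024-03-01", "2024-03-01"))  # False
print(date_is_greater_than("03/02/2024", "03/01/2024", "%m/%d/%Y"))  # True
```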

class rumydata.rules.cell.GreaterThanColumn(compare_to: str | List[str])

Bases: ColumnComparisonRule

Greater than compared column Rule

class rumydata.rules.cell.NotNullIfCompare(compare_to: str | List)

Bases: ColumnComparisonRule

class rumydata.rules.cell.GreaterThanOrEqualColumn(compare_to: str | List[str])

Bases: ColumnComparisonRule

Greater than or equal to compared column Rule

class rumydata.rules.cell.OtherMustExist(compare_to: str | List[str])

Bases: ColumnComparisonRule

class rumydata.rules.cell.OtherCantExist(compare_to: str | List[str])

Bases: ColumnComparisonRule

class rumydata.rules.cell.LessThanColumn(compare_to: str | List[str])

Bases: ColumnComparisonRule

Less than compared column Rule

class rumydata.rules.cell.LessThanOrEqualColumn(compare_to: str | List[str])

Bases: ColumnComparisonRule

Less than or equal to compared column Rule

class rumydata.rules.cell.NotNullIfOtherEquals(compare_to: str, values: str | List[str])

Bases: NotNullIfCompare

Cell cannot be null if other has specified value(s)
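The conditional logic described for NotNullIfOtherEquals can be sketched with a row represented as a dict. How rumydata passes row data to a rule is not shown in this document, so the dict representation, the treatment of an empty string as null, and the function name are all assumptions.

```python
from typing import Dict, List, Union


def not_null_if_other_equals(row: Dict[str, str], column: str,
                             compare_to: str,
                             values: Union[str, List[str]]) -> bool:
    """The checked column must be filled when compare_to holds a listed value."""
    if isinstance(values, str):
        values = [values]
    if row.get(compare_to) in values:
        return row.get(column, "") != ""
    return True  # the condition does not apply, so any value is acceptable


closed_without_date = {"status": "closed", "closed_date": ""}
open_without_date = {"status": "open", "closed_date": ""}

print(not_null_if_other_equals(closed_without_date, "closed_date", "status", "closed"))  # False
print(not_null_if_other_equals(open_without_date, "closed_date", "status", "closed"))    # True
```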

class rumydata.rules.cell.NoScientific

Bases: Rule

Cell no scientific notation.

Ensure that there are no scientific notation characters in the cell.

rumydata.rules.cell.make_static_cell_rule(func, assertion) → Rule

Static cell rule factory

Return a factory-generated Rule class. The function used by the rule must directly evaluate a single positional argument (i.e. x, but not x and y). Because the Rule cannot be passed a value on initialization, neither the evaluator nor the explain method in the returned class can be dynamic.

Parameters:
  • func – a function which takes a single positional argument

  • assertion – a string describing the condition which must be met in order for the function to return True

Returns:

a rumydata.rules.cell.Rule
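The factory pattern described above can be illustrated without the package. In this sketch, SketchRule and make_static_rule_sketch are hypothetical stand-ins, and the evaluate and explain method names are assumptions; they are not rumydata's Rule class or its internals. The point is only that the function and the assertion text are fixed when the class is generated, so neither can vary per instance.

```python
from typing import Callable, Type


class SketchRule:
    """Hypothetical stand-in for a rule base class."""

    def evaluate(self, value) -> bool:
        raise NotImplementedError

    def explain(self) -> str:
        raise NotImplementedError


def make_static_rule_sketch(func: Callable[[str], bool],
                            assertion: str) -> Type[SketchRule]:
    """Generate a rule class whose check and message are fixed at creation."""

    class FactoryRule(SketchRule):
        def evaluate(self, value) -> bool:
            return bool(func(value))  # func takes one positional argument

        def explain(self) -> str:
            return assertion  # static text, identical for every instance

    return FactoryRule


NoVowels = make_static_rule_sketch(
    lambda x: not (set(x.lower()) & set("aeiou")),
    "must not have any vowels",
)
rule = NoVowels()
print(rule.evaluate("BCD"), rule.explain())
```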

Column

column validation rules

These rules capture a common, but much more complex use case for data validation, when it is necessary to compare the values of a single column across multiple rows. The most intuitive example of this is the Unique rule, which requires that every value in a column (excepting blanks) be unique/distinct.

These rules are intended to be used by adding them directly to the rules argument in the constructor of the classes in the field submodule.

Users of this package should be aware that introducing a column rule can dramatically increase the resources required to perform validation. If there are no column validation rules present in a Layout, then each row will be discarded from memory after validation is complete. However, each field that has one or more column rules will require the entire column to be available for validation. In small data sets the impact will be minor, but in larger data sets memory use and validation time can grow noticeably.
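The memory trade-off can be made concrete with a small streaming validator written against the standard csv module. This is a sketch of the general technique, not rumydata's implementation; validate_stream and its arguments are hypothetical. Cell checks look at one value and then let the row go, while any column with a column check has all of its values retained until the end of the file.

```python
import csv
import io
from collections import defaultdict


def validate_stream(rows, cell_checks, column_checks):
    """Validate rows one at a time, retaining values only where required.

    cell_checks:   {column name: predicate on a single value}
    column_checks: {column name: predicate on the full list of values}
    """
    errors = []
    retained = defaultdict(list)  # grows with file size, per checked column
    for row_number, row in enumerate(rows, start=1):
        for name, check in cell_checks.items():
            if not check(row[name]):
                errors.append(f"row {row_number}, {name}: cell check failed")
        for name in column_checks:
            retained[name].append(row[name])
        # nothing else from the row is kept once this iteration ends
    for name, check in column_checks.items():
        if not check(retained[name]):
            errors.append(f"{name}: column check failed")
    return errors


data = "id,name\n1,a\n2,b\n1,c\n"
errors = validate_stream(
    csv.DictReader(io.StringIO(data)),
    cell_checks={"name": lambda v: v != ""},
    column_checks={"id": lambda vs: len(vs) == len(set(vs))},
)
print(errors)  # ['id: column check failed']
```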

class rumydata.rules.column.Unique

Bases: Rule

Column values unique Rule
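A uniqueness check that ignores blanks, as described for the Unique rule in the introduction to this section, can be sketched as follows. Treating the empty string as the only blank value is an assumption of this sketch, and duplicated_values is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_values(values: Iterable[str]) -> List[str]:
    """Return the non-blank values that occur more than once, sorted."""
    counts = Counter(v for v in values if v != "")
    return sorted(v for v, n in counts.items() if n > 1)


# Repeated blanks are not reported; the repeated 'a' is.
print(duplicated_values(["a", "b", "", "a", ""]))  # ['a']
print(duplicated_values(["x", "y", ""]))           # []
```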