Rules
data validation rules
This submodule contains the various rules that are used to validate the contents of a file, to determine if it meets the specifications of a layout.
The rules are divided into the following types:
- cell: the most basic rule type. A cell rule generally receives a single value, but can also compare that value to other values in the same row.
- column: a more complex rule type, which requires knowledge of the entire list of values present in a single column in order to determine validity.
- row: rules that are applied to an entire row. These are not generally meant to be extended by users of this package.
- header: rules that extend the concept of row rules, but are meant to apply specifically to the header row of the file.
- file: rules that apply to the file, including whether the file exists or matches a particular naming convention.
Out of the Box
This package contains a number of field types which are already configured with rules to support that particular type of data. For example, the Text field includes rules for maximum length, minimum length (optional), and nullability.
However, this may not be enough for your purposes. Perhaps you need to ensure that your text field only includes ASCII characters. Fortunately, a rule for this already exists in the package, and the Field classes contain a convenient parameter for applying additional rules to a field:
from rumydata.rules import cell
from rumydata.field import Text
my_field = Text(
    max_length=10, min_length=5, nullable=False, rules=[cell.AsciiChar()]
)
my_field.check_cell('ABCDE')
With this new rule applied, any data that is validated against this field will be checked for minimum and maximum length, nullability, and whether the characters are all ASCII. You can always test these fields using the check_cell or check_column methods, depending upon the kind of rule that you’re trying to test.
Extension
In the example above, we added a check for ASCII characters only. But what if we need a rule that doesn’t exist in the package? Let’s say that we cannot allow any vowels - A, E, I, O, U - in the cell that we are checking. This package makes it easy to develop custom rules and apply them to your fields:
from rumydata import field
from rumydata import rules
vowel_rule = rules.cell.make_static_cell_rule(
    lambda x: all([c.lower() not in ['a', 'e', 'i', 'o', 'u'] for c in x]),
    "must not have any vowels"
)
my_field = field.Text(
    max_length=10, min_length=5, nullable=False,
    rules=[rules.cell.AsciiChar(), vowel_rule]
)
my_field.check_cell('ABCDE')
With our custom vowel_rule, validation will now flag any cell that contains a vowel.
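The function handed to make_static_cell_rule can be exercised on its own before it is attached to a field. The sketch below is plain Python and does not import rumydata; the name has_no_vowels is hypothetical and chosen only for illustration. It shows that 'ABCDE', the value checked above, does not satisfy the predicate because it contains A and E.

```python
VOWELS = frozenset("aeiou")


def has_no_vowels(value: str) -> bool:
    """Return True when no character in value is a vowel (case-insensitive)."""
    return all(c.lower() not in VOWELS for c in value)


print(has_no_vowels("BCDFG"))  # True
print(has_no_vowels("ABCDE"))  # False
```

Testing the predicate in isolation like this makes it easier to tell a faulty function apart from a misconfigured field.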
Reference
Cell
cell validation rules
These rules make up the heart of what most users of the rumydata package will be interested in when attempting to extend the out-of-the-box behavior. They are generally applied to a single value (the Rule class), but can also compare the value in a cell to another value in the same row (the ColumnComparisonRule class).
These rules are intended to be used by adding them directly to the rules argument in the constructor of the classes in the field submodule.
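To make the distinction between the two kinds of cell rule concrete, the following plain-Python sketch contrasts a check that needs only the cell's own value with one that also needs another value from the same row. It does not use rumydata classes: the function names, the dict-based row, and the choice to compare values numerically are all assumptions made for illustration.

```python
from typing import Dict


def max_char_ok(value: str, max_length: int) -> bool:
    """Single-value check: only the cell itself is needed."""
    return len(value) <= max_length


def greater_than_column_ok(row: Dict[str, str], column: str, compare_to: str) -> bool:
    """Same-row comparison: the cell is judged relative to another cell."""
    # Assumption: both cells hold numeric strings and are compared as floats.
    return float(row[column]) > float(row[compare_to])


row = {"start": "3", "end": "7"}
print(max_char_ok(row["end"], 5))                   # True
print(greater_than_column_ok(row, "end", "start"))  # True
```

In both cases a single row is enough to reach a verdict, which is what separates these rules from the column rules described later.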
- class rumydata.rules.cell.NotNull
Bases: Rule
Cell not null Rule
- class rumydata.rules.cell.ExactChar(exact_length)
Bases: Rule
Cell exact character length Rule
- class rumydata.rules.cell.MinChar(min_length)
Bases: Rule
Cell minimum character length Rule
- class rumydata.rules.cell.MaxChar(max_length)
Bases: Rule
Cell maximum character length Rule
- class rumydata.rules.cell.AsciiChar
Bases: Rule
Cell contains only ASCII character Rule
- class rumydata.rules.cell.NonTrim
Bases: Rule
Cell does not have whitespace characters at beginning or end
- class rumydata.rules.cell.Choice(choices: List[str], case_insensitive=False)
Bases: Rule
Cell choice Rule
- class rumydata.rules.cell.MinDigit(min_length)
Bases: Rule
Cell minimum digit character Rule
Check that count of characters, after removing all non-digits, meets or exceeds the specified minimum. Used to evaluate length of significant digits in numeric strings that might contain formatting.
- class rumydata.rules.cell.MaxDigit(max_length)
Bases: Rule
Cell maximum digit character Rule
Check that count of characters, after removing all non-digits, is less than or equal to the specified maximum. Used to evaluate length of significant digits in numeric strings that might contain formatting.
- class rumydata.rules.cell.OnlyNumbers
Bases: Rule
Cell only digit characters Rule
- class rumydata.rules.cell.NoLeadingZero
Bases: Rule
Cell no leading zero digit Rule
Ensure that there is no leading zero after removing all non-digit characters. A lone zero (0) will not raise an error.
- class rumydata.rules.cell.CanBeFloat
Bases: Rule
Cell can be float Rule
- class rumydata.rules.cell.CanBeInteger
Bases: Rule
Cell can be integer Rule
- class rumydata.rules.cell.NumericDecimals(max_decimals=2)
Bases: Rule
Cell has maximum decimals Rule
- class rumydata.rules.cell.LengthComparison(comparison_value)
Bases: Rule
Base length comparison Rule
- class rumydata.rules.cell.LengthGT(comparison_value)
Bases: LengthComparison
Length greater than comparison Rule
- class rumydata.rules.cell.LengthGTE(comparison_value)
Bases: LengthComparison
Length greater than or equal to comparison Rule
- class rumydata.rules.cell.LengthET(comparison_value)
Bases: LengthComparison
Length equal to comparison Rule
- class rumydata.rules.cell.LengthLTE(comparison_value)
Bases: LengthComparison
Length less than or equal to comparison Rule
- class rumydata.rules.cell.LengthLT(comparison_value)
Bases: LengthComparison
Length less than comparison Rule
- class rumydata.rules.cell.NumericComparison(comparison_value)
Bases: Rule
Numeric comparison base Rule
Base float value comparison class. Requires that the value can be coerced to a float value.
- class rumydata.rules.cell.NumericGT(comparison_value)
Bases: NumericComparison
Numeric greater than comparison Rule
- class rumydata.rules.cell.NumericGTE(comparison_value)
Bases: NumericComparison
Numeric greater than or equal to comparison Rule
- class rumydata.rules.cell.NumericET(comparison_value)
Bases: NumericComparison
Numeric equal to comparison Rule
- class rumydata.rules.cell.NumericLTE(comparison_value)
Bases: NumericComparison
Numeric less than or equal to comparison Rule
- class rumydata.rules.cell.NumericLT(comparison_value)
Bases: NumericComparison
Numeric less than comparison Rule
- class rumydata.rules.cell.DateRule(**kwargs)
Bases: Rule
Base date Rule
- class rumydata.rules.cell.DateGT(comparison_value, date_format='%Y-%m-%d', **kwargs)
Bases: DateComparisonRule
Date greater than comparison Rule
- class rumydata.rules.cell.DateGTE(comparison_value, date_format='%Y-%m-%d', **kwargs)
Bases: DateComparisonRule
Date greater than or equal to comparison Rule
- class rumydata.rules.cell.DateET(comparison_value, date_format='%Y-%m-%d', **kwargs)
Bases: DateComparisonRule
Date equal to comparison Rule
- class rumydata.rules.cell.DateLTE(comparison_value, date_format='%Y-%m-%d', **kwargs)
Bases: DateComparisonRule
Date less than or equal to comparison Rule
- class rumydata.rules.cell.DateLT(comparison_value, date_format='%Y-%m-%d', **kwargs)
Bases: DateComparisonRule
Date less than comparison Rule
- class rumydata.rules.cell.GreaterThanColumn(compare_to: str | List[str])
Bases: ColumnComparisonRule
Greater than compared column Rule
- class rumydata.rules.cell.NotNullIfCompare(compare_to: str | List[str])
Bases: ColumnComparisonRule
- class rumydata.rules.cell.GreaterThanOrEqualColumn(compare_to: str | List[str])
Bases: ColumnComparisonRule
Greater than or equal to compared column Rule
- class rumydata.rules.cell.OtherMustExist(compare_to: str | List[str])
Bases: ColumnComparisonRule
- class rumydata.rules.cell.OtherCantExist(compare_to: str | List[str])
Bases: ColumnComparisonRule
- class rumydata.rules.cell.LessThanColumn(compare_to: str | List[str])
Bases: ColumnComparisonRule
Less than compared column Rule
- class rumydata.rules.cell.LessThanOrEqualColumn(compare_to: str | List[str])
Bases: ColumnComparisonRule
Less than or equal to compared column Rule
- class rumydata.rules.cell.NotNullIfOtherEquals(compare_to: str, values: str | List[str])
Bases: NotNullIfCompare
Cell cannot be null if other has specified value(s)
- class rumydata.rules.cell.NoScientific
Bases: Rule
Cell no scientific notation.
Ensure that there are no scientific notation characters in the cell.
- rumydata.rules.cell.make_static_cell_rule(func, assertion) → Rule
Static cell rule factory
Return a factory-generated Rule class. The function used by the rule must directly evaluate a single positional argument (i.e. x, but not x and y). Because the Rule cannot be passed a value on initialization, neither the evaluator nor the explain method in the returned class can be dynamic.
- Parameters:
func – a function which takes a single positional argument
assertion – a string describing the condition which must be met in order for the function to return True
- Returns:
a rumydata.rules.cell.Rule
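The static nature of a factory-built rule can be illustrated with a small stand-alone sketch. This is not rumydata's implementation: make_static_rule, StaticRule, evaluate and explain are hypothetical names, and the sketch only mirrors the idea of pairing a one-argument function with a fixed assertion string. The example predicate counts digits after stripping every non-digit character, in the spirit of the digit rules listed above.

```python
import re
from typing import Callable


def make_static_rule(func: Callable[[str], bool], assertion: str):
    """Hypothetical factory: build a rule class from a one-argument function."""

    class StaticRule:
        def evaluate(self, value: str) -> bool:
            # The check is fixed when the class is created; the constructor
            # takes no arguments, so nothing can be tuned per instance.
            return func(value)

        def explain(self) -> str:
            # The message is a fixed string and cannot mention run-time values.
            return assertion

    return StaticRule


def at_least_ten_digits(value: str) -> bool:
    """Count the characters left after removing every non-digit."""
    return len(re.sub(r"\D", "", value)) >= 10


TenDigits = make_static_rule(at_least_ten_digits, "must contain at least 10 digits")
rule = TenDigits()
print(rule.evaluate("(555) 123-4567"))  # True
print(rule.explain())                   # must contain at least 10 digits
```

Because the threshold of 10 is baked into the function, a different threshold needs a different function. A parameterised rule such as MinDigit(min_length) takes its value in the constructor instead.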
Column
column validation rules
These rules capture a common, but much more complex use case for data validation, when it is necessary to compare the values of a single column across multiple rows. The most intuitive example of this is the Unique rule, which requires that every value in a column (excepting blanks) be unique/distinct.
These rules are intended to be used by adding them directly to the rules argument in the constructor of the classes in the field submodule.
Users of this package should be aware that introducing a column rule can dramatically increase the resources required to perform validation. If there are no column validation rules present in a Layout, then each row is discarded from memory after its validation is complete. However, each field that has one or more column rules requires the entire column to be available for validation. In small data sets the impact will be minor, but in larger data sets it can noticeably degrade performance.
- class rumydata.rules.column.Unique
Bases: Rule
Column values unique Rule
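The memory cost described above follows from the shape of a uniqueness check itself. The sketch below is a plain-Python illustration rather than the library's code; the function name duplicate_values and the choice to treat empty strings as blanks are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_values(column: Iterable[str]) -> List[str]:
    """Return the non-blank values that occur more than once in the column."""
    # Blanks (assumed here to be empty strings) are excluded from the count.
    counts = Counter(v for v in column if v != "")
    return [value for value, n in counts.items() if n > 1]


ids = ["a1", "b2", "", "a1", "", "c3"]
print(duplicate_values(ids))  # ['a1']
```

Every distinct value has to be retained until the last row has been read, which is why a column rule cannot discard rows as it goes the way a cell rule can.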