Maximilian Stark (mail@dakror.de), SS2020
As of: 30.07.2020
Advanced Topics of Software Testing
Test Selection
- Similar techniques
- Inspections
- Walkthroughs
- Reviews
- Fault: concrete mistake in the software causing errors
- Error: State deviation from ideal
- Failure: Behavior deviation resulting from errors
- Good tests: Ability to detect likely failures with good cost-effectiveness
- Grouping of input equivalence classes
- based on control flow
- based on data flow
- based on stochastic profiles
- Limit Testing
Testing in Numbers
- Most bugs in requirements
- Most detection very late in integration process
- From 100 Faults / KLOC at development to 1-2 at release after testing
- Code reading as most effective fault detection
- Decreasing proportion of errors during construction with increasing project size
- Heavily skewed fault density (Pareto: 80% of faults from 20% of modules)
Types of Testing
- Unit testing
- Integration testing
- Caller-replacement: Driver
- Callee-replacement: Stub
- Top down: only stubs
- Bottom up: only drivers
- System testing
- Acceptance testing
- Black-Box vs White-Box testing
- BB: Pure functionality testing
- WB: System under test structure testing
- Robustness testing
- Performance testing
- Regression testing
- E2E testing (GUI)
- Testing pyramid
- Ordered by increasing cost and decreasing execution speed
- Unit
- Integration
- E2E
- Manual
- Big-Bang Testing
- No incremental integration: testing of individual modules in isolation
- Then testing of the whole assembled system at once
Perspectives of testing
- Technology-facing programmer support: As coding guidelines
- Technology-facing product critiques: NFR testing
- Business-facing team support: Stories to help team understand requirements
- Business-facing product critiques: Business logic testing
Test selection
Requirements-based testing
- Goal: test case per requirements
- In agile: User stories instead of fixed requirements
- black box functional testing
- coverage testing
- application and domain specific: at the methodological level
Equivalence partitioning
- Requirement of different a-priori likelihood of faults per partition group
- Determining of equivalent input types and patterns
- Requirements-based: Manually aggregated
- Category partition method
- Identify categories, aka units
- Partition: choices of unit parameters
- Constraints among partition parameter settings
- Generation of combinations: Test-frames
- Transformation of test-frames into test cases by instantiating parameter choices
- Classification tree method
- Observation of input space regarding different dimensions
- Construction of disjoint sets in dimensions
- Instantiation of tests to cover whole domain
- Max number of testcases: Product of number of elements per root class
- Min number of testcases: Max of number of elements over all root classes
- -> Combinatorial testing
- Biggest class lower-bounds testcase selection
- Smaller classes can easily be distributed among the resulting testcases
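The two bounds above can be sketched directly; the root classes below (payment, customer, region) are purely illustrative, not from the lecture:

```python
from math import prod

# Hypothetical root classes of a classification tree, each listing its
# disjoint leaf choices (illustrative example only).
root_classes = {
    "payment": ["card", "cash", "invoice"],
    "customer": ["new", "returning"],
    "region": ["EU", "US"],
}

sizes = [len(choices) for choices in root_classes.values()]

# Upper bound: one test per full combination of leaf choices.
max_tests = prod(sizes)   # 3 * 2 * 2 = 12

# Lower bound: every leaf must appear in some test, and one test covers
# at most one leaf per root class, so the biggest class sets the minimum.
min_tests = max(sizes)    # 3

print(max_tests, min_tests)  # -> 12 3
```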
Control Flow-Based Testing
- Used for test assessment, not test selection!
- Achieve maximum coverage
- statement coverage
- branch coverage
- execution of every edge in control flow graph
- condition coverage
- branches with complex conditionals:
if(a && b)
- each atomic condition must evaluate to both true and false, regardless of the decision outcome
- does not subsume branch coverage
- condition / decision coverage
- each atomic condition must evaluate to both true and false
- the whole decision must also evaluate to both true and false
- subsumes branch coverage
- modified condition / decision coverage
- Recommended for requirements-based testing
- Testing of impact of single literal change
- Per literal: selection of two test cases where only that literal changes and the decision outcome changes too
- subsumes condition/decision coverage
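The MC/DC pair selection for the `a && b` example can be enumerated mechanically; this sketch simply brute-forces, per literal, the pairs of assignments where only that literal and the decision outcome differ:

```python
from itertools import product

def decision(a, b):
    # The decision under test: if(a && b)
    return a and b

cases = list(product([False, True], repeat=2))

def mcdc_pairs(literal_index):
    # All (c1, c2) where ONLY the given literal changes and the decision
    # outcome changes too -- exactly the MC/DC requirement per literal.
    pairs = []
    for c1 in cases:
        for c2 in cases:
            only_lit_changes = all(
                (c1[i] != c2[i]) == (i == literal_index) for i in range(2))
            if only_lit_changes and decision(*c1) != decision(*c2):
                pairs.append((c1, c2))
    return pairs

# For a && b: (T,T) vs (F,T) isolates a; (T,T) vs (T,F) isolates b,
# so the three cases {TT, FT, TF} already achieve MC/DC.
print(mcdc_pairs(0))
print(mcdc_pairs(1))
```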
- multiple condition coverage
- Combination of all input values: $2^n$
- Infeasible
- path coverage
- Coverage and unwinding of all entry-to-exit paths
- Infinite cases with loops without upper bound
- Feasible paths: limits on loops and nested conditionals
- Non-strict (short-circuit) evaluation: some operands are never evaluated, like b in
a || b
once a is already true
- Dependent parameters: Impossibility to achieve 100% coverage
- Boundary-interior tests
- Skip loops
- Loop once
- Loop multiple times
Combinatorial testing
- Pruning of very large CTE trees
- Consideration of application specific fault model
- Observation that faults occur through interaction of 2-4 inputs
- -> Pairwise testing of parameters
- Latin squares: Sudoku-like setup for grid of size 3
- X,Y,Value as values for the parameters
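A pairwise test suite can be built greedily: repeatedly pick the test case covering the most still-uncovered parameter-value pairs. The three parameters and their values below are hypothetical; the greedy heuristic is one common way to approximate the Latin-square construction:

```python
from itertools import combinations, product

# Hypothetical parameters, three values each (illustrative only).
params = {"os": ["linux", "mac", "win"],
          "browser": ["ff", "chrome", "safari"],
          "locale": ["de", "en", "fr"]}
names = list(params)

def pairs_of(test):
    # All parameter-value pairs exercised by one concrete test case.
    return {((names[i], test[i]), (names[j], test[j]))
            for i, j in combinations(range(len(names)), 2)}

all_tests = list(product(*params.values()))
needed = set().union(*(pairs_of(t) for t in all_tests))

# Greedy covering: pick the test with the most uncovered pairs.
suite = []
while needed:
    best = max(all_tests, key=lambda t: len(pairs_of(t) & needed))
    suite.append(best)
    needed -= pairs_of(best)

print(len(suite), "tests instead of", len(all_tests))
```

For three 3-valued parameters the optimum (via a Latin square) is 9 tests instead of the exhaustive 27; the greedy result lands close to that.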
Random Testing
- Black box testing
- Random inputs and testing for exception / no exception = robustness testing
General approaches
- Statistical testing
- Frequency analysis instead of uniform distribution
- Usage of Operational profiles
- Use case descriptions with probabilities
- Risk-based testing: Prioritization of important interactions
- Reliability assessments
- Representation through Markov chains
- Flaw: disaster scenarios get low probability and are thus rarely tested
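Drawing tests from an operational profile is a weighted sampling step; the use cases and probabilities below are made up for illustration:

```python
import random

# Hypothetical operational profile: use cases with occurrence
# probabilities (numbers are illustrative, not from the lecture).
profile = {"search": 0.6, "checkout": 0.3, "admin_export": 0.1}

rng = random.Random(0)
use_cases = list(profile)
weights = list(profile.values())

# Draw test cases according to the profile instead of uniformly; note the
# flaw: rare but critical actions are under-represented by construction.
sample = rng.choices(use_cases, weights=weights, k=1000)
print({u: sample.count(u) for u in use_cases})
```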
- GUI Random testing
- Vast input domain
- Markov Chains & FSM not scalable enough
- Random triggering of events for test cases
- Disadvantage: missing of deep actions
- Random Walk on specification
- Random Walk on environment models
- Random testing for AI: Not suited
- Adaptive Random Testing
- Utilizing patterns of failure-causing inputs
- Point (Grid) pattern
- Strip Pattern
- Block pattern
- Evenly spreading out test cases within input domain
- Random test candidate pool generation
- Selection of the test farthest away from already executed tests
- Significant improvement over random testing (52%)
- On average less test cases needed to find first failure
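The candidate-pool variant (fixed-size candidate set ART) is easy to sketch for a 1-D numeric input domain; pool size and domain bounds below are arbitrary illustration values:

```python
import random

def art_next(executed, rng, pool_size=10, lo=0.0, hi=100.0):
    # Adaptive random testing: generate a random candidate pool and pick
    # the candidate farthest (min distance) from all executed tests, so
    # tests spread evenly over the input domain.
    candidates = [rng.uniform(lo, hi) for _ in range(pool_size)]
    if not executed:
        return candidates[0]
    return max(candidates,
               key=lambda c: min(abs(c - e) for e in executed))

rng = random.Random(1)
executed = []
for _ in range(5):
    executed.append(art_next(executed, rng))
print(sorted(executed))
```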
Fuzzing
- Checking for software vulnerabilities through dynamic testing
- Generation of slightly abnormal fuzz test cases
- Random generation
- Model based generation
- Predefined model
- Inferred model: Specification mining
- Mutation based (model-less) generation
- Bit-flipping
- Arithmetic
- Block-based
- Dictionary-based
- Crossover: Genetic Algorithms
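Two of the model-less mutation operators above (bit-flipping and arithmetic) can be sketched as byte-level mutators; the HTTP-like seed is just an example input:

```python
import random

def bitflip(data: bytes, rng: random.Random) -> bytes:
    # Flip one random bit -- the simplest mutation-based fuzzing operator.
    buf = bytearray(data)
    pos = rng.randrange(len(buf) * 8)
    buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)

def arith(data: bytes, rng: random.Random) -> bytes:
    # Add a small delta to one random byte (arithmetic mutation).
    buf = bytearray(data)
    i = rng.randrange(len(buf))
    buf[i] = (buf[i] + rng.choice([-4, -1, 1, 4])) % 256
    return bytes(buf)

rng = random.Random(42)
seed = b"GET /index.html HTTP/1.1"   # example seed input
fuzz_inputs = [rng.choice([bitflip, arith])(seed, rng) for _ in range(5)]
for f in fuzz_inputs:
    print(f)
```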
- Program crashes seen as security violations
- Memory related crashes detected by OS
- Usage of sanitizers to catch non-fatal program errors
- Address sanitizers
- UB sanitizers
- Thread sanitizers
- Bug triage
- De-duplication
- Prioritization & Exploitability
- Test case minimization
- Approaches
- Black-Box fuzzing
- No knowledge about internals
- Used on 3rd party software
- Does not reach deeper levels
- Runs same paths again and again
- Strongly dependent on seeds
- White-Box fuzzing
- Symbolic execution
- Heavyweight runtime and code analysis to generate inputs for new program paths
- Does not scale
- Grey-Box fuzzing
- lightweight program instrumentation for runtime information gathering
- Aggregation of interesting test inputs for further mutations
- Prioritization of newly found inputs
- Unique program crashes
- Coverage
- Distance to potentially vulnerable code locations
- Best of both worlds from BB and WB
- Fuzzing timeout in the scale of hours
Tools
- RANDOOP: JUnit test case generation from method sequences and constructor invocations
- AutoTest: Contract-based test generation
- Simulant: random agent simulation based test generation
- GramTest: input grammar based test generation
- QuickCheck: random sequence of API calls based on user defined model
- Property-based testing: Definition of logical properties as assertions
- Generation of test cases to violate assertions
- Inverse operations
- Different paths
- Invariants
- Idempotent operations
- Reference implementation
- Minimization of input for failure
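A QuickCheck-style property test combines two of the points above: an inverse-operation property plus input minimization on failure. The toy codec and the greedy element-dropping shrinker are illustrative stand-ins:

```python
import random

def encode(xs):        # toy codec under test (illustrative)
    return [x + 1 for x in xs]

def decode(ys):
    return [y - 1 for y in ys]

def holds(xs):
    # Property: decode is the inverse of encode.
    return decode(encode(xs)) == xs

rng = random.Random(0)
for _ in range(100):
    xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
    if not holds(xs):
        # Shrink: greedily drop elements while the failure persists
        # (QuickCheck-style minimization of the failing input).
        i = 0
        while i < len(xs):
            cand = xs[:i] + xs[i + 1:]
            if not holds(cand):
                xs = cand
            else:
                i += 1
        print("minimal counterexample:", xs)
        break
else:
    print("property held for 100 random inputs")
```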
Analysis Random Testing vs. Partition-Based Testing
- Analytical comparisons of random testing with partition-based testing
- Criterion: Ability to detect at least one failure
Weyuker-Jeng Model
- Programs $P$ with input domain $D,~ |D| = d$, $m$ points produce incorrect output
- $d \gg m$
- Failure rate $\theta$ is probability of selecting a failure causing input
- $\theta_i = \frac{m_i}{|D_i|}$ failure rate of block (subdomain) $D_i$
- With operational distribution (statistic analysis): $p_i$ probability of test case belonging to $D_i$ with $\cup_{1 \leq i \leq k} D_i = D$
- $\theta = \sum_{1\leq i \leq k} p_i \theta_i$
- Uniform distribution: $p_i = \frac{|D_i|}{d};~\theta = \sum_{1\leq i \leq k}\frac{|D_i|}{d}*\theta_i = \frac{m}{d}$
- Probability of random testing causing at least one failure $P_r = 1 - (1 - \theta)^n$
- Probability of partition testing causing at least one failure $P_p = 1 - \prod_{1\leq i \leq k}(1-\theta_i)^{n_i}$, $n_i$ test cases per block $D_i$
- Goal: Maximizing $P_p$
- Best case: All failure-causing inputs in single block
- Worst case
- $n-1$ tests from $k-1$ blocks with only safe inputs, final block for all failure-causing inputs
- Random testing: $n$ tries with $\theta=\frac{m}{d}$
- Partition testing: $1$ try with $\theta=\frac{m}{d-n+1}$
- Quality of Partition-based testing dependent on failure distribution
- Blocks of equal size and uniform test picking: $P_p \geq P_r$
- More general: Requirement of consistent failure rate among blocks
- Refining of blocks into more equally sized sub-blocks always advantageous
- $P_r = P_p$ when $\theta_{1..k}$ equal, $n_{1..k}$ equal and distribution of failure-causing inputs among blocks equal
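The $P_r$ and $P_p$ formulas can be checked numerically; the domain size, test budget, and best-case block split below are illustrative numbers, not from the lecture:

```python
# Numeric check of the Weyuker-Jeng formulas: d = 1000 inputs, n = 10
# tests, m = 10 failure-causing inputs, k = 2 equal-sized blocks with
# all failure-causing inputs in one block (the best case).
d, n = 1000, 10
blocks = [(500, 10), (500, 0)]   # (|D_i|, m_i)

theta = sum(m for _, m in blocks) / d
P_r = 1 - (1 - theta) ** n       # random testing

# Partition testing with uniform allocation: n_i = n / k tests per block.
k = len(blocks)
P_p = 1.0
for size, m in blocks:
    P_p *= (1 - m / size) ** (n // k)
P_p = 1 - P_p

print(round(P_r, 4), round(P_p, 4))
```

With all failure-causing inputs concentrated in one block, $P_p$ comes out above $P_r$, matching the best-case claim.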
Gutjahr's Model
- Classifying group of programs according to "difficulty"
- Replacement of deterministic numbers of failure-causing inputs by random variables with expected values and std.dev
- Critical (controversial) assumption that expected failure rates are equal among blocks
- $k \cdot \overline{P_r} \geq \overline{P_p} \geq \overline{P_r}$
- Now partition based testing at least as good as random testing
Adaptive Partition Testing
- Combination of Partition and Random testing
- Leveraging of natural clustering of faults
- Main idea: probability of choosing a block (and a test within it) is increased or decreased depending on the test outcome
- Algorithms
- Markov-chain based
- Block probability as entry in Markov matrix
- Columns sum to 1
- Adjustment parameters $\gamma = 0.1$ and $\tau = 0.1$ for changing probability of transition and staying in block respectively
- Reward based
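A reward-based variant can be sketched as follows; the per-block failure rates, reward, and penalty factors are invented for illustration (they are not the $\gamma, \tau$ values of the Markov-chain algorithm):

```python
import random

# Reward-based adaptive partition testing sketch: a block's selection
# weight rises when it reveals a failure and decays slightly otherwise.
rng = random.Random(7)
true_theta = [0.001, 0.001, 0.05]   # hidden per-block failure rates
weights = [1.0, 1.0, 1.0]           # start uniform
reward, penalty = 1.5, 0.99         # illustrative adjustment factors

hits = [0, 0, 0]
for _ in range(2000):
    i = rng.choices(range(3), weights=weights)[0]
    failed = rng.random() < true_theta[i]
    weights[i] *= reward if failed else penalty
    if failed:
        hits[i] += 1

# Selection concentrates on the failure-prone block (fault clustering).
print(weights, hits)
```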
Test Selection II
Concurrency testing
- Causes
- Deadlocks: Coffman conditions
- Mutual exclusion
- No preemption
- Hold and wait
- Circular wait
- Livelocks
- Starvation
- Suspension
- Order violation (race conditions)
- Priority inversion
- Atomicity violation
- Bug patterns
- 97% of non-deadlock bugs are atomicity- or order violations
- 32% of non-deadlock bugs are order violations
- 73% of non-deadlock bugs fixed by technique other than locks
- 4% of bugs related to concurrency in Apache software
- Testing strategies
- Defect-based
- Long duration / Random testing
- Model based
- Manual delays
Testing for Cyber-physical systems
- Unhelpful requirements: "drive safely"
- Intuition: Situations difficult for human drivers
- Intuition: Collision always bad
- No clear oracle available (expected output encoder)
- Operation in dynamic open context
- X-in-the-Loop testing with simulations
- Gradual increase of scenario realism
- Challenge: Only NFR as testing goals
- Testing of continuous controller signals
- Testing for stability
- Input space discretization through start and goal values
- Search-based optimization with fitness function with threshold oracle
- Creation of complex fitness function by combining fitness functions of individual scenario goal components
- Genetic algorithms
- Definition of fault: Violation of safe operation envelope
- Scenario-based testing
Testing Machine-Learned models
- Test goal: Safety
- Pixel perturbation
- Robustness testing: resistance to adversarial inputs
- Symbolic execution
- Translation of NN into imperative program
- Detection of most important pixels in labeled image
- Creation of adversarial image from original image and important pixels as perturbations
- Creation of constraint-solution problem for
- Same execution path until label output
- Change of label due to pixel change
- Neuron Coverage
- Ratio of activated neurons by test input
- Partitioning of input space by neuron coverage
- Selection of seed image and application of synthetic image transformations
- Goal: Increase of neuron coverage
- Test Oracle
- Majority voting by sibling NNs
- Domain-specific metamorphic relations (-> QuickCheck)
- Coverage-guided fuzzing: Discard decision based on coverage change
Test Ending Criteria
- Difficult assessment of test suite quality and test selection criteria
- Fault Injection: Run tests on a system with injected faults and evaluate how many are detected
- Mutation Analysis
- Small syntactic code change
- evaluation of test-detection on mutant programs
- Equivalent mutants: Undecidable problem
- Assumptions
- Competent programmer hypothesis: Only syntactical faults
- Coupling hypothesis: Correlation of syntactic faults with other kinds of faults
- Assessing selection criteria: comparison of failure detection ability of
- implementations wrt. single spec.
- mutations of implementation wrt. single spec.
- implementations wrt. respective spec.
- Real or seeded faults
Regression Testing
- Retest-all: Expensive at scale
- Increase of cost effectiveness of regression tests
- Test prioritization
- Ordering of tests according to selected surrogate property (coverage, historical failures)
- Evaluation
- Average percentage of fault detection (APFD)
- Cost aware APFD
- No one-fits-all surrogate
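APFD follows the standard formula $1 - \frac{\sum_i TF_i}{n \cdot m} + \frac{1}{2n}$, where $TF_i$ is the position of the first test revealing fault $i$, $n$ the number of tests and $m$ the number of faults; the test suite and fault matrix below are invented:

```python
def apfd(order, faults_detected_by):
    # order: test ids in prioritized execution order
    # faults_detected_by: fault -> set of tests that reveal it
    n, m = len(order), len(faults_detected_by)
    pos = {t: i + 1 for i, t in enumerate(order)}
    tf = [min(pos[t] for t in tests)
          for tests in faults_detected_by.values()]
    return 1 - sum(tf) / (n * m) + 1 / (2 * n)

# Hypothetical fault matrix: which tests reveal which fault.
faults = {"f1": {"t3"}, "f2": {"t3"}, "f3": {"t4"}}

baseline = apfd(["t1", "t2", "t3", "t4"], faults)
prioritized = apfd(["t3", "t4", "t1", "t2"], faults)
print(baseline, prioritized)  # prioritized order scores higher
```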
- Test suite minimization
- Pruning of redundant test cases
- Evaluation
- Size / execution time reduction
- Reduction in fault-detection effectiveness
- Test selection (RTS)
- Execution of only relevant tests
- Evaluation
- Inclusiveness / Safety: false negatives
- Precision: false positives
- Expensive
- Test optimization approaches
- Dynamic program analysis
- Focus on execution traces of tests
- Different hypotheses for different techniques
- Static program analysis
- Approximation of execution behavior by operating on different static representations of programs (CFGs)
- Same hypotheses, less precise methods
- Problem: Reflection
- Predictive techniques
- Execution history
- Code authorship
- Commit size and frequency
- File extensions
Model-based Testing
- Goal: solving the oracle problem (poorly specified expected outputs)
- Explicit model replacement of mental specification model
- Test case specifications filter the model behavior into individual tests
- Verification: Comparison of system result with model output
- Validation: Comparison of model behavior with specification
- Required abstraction from SUT to model
- Needed (partial) model of environment
- Abstraction approaches
- Driver components as bridges between abstract model and SUT (50% of testing effort)
- Scenarios
- Test and code generated from single model
- Model driven development
- Assumptions on the environment
- No real testing possible: system and tests are equivalent (no redundancy)
- Require redundancy between tests and system
- System generated by model and tests generated from model
- Extremely expensive
- Roundtrip problem: Backpropagation of system code changes
- Model for test only
- Traditional specification to build system
- Provides redundancy
- Model extraction from code
- SUT as black-box, model as white-box
- Test selection criteria (for FSM)
- Functional
- Domain & requirements specific
- Methodological support
- Ad-Hoc
- Structural
- Domain independent
- Data flow, control flow, data
- Measurable
- Unclear ability of revealing faults
- Models of SUT and env.
- Stochastic
- Fault-based
- Test Case Generation
- Dedicated algorithms for specific criteria
- Bounded model checking
- Symbolic execution
- Heuristic search
- Cost effectiveness
- Little reuse
- Model errors
- Structured tests complement functional tests
Testing OO Software
- Inherently specific fault models
- Software characteristics impacting testing
- State dependent behavior
- Encapsulation
- Inheritance
- Polymorphism and dynamic binding
- Abstract & generic classes
- Exceptions
Testing strategies
- State-based testing
- Interpretation of code as EFSM (FSM + state)
- Harel Statecharts
- Mealy Machines (FSM with input dependent output) with
- Hierarchy
- Parallel composition
- Broadcast communication
- Incomplete specifications: Illegal execution exception
- Fault models / Defect testing
- Missing or incorrect transition: Resulting state is incorrect yet not corrupt
- Missing or incorrect action: Wrong results of executing transition
- Extra, missing or corrupt state: UB
- Sneak-path or illegal message failure
- Trap-door
- Composition failures: Conflict between sub- and superclass
- Test approaches
- Testing of possible faults from transitions and sequences
- Limit testing for state invariants and transition legality
- EFSM-based testing
- Specification into State machine and possibly Statechart
- Deriving of test case operation sequences from state machine
- Requirement of object state access for test oracle output verification
- Testing with inheritance
- Fault models
- Incorrect initialization
- Inadvertent bindings (shadowing)
- Naked access
- Incorrect location of subclass (Liskov)
- Naughty children: Illegal state for superclass (Liskov)
- Worm holes: Inconsistent state compared to superclass invariants
- Spaghetti inheritance
- Testing approaches
- Hierarchy flattening
- Test reuse (Testing history) related to superclass
- Testing with polymorphism
- Combinatorial explosion problem
- Testing approaches
- Polymorphic Server test
- Verification of Liskov compliance
- Testing of all inheritance related fault models without sequencing constraints
- Testing of all callee combinations
- Modal Hierarchy test
- Flattening of hierarchies
- Constraints on method sequencing
- Polymorphic Message test
- Inverse of Server test
- Testing of all caller combinations
- Testing with exception handlers
- Implicit control flow created by exceptions
- Impracticality to treat exception flow as normal flow due to sheer numbers
- Testing of each handler and utilizing throw / re-throw
- For non-local exception handlers: Enforcing & testing for design rule: methods with throw must be side-effect free
From Failures to Faults
- Brute Force: Step-debugging
- Binary Search
- Delta debugging
- Convergence on fault input with adaptive granularity
- Starts from the failing test case (top-down) or grows from a passing test case (bottom-up) over some set of input elements
- Granularity change possible through subset and complement
- Algorithm
- Split of set into n subsets
- Foreach subset: Test complement of subset
- Pass: Continue
- Fail
- split original set into max(2, N - 1) subsets
- N - 1 to skip previous test on subset complement
- Repeat loop with new subsets
- If no more complements, increase granularity
- End cycle if end of granularity increase reached
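The loop above can be sketched directly; `test` is a stand-in for re-running the failing test on a candidate input, and the failure condition at the bottom is a made-up example:

```python
def ddmin(inp, test):
    # Sketch of the ddmin loop described above: shrink a failing input
    # while test(x) still reports the failure (returns True).
    inp = list(inp)
    n = 2
    while len(inp) >= 2:
        chunk = len(inp) // n
        subsets = [inp[i * chunk:(i + 1) * chunk] for i in range(n - 1)]
        subsets.append(inp[(n - 1) * chunk:])   # last takes the remainder
        reduced = False
        for i in range(n):
            complement = [x for j in range(n) if j != i
                          for x in subsets[j]]
            if test(complement):                # complement still fails
                inp = complement
                n = max(n - 1, 2)               # skip the re-test, go on
                reduced = True
                break
        if not reduced:
            if n >= len(inp):                   # granularity exhausted
                break
            n = min(n * 2, len(inp))            # increase granularity
    return inp

# Hypothetical failure condition: the input fails iff it contains 3 and 7.
fails = lambda xs: 3 in xs and 7 in xs
print(ddmin(list(range(10)), fails))  # -> [3, 7]
```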
- Fault localization
- A posteriori hints (localization after test results)
- Based on program spectra
- Abstraction of program runs
- Single or multiple fault
- Collection of data providing specific view of dynamic behavior (block hit counters)
- Indication of correlation of block hits and failures
- Similarity coefficients $a_{ij}$: number of runs with block-hit status $i$ and failure status $j$ ($1$ = yes, $0$ = no)
- Jaccard: $\frac{a_{11}}{a_{11}+a_{01}+a_{10}}$
- Tarantula: $\frac{\frac{a_{11}}{a_{11}+a_{01}}}{\frac{a_{11}}{a_{11}+a_{01}}+\frac{a_{10}}{a_{10}+a_{00}}}$
- AMPLE: $\left|\frac{a_{11}}{a_{11}+a_{01}}-\frac{a_{10}}{a_{10}+a_{00}}\right|$
- Ochiai: $\frac{a_{11}}{\sqrt{(a_{11}+a_{01})*(a_{11}+a_{10})}}$
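The coefficients can be computed from block-hit spectra; the five-run spectrum below is toy data in which block 2 is hit by all failing runs and no passing run:

```python
# Spectrum-based fault localization: one row per test run, recording
# which blocks were hit and whether the run failed (toy data).
spectra = [
    ([1, 1, 0], False),
    ([1, 0, 1], True),
    ([0, 1, 1], True),
    ([1, 1, 1], True),
    ([1, 1, 0], False),
]

def counts(block):
    # a_ij: i = block hit (1/0), j = run failed (1/0).
    a11 = sum(1 for hits, f in spectra if hits[block] and f)
    a10 = sum(1 for hits, f in spectra if hits[block] and not f)
    a01 = sum(1 for hits, f in spectra if not hits[block] and f)
    a00 = sum(1 for hits, f in spectra if not hits[block] and not f)
    return a11, a10, a01, a00

def ochiai(block):
    a11, a10, a01, _ = counts(block)
    return a11 / ((a11 + a01) * (a11 + a10)) ** 0.5

def tarantula(block):
    a11, a10, a01, a00 = counts(block)
    fail_rate = a11 / (a11 + a01)
    pass_rate = a10 / (a10 + a00) if a10 + a00 else 0.0
    return fail_rate / (fail_rate + pass_rate)

for b in range(3):
    print(b, round(tarantula(b), 3), round(ochiai(b), 3))
```

Block 2 gets the maximal suspiciousness under both metrics, as expected from its perfect correlation with the failing runs.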
- Improvements
- Syntactic block granularity (try, if, method)
- Tackling confounding bias (dependency between elements)
- Re-ranking algorithm taking into account dynamic call graph and explicit data-dependency of failed tests
- Recursive inspection of parents of falsely identified faulty element in dynamic call graph
- A priori hints
- Utilizing data collected on bug-reports and their impact on code-properties
- Code complexity
- Problem domain
- Past history
- Process quality
- No a-priori correlation of metrics and faults
- Based on code metrics
- Based on code churn (code changes): Very good indicator
- Positively influencing factors
- Team membership
- Local vs distributed teams: insignificant
- Programming languages
- Domain
- Failure Clustering
- Few faults cause many tests to fail
- Group failures originating from the same fault and investigate a representative
- Coverage approach
- Usage of fault localization techniques to decide cluster number (similarity metrics)
- Hierarchical clustering of failing test cases by similar coverage profiles
- Representative as test closest to cluster center
- 80% reduction in analysis time when using function level granularity
- 100% of faults from representative-inspection
- Non-coverage approach using code-independent data
- Failed / passed history
- Broken / Repaired history
- Jira history
- Usage of regression model for clustering
- 60% reduction in analysis time
- 80% of faults from representative-inspection