Maximilian Stark (mail@dakror.de), SS2020
As of: 30.07.2020
Advanced Topics of Software Testing
Test Selection
- Similar techniques
- Inspections
- Walkthroughs
- Reviews
- Fault: concrete mistake in the software causing errors
- Error: State deviation from ideal
- Failure: Behavior deviation resulting from errors
- Good tests: Ability to detect likely failures with good cost-effectiveness
- Grouping of input equivalence classes
- based on control flow
- based on data flow
- based on stochastic profiles
- Limit Testing
Testing in Numbers
- Most bugs in requirements
- Most detection very late in integration process
- From 100 Faults / KLOC at development to 1-2 at release after testing
- Code reading as most effective fault detection
- Decreasing proportion of errors during construction with increasing project size
- Heavily skewed fault density (Pareto: 80% of faults from 20% of modules)
Types of Testing
- Unit testing
- Integration testing
- Caller-replacement: Driver
- Callee-replacement: Stub
- Top down: only stubs
- Bottom up: only drivers
- System testing
- Acceptance testing
- Black-Box vs White-Box testing
- BB: Pure functionality testing
- WB: System under test structure testing
- Robustness testing
- Performance testing
- Regression testing
- E2E testing (GUI)
- Testing pyramid
- Ordered by increasing cost and decreasing execution speed
- Unit
- Integration
- E2E
- Manual
- Big-Bang Testing
- No incremental integration: testing of individual modules in isolation
- Then testing of the whole assembled system at once
Perspectives of testing
- Technology-facing programmer support: As coding guidelines
- Technology-facing product critiques: NFR testing
- Business-facing team support: Stories to help team understand requirements
- Business-facing product critiques: Business logic testing
Test selection
Requirements-based testing
- Goal: test case per requirements
- In agile: User stories instead of fixed requirements
- black box functional testing
- coverage testing
- application and domain specific: at the methodological level
Equivalence partitioning
- Requirement of different a-priori likelihood of faults per partition group
- Determining of equivalent input types and patterns
- Requirements-based: Manually aggregated
- Category partition method
- Identify categories, aka units
- Partition: choices of unit parameters
- Constraints among partition parameter settings
- Generation of combinations: Test-frames
- Transformation of test-frames into test cases by instantiating parameter choices
- Classification tree method
- Observation of input space regarding different dimensions
- Construction of disjoint sets in dimensions
- Instantiation of tests to cover whole domain
- Max number of testcases: Product of number of elements per root class
- Min number of testcases: Max of number of elements over all root classes
- -> Combinatorial testing
- Biggest class lower-bounds testcase selection
- Smaller classes can easily be distributed among the resulting testcases
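The two bounds above can be sketched directly; the root classes below (payment, customer, region) are purely illustrative, not from the lecture:

```python
from math import prod

# Hypothetical root classes of a classification tree, each listing its
# disjoint leaf choices (illustrative example only).
root_classes = {
    "payment": ["card", "cash", "invoice"],
    "customer": ["new", "returning"],
    "region": ["EU", "US"],
}

sizes = [len(choices) for choices in root_classes.values()]

# Upper bound: one test per full combination of leaf choices.
max_tests = prod(sizes)   # 3 * 2 * 2 = 12

# Lower bound: every leaf must appear in some test, and one test covers
# at most one leaf per root class, so the biggest class sets the minimum.
min_tests = max(sizes)    # 3

print(max_tests, min_tests)  # -> 12 3
```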
Control Flow-Based Testing
- Used for test assessment, not test selection!
- Achieve maximum coverage
- statement coverage
- branch coverage
- execution of every edge in control flow graph
- condition coverage
- branches with complex conditionals:
if(a && b)
- each atomic condition must evaluate to both true and false, regardless of the decision outcome
- does not subsume branch coverage
- condition / decision coverage
- each atomic condition must evaluate to both true and false
- the whole decision must also evaluate to both true and false
- subsumes branch coverage
- modified condition / decision coverage
- Recommended for requirements-based testing
- Testing of impact of single literal change
- Per literal: selection of two test cases where only that literal changes and the decision outcome changes too
- subsumes condition/decision coverage
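The MC/DC pair selection for the `a && b` example can be enumerated mechanically; this sketch simply brute-forces, per literal, the pairs of assignments where only that literal and the decision outcome differ:

```python
from itertools import product

def decision(a, b):
    # The decision under test: if(a && b)
    return a and b

cases = list(product([False, True], repeat=2))

def mcdc_pairs(literal_index):
    # All (c1, c2) where ONLY the given literal changes and the decision
    # outcome changes too -- exactly the MC/DC requirement per literal.
    pairs = []
    for c1 in cases:
        for c2 in cases:
            only_lit_changes = all(
                (c1[i] != c2[i]) == (i == literal_index) for i in range(2))
            if only_lit_changes and decision(*c1) != decision(*c2):
                pairs.append((c1, c2))
    return pairs

# For a && b: (T,T) vs (F,T) isolates a; (T,T) vs (T,F) isolates b,
# so the three cases {TT, FT, TF} already achieve MC/DC.
print(mcdc_pairs(0))
print(mcdc_pairs(1))
```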
- multiple condition coverage
- Combination of all input values: $2^n$
- Infeasible
- path coverage
- Coverage and unwinding of all entry-to-exit paths
- Infinite cases with loops without upper bound
- Feasible paths: limits on loops and nested conditionals
- Non-strict (short-circuit) evaluation: some operands are never evaluated, like b in
a || b
once a is already true
- Dependent parameters: Impossibility to achieve 100% coverage
- Boundary-interior tests
- Skip loops
- Loop once
- Loop multiple times
Combinatorial testing
- Pruning of very large CTE trees
- Consideration of application specific fault model
- Observation that faults occur through interaction of 2-4 inputs
- -> Pairwise testing of parameters
- Latin squares: Sudoku-like setup for grid of size 3
- X,Y,Value as values for the parameters
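A pairwise test suite can be built greedily: repeatedly pick the test case covering the most still-uncovered parameter-value pairs. The three parameters and their values below are hypothetical; the greedy heuristic is one common way to approximate the Latin-square construction:

```python
from itertools import combinations, product

# Hypothetical parameters, three values each (illustrative only).
params = {"os": ["linux", "mac", "win"],
          "browser": ["ff", "chrome", "safari"],
          "locale": ["de", "en", "fr"]}
names = list(params)

def pairs_of(test):
    # All parameter-value pairs exercised by one concrete test case.
    return {((names[i], test[i]), (names[j], test[j]))
            for i, j in combinations(range(len(names)), 2)}

all_tests = list(product(*params.values()))
needed = set().union(*(pairs_of(t) for t in all_tests))

# Greedy covering: pick the test with the most uncovered pairs.
suite = []
while needed:
    best = max(all_tests, key=lambda t: len(pairs_of(t) & needed))
    suite.append(best)
    needed -= pairs_of(best)

print(len(suite), "tests instead of", len(all_tests))
```

For three 3-valued parameters the optimum (via a Latin square) is 9 tests instead of the exhaustive 27; the greedy result lands close to that.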
Random Testing
- Black box testing
- Random inputs and testing for exception / no exception = robustness testing
General approaches
- Statistical testing
- Frequency analysis instead of uniform distribution
- Usage of Operational profiles
- Use case descriptions with probabilities
- Risk-based testing: Prioritization of important interactions
- Reliability assessments
- Representation through Markov chains
- Flaw: disaster scenarios get low probability and are thus rarely tested
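Drawing tests from an operational profile is a weighted sampling step; the use cases and probabilities below are made up for illustration:

```python
import random

# Hypothetical operational profile: use cases with occurrence
# probabilities (numbers are illustrative, not from the lecture).
profile = {"search": 0.6, "checkout": 0.3, "admin_export": 0.1}

rng = random.Random(0)
use_cases = list(profile)
weights = list(profile.values())

# Draw test cases according to the profile instead of uniformly; note the
# flaw: rare but critical actions are under-represented by construction.
sample = rng.choices(use_cases, weights=weights, k=1000)
print({u: sample.count(u) for u in use_cases})
```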
- GUI Random testing
- Vast input domain
- Markov Chains & FSM not scalable enough
- Random triggering of events for test cases
- Disadvantage: missing of deep actions
- Random Walk on specification
- Random Walk on environment models
- Random testing for AI: Not suited
- Adaptive Random Testing
- Utilizing patterns of failure-causing inputs
- Point (Grid) pattern
- Strip Pattern
- Block pattern
- Evenly spreading out test cases within input domain
- Random test candidate pool generation
- Selection of the test farthest away from already executed tests
- Significant improvement over random testing (52%)
- On average less test cases needed to find first failure
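The candidate-pool variant (fixed-size candidate set ART) is easy to sketch for a 1-D numeric input domain; pool size and domain bounds below are arbitrary illustration values:

```python
import random

def art_next(executed, rng, pool_size=10, lo=0.0, hi=100.0):
    # Adaptive random testing: generate a random candidate pool and pick
    # the candidate farthest (min distance) from all executed tests, so
    # tests spread evenly over the input domain.
    candidates = [rng.uniform(lo, hi) for _ in range(pool_size)]
    if not executed:
        return candidates[0]
    return max(candidates,
               key=lambda c: min(abs(c - e) for e in executed))

rng = random.Random(1)
executed = []
for _ in range(5):
    executed.append(art_next(executed, rng))
print(sorted(executed))
```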
Fuzzing
- Checking for software vulnerabilities through dynamic testing
- Generation of slightly abnormal fuzz test cases
- Random generation
- Model based generation
- Predefined model
- Inferred model: Specification mining
- Mutation based (model-less) generation
- Bit-flipping
- Arithmetic
- Block-based
- Dictionary-based
- Crossover: Genetic Algorithms
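Two of the model-less mutation operators above (bit-flipping and arithmetic) can be sketched as byte-level mutators; the HTTP-like seed is just an example input:

```python
import random

def bitflip(data: bytes, rng: random.Random) -> bytes:
    # Flip one random bit -- the simplest mutation-based fuzzing operator.
    buf = bytearray(data)
    pos = rng.randrange(len(buf) * 8)
    buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)

def arith(data: bytes, rng: random.Random) -> bytes:
    # Add a small delta to one random byte (arithmetic mutation).
    buf = bytearray(data)
    i = rng.randrange(len(buf))
    buf[i] = (buf[i] + rng.choice([-4, -1, 1, 4])) % 256
    return bytes(buf)

rng = random.Random(42)
seed = b"GET /index.html HTTP/1.1"   # example seed input
fuzz_inputs = [rng.choice([bitflip, arith])(seed, rng) for _ in range(5)]
for f in fuzz_inputs:
    print(f)
```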
- Program crashes seen as security violations
- Memory related crashes detected by OS
- Usage of sanitizers to catch non-fatal program errors
- Address sanitizers
- UB sanitizers
- Thread sanitizers
- Bug triage
- De-duplication
- Prioritization & Exploitability
- Test case minimization
- Approaches
- Black-Box fuzzing
- No knowledge about internals
- Used on 3rd party software
- Does not reach deeper levels
- Runs same paths again and again
- Strongly dependent on seeds
- White-Box fuzzing
- Symbolic execution
- Heavyweight runtime and code analysis to generate inputs for new program paths
- Does not scale
- Grey-Box fuzzing
- lightweight program instrumentation for runtime information gathering
- Aggregation of interesting test inputs for further mutations
- Prioritization of newly found inputs
- Unique program crashes
- Coverage
- Distance to potentially vulnerable code locations
- Best of both worlds from BB and WB
- Fuzzing timeout in the scale of hours
Tools
- RANDOOP: JUnit test case generation from method sequences and constructor invocations
- AutoTest: Contract-based test generation
- Simulant: random agent simulation based test generation
- GramTest: input grammar based test generation
- QuickCheck: random sequence of API calls based on user defined model
- Property-based testing: Definition of logical properties as assertions
- Generation of test cases to violate assertions
- Inverse operations
- Different paths
- Invariants
- Idempotent operations
- Reference implementation
- Minimization of input for failure
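A QuickCheck-style property test combines two of the points above: an inverse-operation property plus input minimization on failure. The toy codec and the greedy element-dropping shrinker are illustrative stand-ins:

```python
import random

def encode(xs):        # toy codec under test (illustrative)
    return [x + 1 for x in xs]

def decode(ys):
    return [y - 1 for y in ys]

def holds(xs):
    # Property: decode is the inverse of encode.
    return decode(encode(xs)) == xs

rng = random.Random(0)
for _ in range(100):
    xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
    if not holds(xs):
        # Shrink: greedily drop elements while the failure persists
        # (QuickCheck-style minimization of the failing input).
        i = 0
        while i < len(xs):
            cand = xs[:i] + xs[i + 1:]
            if not holds(cand):
                xs = cand
            else:
                i += 1
        print("minimal counterexample:", xs)
        break
else:
    print("property held for 100 random inputs")
```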
Analysis Random Testing vs. Partition-Based Testing
- Analytical comparisons of random testing with partition-based testing
- Criterion: Ability to detect at least one failure
Weyuker-Jeng Model
- Programs $P$ with input domain $D,~ |D| = d$, $m$ points produce incorrect output
- $d \gg m$
- Failure rate $\theta$ is probability of selecting a failure causing input
- $\theta_i = \frac{m_i}{|D_i|}$ failure rate of block (subdomain) $D_i$
- With operational distribution (statistic analysis): $p_i$ probability of test case belonging to $D_i$ with $\cup_{1 \leq i \leq k} D_i = D$
- $\theta = \sum_{1\leq i \leq k} p_i \theta_i$
- Uniform distribution: $p_i = \frac{|D_i|}{d};~\theta = \sum_{1\leq i \leq k}\frac{|D_i|}{d}*\theta_i = \frac{m}{d}$
- Probability of random testing causing at least one failure $P_r = 1 - (1 - \theta)^n$
- Probability of partition testing causing at least one failure $P_p = 1 - \prod_{1\leq i \leq k}(1-\theta_i)^{n_i}$, $n_i$ test cases per block $D_i$
- Goal: Maximizing $P_p$
- Best case: All failure-causing inputs in single block
- Worst case
- $n-1$ tests from $k-1$ blocks with only safe inputs, final block for all failure-causing inputs
- Random testing: $n$ tries with $\theta=\frac{m}{d}$
- Partition testing: $1$ try with $\theta=\frac{m}{d-n+1}$
- Quality of Partition-based testing dependent on failure distribution
- Blocks of equal size and uniform test picking: $P_p \geq P_r$
- More general: Requirement of consistent failure rate among blocks
- Refining of blocks into more equally sized sub-blocks always advantageous
- $P_r = P_p$ when $\theta_{1..k}$ equal, $n_{1..k}$ equal and distribution of failure-causing inputs among blocks equal
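The $P_r$ and $P_p$ formulas can be checked numerically; the domain size, test budget, and best-case block split below are illustrative numbers, not from the lecture:

```python
# Numeric check of the Weyuker-Jeng formulas: d = 1000 inputs, n = 10
# tests, m = 10 failure-causing inputs, k = 2 equal-sized blocks with
# all failure-causing inputs in one block (the best case).
d, n = 1000, 10
blocks = [(500, 10), (500, 0)]   # (|D_i|, m_i)

theta = sum(m for _, m in blocks) / d
P_r = 1 - (1 - theta) ** n       # random testing

# Partition testing with uniform allocation: n_i = n / k tests per block.
k = len(blocks)
P_p = 1.0
for size, m in blocks:
    P_p *= (1 - m / size) ** (n // k)
P_p = 1 - P_p

print(round(P_r, 4), round(P_p, 4))
```

With all failure-causing inputs concentrated in one block, $P_p$ comes out above $P_r$, matching the best-case claim.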
Gutjahr's Model
- Classifying group of programs according to "difficulty"
- Replacement of deterministic numbers of failure-causing inputs by random variables with expected values and std.dev
- Critical (controversial) assumption that expected failure rates are equal among blocks
- $k \cdot \overline{P_r} \geq \overline{P_p} \geq \overline{P_r}$
- Now partition based testing at least as good as random testing
Adaptive Partition Testing
- Combination of Partition and Random testing
- Leveraging of natural clustering of faults
- Main idea: probability of choosing a block (and a test within it) is increased or decreased depending on the test outcome
- Algorithms
- Markov-chain based
- Block probability as entry in Markov matrix
- Columns sum to 1
- Adjustment parameters $\gamma = 0.1$ and $\tau = 0.1$ for changing probability of transition and staying in block respectively
- Reward based
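A reward-based variant can be sketched as follows; the per-block failure rates, reward, and penalty factors are invented for illustration (they are not the $\gamma, \tau$ values of the Markov-chain algorithm):

```python
import random

# Reward-based adaptive partition testing sketch: a block's selection
# weight rises when it reveals a failure and decays slightly otherwise.
rng = random.Random(7)
true_theta = [0.001, 0.001, 0.05]   # hidden per-block failure rates
weights = [1.0, 1.0, 1.0]           # start uniform
reward, penalty = 1.5, 0.99         # illustrative adjustment factors

hits = [0, 0, 0]
for _ in range(2000):
    i = rng.choices(range(3), weights=weights)[0]
    failed = rng.random() < true_theta[i]
    weights[i] *= reward if failed else penalty
    if failed:
        hits[i] += 1

# Selection concentrates on the failure-prone block (fault clustering).
print(weights, hits)
```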
Test Selection II
Concurrency testing
- Causes
- Deadlocks: Coffman conditions
- Mutual exclusion
- No preemption
- Hold and wait
- Circular wait
- Livelocks
- Starvation
- Suspension
- Order violation (race conditions)
- Priority inversion
- Atomicity violation
- Bug patterns
- 97% of non-deadlock bugs are atomicity- or order violations
- 32% of non-deadlock bugs are order violations
- 73% of non-deadlock bugs fixed by technique other than locks
- 4% of bugs related to concurrency in Apache software
- Testing strategies
- Defect-based
- Long duration / Random testing
- Model based
- Manual delays
Testing for Cyber-physical systems
- Unhelpful requirements: "drive safely"
- Intuition: Situations difficult for human drivers
- Intuition: Collision always bad
- No clear oracle available (expected output encoder)
- Operation in dynamic open context
- X-in-the-Loop testing with simulations
- Gradual increase of scenario realism
- Challenge: Only NFR as testing goals
- Testing of continuous controller signals
- Testing for stability
- Input space discretization through start and goal values
- Search-based optimization with fitness function with threshold oracle
- Creation of complex fitness function by combining fitness functions of individual scenario goal components
- Genetic algorithms
- Definition of fault: Violation of safe operation envelope
- Scenario-based testing
Testing Machine-Learned models
- Test goal: Safety
- Pixel perturbation
- Robustness testing: resistance to adversarial inputs
- Symbolic execution
- Translation of NN into imperative program
- Detection of most important pixels in labeled image
- Creation of adversarial image from original image and important pixels as perturbations
- Creation of constraint-solution problem for
- Same execution path until label output
- Change of label due to pixel change
- Neuron Coverage
- Ratio of activated neurons by test input
- Partitioning of input space by neuron coverage
- Selection of seed image and application of synthetic image transformations
- Goal: Increase of neuron coverage
- Test Oracle
- Majority voting by sibling NNs
- Domain-specific metamorphic relations (-> QuickCheck)
- Coverage-guided fuzzing: Discard decision based on coverage change
Test Ending Criteria
- Difficult assessment of test suite quality and test selection criteria
- Fault Injection: Run tests on a system with injected faults and evaluate how many are detected
- Mutation Analysis
- Small syntactic code change
- evaluation of test-detection on mutant programs
- Equivalent mutants: Undecidable problem
- Assumptions
- Competent programmer hypothesis: Only syntactical faults
- Coupling hypothesis: Correlation of syntactic faults with other kinds of faults
- Assessing selection criteria: comparison of failure detection ability of
- implementations wrt. single spec.
- mutations of implementation wrt. single spec.
- implementations wrt. respective spec.
- Real or seeded faults
Regression Testing
- Retest-all: Expensive at scale
- Increase of cost effectiveness of regression tests
- Test prioritization
- Ordering of tests according to selected surrogate property (coverage, historical failures)
- Evaluation
- Average percentage of fault detection (APFD)
- Cost aware APFD
- No one-fits-all surrogate
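APFD follows the standard formula $1 - \frac{\sum_i TF_i}{n \cdot m} + \frac{1}{2n}$, where $TF_i$ is the position of the first test revealing fault $i$, $n$ the number of tests and $m$ the number of faults; the test suite and fault matrix below are invented:

```python
def apfd(order, faults_detected_by):
    # order: test ids in prioritized execution order
    # faults_detected_by: fault -> set of tests that reveal it
    n, m = len(order), len(faults_detected_by)
    pos = {t: i + 1 for i, t in enumerate(order)}
    tf = [min(pos[t] for t in tests)
          for tests in faults_detected_by.values()]
    return 1 - sum(tf) / (n * m) + 1 / (2 * n)

# Hypothetical fault matrix: which tests reveal which fault.
faults = {"f1": {"t3"}, "f2": {"t3"}, "f3": {"t4"}}

baseline = apfd(["t1", "t2", "t3", "t4"], faults)
prioritized = apfd(["t3", "t4", "t1", "t2"], faults)
print(baseline, prioritized)  # prioritized order scores higher
```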
- Test suite minimization
- Pruning of redundant test cases
- Evaluation
- Size / execution time reduction
- Reduction in fault-detection effectiveness
- Test selection (RTS)
- Execution of only relevant tests
- Evaluation
- Inclusiveness / Safety: false negatives
- Precision: false positives
- Expensive
- Test optimization approaches
- Dynamic program analysis
- Focus on execution traces of tests
- Different hypotheses for different techniques
- Static program analysis
- Approximation of execution behavior by operating on different static representations of programs (CFGs)
- Same hypotheses, less precise methods
- Problem: Reflection
- Predictive techniques
- Execution history
- Code authorship
- Commit size and frequency
- File extensions
Model-based Testing
- Goal: solving the oracle problem (poorly specified expected outputs)
- Explicit model replacement of mental specification model
- Test case specifications filter the model behavior into individual tests
- Verification: Comparison of system result with model output
- Validation: Comparison of model behavior with specification
- Required abstraction from SUT to model
- Needed (partial) model of environment
- Abstraction approaches
- Driver components as bridges between abstract model and SUT (50% of testing effort)
- Scenarios
- Test and code generated from single model
- Model driven development
- Assumptions on the environment
- No real testing possible: system and tests are equivalent (no redundancy)
- Require redundancy between tests and system
- System generated by model and tests generated from model
- Extremely expensive
- Roundtrip problem: Backpropagation of system code changes
- Model for test only
- Traditional specification to build system
- Provides redundancy
- Model extraction from code
- SUT as black-box, model as white-box
- Test selection criteria (for FSM)
- Functional
- Domain & requirements specific
- Methodological support
- Ad-Hoc
- Structural
- Domain independent
- Data flow, control flow, data
- Measurable
- Unclear ability of revealing faults
- Models of SUT and env.
- Stochastic
- Fault-based
- Test Case Generation
- Dedicated algorithms for specific criteria
- Bounded model checking
- Symbolic execution
- Heuristic search
- Cost effectiveness
- Little reuse
- Model errors
- Structured tests complement functional tests
Testing OO Software
- Inherently specific fault models
- Software characteristics impacting testing
- State dependent behavior
- Encapsulation
- Inheritance
- Polymorphism and dynamic binding
- Abstract & generic classes
- Exceptions
Testing strategies
- State-based testing
- Interpretation of code as EFSM (FSM + state)
- Harel Statecharts
- Mealy Machines (FSM with input dependent output) with
- Hierarchy
- Parallel composition
- Broadcast communication
- Incomplete specifications: Illegal execution exception
- Fault models / Defect testing
- Missing or incorrect transition: Resulting state is incorrect yet not corrupt
- Missing or incorrect action: Wrong results of executing transition
- Extra, missing or corrupt state: UB
- Sneak-path or illegal message failure
- Trap-door
- Composition failures: Conflict between sub- and superclass
- Test approaches
- Testing of possible faults from transitions and sequences
- Limit testing for state invariants and transition legality
- EFSM-based testing
- Specification into State machine and possibly Statechart
- Deriving of test case operation sequences from state machine
- Requirement of object state access for test oracle output verification
- Testing with inheritance
- Fault models
- Incorrect initialization
- Inadvertent bindings (shadowing)
- Naked access
- Incorrect location of subclass (Liskov)
- Naughty children: Illegal state for superclass (Liskov)
- Worm holes: Inconsistent state compared to superclass invariants
- Spaghetti inheritance
- Testing approaches
- Hierarchy flattening
- Test reuse (Testing history) related to superclass
- Testing with polymorphism
- Combinatorial explosion problem
- Testing approaches
- Polymorphic Server test
- Verification of Liskov compliance
- Testing of all inheritance related fault models without sequencing constraints
- Testing of all callee combinations
- Modal Hierarchy test
- Flattening of hierarchies
- Constraints on method sequencing
- Polymorphic Message test
- Inverse of Server test
- Testing of all caller combinations
- Testing with exception handlers
- Implicit control flow created by exceptions
- Impracticality to treat exception flow as normal flow due to sheer numbers
- Testing of each handler and utilizing throw / re-throw
- For non-local exception handlers: Enforcing & testing for design rule: methods with throw must be side-effect free
From Failures to Faults
- Brute Force: Step-debugging
- Binary Search
- Delta debugging
- Convergence on fault input with adaptive granularity
- Starts from the failing test case (top-down) or grows from a passing test case (bottom-up) over some set of input elements
- Granularity change possible through subset and complement
- Algorithm
- Split of set into n subsets
- Foreach subset: Test complement of subset
- Pass: Continue
- Fail
- split original set into max(2, N - 1) subsets
- N - 1 to skip previous test on subset complement
- Repeat loop with new subsets
- If no more complements, increase granularity
- End cycle if end of granularity increase reached
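The loop above can be sketched directly; `test` is a stand-in for re-running the failing test on a candidate input, and the failure condition at the bottom is a made-up example:

```python
def ddmin(inp, test):
    # Sketch of the ddmin loop described above: shrink a failing input
    # while test(x) still reports the failure (returns True).
    inp = list(inp)
    n = 2
    while len(inp) >= 2:
        chunk = len(inp) // n
        subsets = [inp[i * chunk:(i + 1) * chunk] for i in range(n - 1)]
        subsets.append(inp[(n - 1) * chunk:])   # last takes the remainder
        reduced = False
        for i in range(n):
            complement = [x for j in range(n) if j != i
                          for x in subsets[j]]
            if test(complement):                # complement still fails
                inp = complement
                n = max(n - 1, 2)               # skip the re-test, go on
                reduced = True
                break
        if not reduced:
            if n >= len(inp):                   # granularity exhausted
                break
            n = min(n * 2, len(inp))            # increase granularity
    return inp

# Hypothetical failure condition: the input fails iff it contains 3 and 7.
fails = lambda xs: 3 in xs and 7 in xs
print(ddmin(list(range(10)), fails))  # -> [3, 7]
```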
- Fault localization
- A posteriori hints (localization after test results)
- Based on program spectra
- Abstraction of program runs
- Single or multiple fault
- Collection of data providing specific view of dynamic behavior (block hit counters)
- Indication of correlation of block hits and failures
- Similarity coefficients $a_{ij}$: number of runs with block-hit status $i$ and failure status $j$ ($1$ = yes, $0$ = no)
- Jaccard: $\frac{a_{11}}{a_{11}+a_{01}+a_{10}}$
- Tarantula: $\frac{\frac{a_{11}}{a_{11}+a_{01}}}{\frac{a_{11}}{a_{11}+a_{01}}+\frac{a_{10}}{a_{10}+a_{00}}}$
- AMPLE: $\left|\frac{a_{11}}{a_{11}+a_{01}}-\frac{a_{10}}{a_{10}+a_{00}}\right|$
- Ochiai: $\frac{a_{11}}{\sqrt{(a_{11}+a_{01})*(a_{11}+a_{10})}}$
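The coefficients can be computed from block-hit spectra; the five-run spectrum below is toy data in which block 2 is hit by all failing runs and no passing run:

```python
# Spectrum-based fault localization: one row per test run, recording
# which blocks were hit and whether the run failed (toy data).
spectra = [
    ([1, 1, 0], False),
    ([1, 0, 1], True),
    ([0, 1, 1], True),
    ([1, 1, 1], True),
    ([1, 1, 0], False),
]

def counts(block):
    # a_ij: i = block hit (1/0), j = run failed (1/0).
    a11 = sum(1 for hits, f in spectra if hits[block] and f)
    a10 = sum(1 for hits, f in spectra if hits[block] and not f)
    a01 = sum(1 for hits, f in spectra if not hits[block] and f)
    a00 = sum(1 for hits, f in spectra if not hits[block] and not f)
    return a11, a10, a01, a00

def ochiai(block):
    a11, a10, a01, _ = counts(block)
    return a11 / ((a11 + a01) * (a11 + a10)) ** 0.5

def tarantula(block):
    a11, a10, a01, a00 = counts(block)
    fail_rate = a11 / (a11 + a01)
    pass_rate = a10 / (a10 + a00) if a10 + a00 else 0.0
    return fail_rate / (fail_rate + pass_rate)

for b in range(3):
    print(b, round(tarantula(b), 3), round(ochiai(b), 3))
```

Block 2 gets the maximal suspiciousness under both metrics, as expected from its perfect correlation with the failing runs.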
- Improvements
- Syntactic block granularity (try, if, method)
- Tackling confounding bias (dependency between elements)
- Re-ranking algorithm taking into account dynamic call graph and explicit data-dependency of failed tests
- Recursive inspection of parents of falsely identified faulty element in dynamic call graph
- A priori hints
- Utilizing data collected on bug-reports and their impact on code-properties
- Code complexity
- Problem domain
- Past history
- Process quality
- No a-priori correlation of metrics and faults
- Based on code metrics
- Based on code churn (code changes): Very good indicator
- Positively influencing factors
- Team membership
- Local vs distributed teams: insignificant
- Programming languages
- Domain
- Failure Clustering
- Few faults cause many tests to fail
- Group failures originating from the same fault and investigate a representative
- Coverage approach
- Usage of fault localization techniques to decide cluster number (similarity metrics)
- Hierarchical clustering of failing test cases by similar coverage profiles
- Representative as test closest to cluster center
- 80% reduction in analysis time when using function level granularity
- 100% of faults from representative-inspection
- Non-coverage approach using code-independent data
- Failed / passed history
- Broken / Repaired history
- Jira history
- Usage of regression model for clustering
- 60% reduction in analysis time
- 80% of faults from representative-inspection