Maximilian Stark (mail@dakror.de), WS2019
As of: 02.02.2020
Programming Languages
Concurrency: Memory Consistency
Memory Model
- Restricts implementation freedom to give guarantees in the presence of races
- Grants implementation freedom for program- and machine-level optimizations
Strict Consistency
- Idealistic model derived from sequential computing intuition
- Every read immediately returns the most recently written value, independent of which actor wrote it
- Physically impossible
- Problematic assumption of global time
Happened-Before Relation
- If event $p$ happened before $q$, then $p \rightarrow q$
- partial
- irreflexive
- transitive
- asymmetric
- $\require{cancel} a \cancel{\rightarrow} b := \neg (a \rightarrow b)$
- $p,q$ are concurrent if $\require{cancel} p \cancel{\rightarrow} q \wedge q \cancel{\rightarrow} p$
Ordering
- logical clock (function) $C$: assign globally unique timestamp $C(p)$ per event
- Clock condition: $\forall~ p,q: ~ p\rightarrow q \implies C(p) < C(q)$
- total order, embedding strict partial order $\rightarrow$
- set of all $C$ equates to set of possible executions in system
Sequential Consistency
- Same execution result when
- processors run in series
- individual processor actions adhere to internal order (execution path)
- Usage of Happened-Before for limited action interleaving
- Weakened by allowing concurrent access to unrelated data through modeling process per variable
- + Concise representation of all possible interleavings
- + Time-independent synchronization
- Realistic model for simple hardware
- Addition of exclusive memory access
- Computation speed-up through caches
- Issues with complex out-of-order hardware
- Inter-process optimization related pitfalls
MESI Protocol
- Caching to avoid RAM-access
- Necessity for update bookkeeping
- Model for differing execution speeds
- Characterization of well-behaved programs
- Applicability of Happened-Before for fixed program paths and total order between variable accesses
- Ensures sequential consistency for processors with cache
- Cache states
- I - invalid, for reuse
- S - shared, identical copies exist
- E - exclusive cache data location
- M - exclusive and modified
- Global cache consistency through messages
- Read: read request at an address
- Read Response: [E, S], response to Read with requested data
- Invalidate: cache line evict broadcast
- Invalidate Acknowledge: confirmation reply to Invalidate
- Read Invalidate: Read + Invalidate (read with intent to modify)
- WriteBack: [M], Read Response, modification broadcast, set sender state to [S], also writes to RAM
- Communication between processors
- > Read Invalidate
- < Read Response / WriteBack, can also come from RAM
- < Invalidate Acknowledge
Out-Of-Order Stores
- Avoid stall-after-write
- Writes stored in temporary store buffer
- Applying of stores (writing of data)
- FIFO (Sparc/x86-TSO)
- unordered (Sparc PSO)
- Continuous local observation of program order
- Overriding of read values by newest store value
TSO (total store order)
- total
- memory order ($\sqsubseteq$)
- embedded program order ($\leq$)
- embedded preceding loads
- loaded values determined by latest locally observed write
- not guaranteed: store $\leq$ load $\implies$ store $\sqsubseteq$ load
TSO Example: x86
- FIFO buffers for strong consistency
- mfence() to enforce SC between write & load (slow)
PSO (partial store order)
- total
- embedded fenced stores in program order
- embedded stores to same address in program order
- embedded preceding loads
- loaded values determined by latest locally observed write
- not guaranteed: memory order of stores related to program order
RMO (relaxed memory order)
- Immediate Acknowledge responses
- Invalidation queue
- Delay of MESI message sending until line-invalidation
- Neglecting of invalidation queue for local loads / stores
- embedded fenced loads in program order
- embedded stores to same address in program order
- embedded load-depending operations ($\rightarrow$)
- same-address store $\rightarrow$ load
- register read $\rightarrow$ latest write
- stores in branched blocks $\rightarrow$ branch conditionals
- loaded values determined by latest write
Explicit Synchronization: Memory barriers
- most low-level way of synchronization
- Write Barrier
- Manually inserted
- Forcing flush of pending stores: sfence()
- Read Barrier
- Mitigate ignorance of local invalidate queue
- Avoidance of out-of-date read
- Forcing invalidate queue flush: lfence()
- Insertion before every read → SC (slow)
- Effective usage: read barrier in process B for every write barrier in process A
- + no thread de-scheduling when blocking
- + coordination of processes with synchronized variables
- - difficult to get right
- - often slower than locks
- - many fences costly if store buffers / invalidate queue are the bottleneck
Dekker Algorithm
- Mutual process exclusion by busy waiting
- Shared variables
- bool[] flags (locks)
- int turn
- Dependence on SC → RMO: Memory barriers for all common variables
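The algorithm can be sketched in Python (illustrative only: CPython's GIL makes execution effectively sequentially consistent, so the memory barriers that RMO would require are omitted; variable names follow the notes):

```python
import threading

# Dekker's algorithm for two threads: mutual exclusion by busy waiting.
flags = [False, False]  # flags[i]: thread i wants to enter the critical section
turn = 0                # whose turn it is to yield when both want in
counter = 0             # shared resource protected by the critical section

def worker(me):
    global turn, counter
    other = 1 - me
    for _ in range(10000):
        flags[me] = True
        while flags[other]:          # busy wait while the other wants in
            if turn == other:
                flags[me] = False    # back off: it is the other's turn
                while turn == other:
                    pass
                flags[me] = True
        counter += 1                 # critical section
        turn = other                 # hand over the turn
        flags[me] = False

threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 20000: mutual exclusion held, no increment was lost
```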
Future Many-Core Systems
- non-uniform memory access (NUMA):
- partitioning of memory amongst CPUs
- lookup-directory
- possible partial processor interconnectedness
- Intel MESIF (forward)
- Reduction of communication overhead: read response bus congestion
- Local bus, global point-to-point links
Atomicity and Locks
- Common atomic instructions
- exchange of memory cell with register
- compare-and-swap of register with memory cell:
r=(k==i); if(r) i=j;
- fetch-and-add on integers in memory
- modify-and-test on bits in memory
- Wait-Free updates: Locking of cache bus for instruction (ASM lock prefix)
- atomic-blocks: made-up keyword for usage of corresponding atomic instructions
- Wait-Free synchronization
- No thread starvation possible
- no blocking
- no failure
- no control flow possible
- usage of conditional instructions
- limited to single atomic instruction
- common building blocks for algorithms that can fail
- Lock-free algorithms
- thread starvation possible
- no blocking
- failure possible
- General recipe
- given compare-and-swap for $n$ bytes ($n=8$ on Intel)
- group invariant-related variables into $n$ bytes
- read bytes atomically
- compute new value
- compare-and-swap (validation against concurrent modification)
- New value must be repeatable / pure
- Used as building blocks for locks
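The general recipe can be sketched as follows. Python exposes no hardware CAS, so `AtomicCell` is a hypothetical stand-in whose internal lock only emulates the atomicity of a single compare-and-swap instruction:

```python
import threading

class AtomicCell:
    """Emulates an n-byte memory cell with compare-and-swap.
    (The internal lock only stands in for the hardware atomicity.)"""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(cell):
    # The recipe: read atomically, compute the new value purely,
    # validate with CAS, retry on concurrent modification.
    while True:
        old = cell.load()
        new = old + 1          # pure / repeatable computation
        if cell.compare_and_swap(old, new):
            return

cell = AtomicCell(0)
threads = [threading.Thread(target=lambda: [increment(cell) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.load())  # 4000
```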
Lock structures
- Locks: Singular entry to CS
- Semaphore: Resource availability counter
- wait(): Wait until able to atomically access and reduce resource counter → access granted
- signal(): Atomically increase resource counter
- Mutex: = Binary Semaphore, protect single resource
- Monitor
- automated Mutex, acquire on enter, release on exit of function
- solution for recursive calls: ignore if current thread already acquired
- keep count of recursive calls to release at right time
- = Java's synchronized method
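A minimal monitor sketch in Python: `threading.RLock` keeps the recursion count described above, so re-entrant calls by the owning thread are simply admitted (`Account` and its methods are invented examples):

```python
import threading

# Monitor pattern: every public method acquires the same mutex on entry
# and releases it on exit; the RLock counts recursive acquisitions so
# the lock is only released once the outermost call returns.
class Account:
    def __init__(self):
        self._monitor = threading.RLock()
        self.balance = 0

    def deposit(self, amount):
        with self._monitor:
            self.balance += amount

    def transfer_bonus(self, amount):
        with self._monitor:
            self.deposit(amount)        # recursive entry: count goes to 2
            self.deposit(amount // 10)  # and back down on return

acc = Account()
acc.transfer_bonus(100)
print(acc.balance)  # 110
```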
- Condition Variables
- improvement over monitors: signaling across thread boundaries to avoid spin-locks
- wait()
- called inside monitor
- briefly releases and blocks
- upon signal(), re-acquires and returns
- signal() / notify(): wake-up call for waiting threads
- signal-and-urgent-wait: signaling thread suspends until signaled completes
- suspension queue has higher priority than new-entering thread queue
- signal-and-continue: signaling thread continues, enter on monitor release
- no separate suspension queue
- signal is called notify
- notify ~= wake up some waiting thread (no guarantee which one)
- notifyAll ~= notify all threads
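A bounded-buffer sketch with Python's `threading.Condition`, which follows signal-and-continue semantics; hence the predicate re-check loop around `wait()`:

```python
import threading, collections

buffer, CAP = collections.deque(), 2
cond = threading.Condition()

def produce(item):
    with cond:
        while len(buffer) == CAP:   # re-check after wake-up
            cond.wait()             # releases monitor, blocks, re-acquires
        buffer.append(item)
        cond.notify_all()           # notifyAll: wake every waiting thread

def consume():
    with cond:
        while not buffer:
            cond.wait()
        item = buffer.popleft()
        cond.notify_all()
        return item

results = []
t = threading.Thread(target=lambda: results.extend(consume() for _ in range(4)))
t.start()
for i in range(4):
    produce(i)
t.join()
print(results)  # [0, 1, 2, 3]
```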
Deadlocks
- Necessary conditions for deadlock (Coffman et al.)
- Mutual exclusion
- Hold and wait: holding resources while waiting for more
- No preemption: resources can't be taken away
- Circular wait: cycle of threads waiting for each other
- Only avoided by proper design
- Lock-free algorithms
- Locking algorithms with partially ordered locks
- Lock sets $L$
- $\lambda(p)\subseteq L$ lock set at program point $p$
- Lock order $\triangleleft \subseteq L \times L$: $l \triangleleft l' \iff l \in \lambda(p)$ at the program point $p$ where $l'$ is acquired; $\prec ~=~ \triangleleft^+$
- Transitive closure $\sigma \subseteq X \times X;~ \sigma^+ = \cup_{i\in\mathbb{N}} \sigma^i ;~ \sigma^0 = \sigma;~\sigma^{i+1} = \{\langle x_1,x_3 \rangle ~|~ \exists x_2 \in X . \langle x_1,x_2 \rangle\in \sigma^i \wedge \langle x_2, x_3 \rangle \in \sigma^i \} \cup \sigma^i$
- Freedom of deadlock: $\nexists a\in L: a \prec a$
- Freedom of deadlock for monitors
- $L = L_{\text{Semaphores}} \cup L_{\text{Monitors}}$
- $(\forall a \in L_S. a \nprec a) \wedge (\forall a \in L_M, b\in L. a \prec b \wedge b \prec a \implies a = b)$
- Static deadlock avoidance guarantee
- Summary of locks / monitors with multiple instances into single $\overline{a}\in L_M$
- Requirement of $\forall \overline{a} . \overline{a} \nprec \overline{a}$
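The partially-ordered-locks idea can be sketched as follows; `lock_order` and `acquire_in_order` are hypothetical helpers that normalize every acquisition to one global order, so no cycle $a \prec \dots \prec a$ can form:

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
lock_order = {id(lock_a): 0, id(lock_b): 1}  # the global order on L

def acquire_in_order(*locks):
    # Always acquire in ascending order, regardless of the request order.
    ordered = sorted(locks, key=lambda l: lock_order[id(l)])
    for l in ordered:
        l.acquire()
    return ordered

def release_all(locks):
    for l in locks:
        l.release()

log = []
def worker(name, first, second):
    held = acquire_in_order(first, second)  # request order is normalized here
    log.append(name)
    release_all(held)

t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))  # reversed request
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(log))  # both threads finish: no circular wait possible
```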
Transactional Memory
- Zombie Transaction: Transactions propagating inconsistent state
- Opacity: Ability for failing transactions to see consistent view of memory
- Single Lock atomicity (SLA)
- Execution of transactions as if guarded by a single globally shared lock
- Weaker progress guarantee than mere transactions
- Correctness requirement too strong in practice
- atomic acts as barrier
- Transactional sequential consistency (TSC)
- strong isolation but permission of concurrent execution of transactions
- prevents reordering of accesses within transaction
- In practice: allowance of race-free reordering
ACID
- Atomicity
- Consistency
- Isolation
- Strong ~: Retaining order of accesses to TM and non-TM
- Weak ~: Memory guarantees only within atomic blocks
- Classic race condition scenario
- Durability
Software Transactional Memory (STM)
- Translation of atomic blocks into transactions
- Storage of transaction descriptor: atomic-block meta-data
- undo-log: roll back on failure
- redo-log: postponed until commit
- read- & write-set: locations accessed
- read- & write-version: timestamps of accesses
- Example: TL2 STM implementation
- provides opacity
- lazy versioning: postponing of all writes using redo-log
- validating conflict detection: access of modified address aborts transaction
- can cause unnecessary aborts due to non-atomicity of individual accesses
- comparison of data-timestamps
- still requirement of locks for writing values → danger of deadlock
- can cause contention on commit
- improvement: collect new data in object, atomically commit pointer exchange
- can cause contention on global clock
- Integration of non-TM resources
- Prohibiting: Rejection by compiler
- Execution: I/O Might only happen on some runs, abort if direct I/O happens (usually buffered)
- Irrevocable Execution: Universal way for non-undoable operations: cause conflict in all other transactions to guarantee termination
- Integration: Re-write code to be transactional
Hardware Transactional Memory (HTM)
- Additional hardware to track read- & write-sets
- Requirement of software backup in scenarios exceeding hardware capabilities
- eager conflict detection using dedicated hardware
- separate transaction instructions to distinguish from STM
- Problematic mixing of TM and non-TM instructions
- Implicit Transactional Memory
- Marking of only beginning and end of transaction
- Transparent usage of regular instructions
- Only difference: caching and transactional processing of memory accesses
- Hardware / OS access abort transaction
- Ensures strong isolation
- Examples
- AMD Advanced Synchronization Facilities (ASF)
- Definition of logical speculative region
- LOCK MOV: explicit data transfer between normal memory and speculative region
- Used to implement larger atomic operations
- Intel TSX in Skylake
- Ability to use regular instructions with transactions
- Tracking of read / write sets using single transaction bit on cache line
- Storage space for backup of complete CPU state
- simple counter to support nested transactions
- spontaneous aborts in case of missing resources possible
- abort of inner transaction aborts whole tree
- Software interfaces provided
- Restricted Transactional Memory (RTM)
- Hardware Lock Elision (HLE)
Restricted Transactional Memory (RTM)
- No progress guarantee
xbegin
- increment counter $C$
- backup registers
- flush store buffer
- subsequent reads / writes set transaction bit $T$ if $C > 0$
- abort on Invalidate message and already set $T$ flag
- abort on observed read on a modified cache line with set $T$ flag
- available through library function _xbegin()
- Requirement of fallback code in case of failure
- Isolation from transaction
- use mutex in fallback code
- verify free mutex inside transaction, otherwise abort
- Lock can be avoided through transaction → Lock Elision
- Atomic execution
xabort
- Clear all $T$ flags
- Clear store buffer
- invalidate all former TM lines
- Set $C = 0$
- Restore CPU from backup
xend
- Decrement $C$
- If $C = 0$: clear all $T$ flags, flush store buffer
xtest: test if currently inside transaction
Hardware Lock Elision (HLE)
- Can be implemented in software using RTM
- Deferring of lock acquisition by default
- Relying on HTM to sort out concurrency conflicts
- In case of persisting conflicts fallback to locks
- Support of legacy lock code by imitating semaphore value change
- Usage of a buffer analogous to the store buffer for elided locks
- Special annotations around lock instructions required
xacquire
- Prefix for semaphore-increment instructions
- Ensures shared / exclusive cache line with $T$
- Issues xbegin
- Stores modified lock value in elided-lock buffer
xrelease
- Prefix for semaphore reset instructions
- Same effects as xend
- Flush elided-locks buffer
- For full elision all wait & signal need to be annotated
Outlook on TM
- Non-blocking message passing (actor model)
- Blocking message passing (CSP, $\pi$-Calculus)
- (immediate) priority ceiling
Actual Programming Language Topics
Dispatching
- Single dispatching
- Matching most specific method
- Searching at runtime for dynamic methods
- Multi-dispatching
- Overloads with different parameter types legal
- Multi-methods: generation of overloads
- Additional overhead at runtime
- Increased ambiguity risk and runtime failure
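A toy multi-dispatch sketch in Python; the `multimethod` decorator is invented for illustration, matches exact argument types only, and shows both the runtime lookup and the runtime-failure risk:

```python
# Registry mapping (name, argument-type tuple) to an implementation.
_registry = {}

def multimethod(*types):
    def register(fn):
        _registry[(fn.__name__, types)] = fn
        def dispatch(*args):
            # Runtime lookup on the dynamic types of ALL arguments.
            key = (fn.__name__, tuple(type(a) for a in args))
            impl = _registry.get(key)
            if impl is None:
                raise TypeError("no matching overload")  # runtime failure
            return impl(*args)
        return dispatch
    return register

@multimethod(int, int)
def combine(a, b): return a + b

@multimethod(str, str)
def combine(a, b): return a + " " + b

print(combine(1, 2))       # 3
print(combine("a", "b"))   # a b
```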
Inheritance Schools
- Implementation inheritance
- Code re-usage
- Child as specialization using library-esque parents
- C++, Lisp
- Interface inheritance
- Behavior contracts
- Providing of methods defined by the parent
- Parent as generic framework
- Java, C#
- Smalltalk inheritance
- Children dominate parents
- established reference to parent
- Java, C#
- Beta inheritance
- Parent dominates children
- Giving up control to children using explicit keyword inner
- Children extend using explicit keyword extension
Inheritance formalized
- Symmetric join $a \sqcup b$
- Asymmetric join $a ~\mathrm{asym}~ b$: prefer first if exists
- Smalltalk inheritance: $a \triangleright b : C \times C \mapsto C = \{super \mapsto b\} ~\mathrm{asym}~ (a ~\mathrm{asym}~ b)$
Inheritance Implementation
- Caching of dispatcher resolution (Hotspot JVM)
- Precomputation: Virtual tables
- Collection of function pointers to virtual method sites
- Components
- Offset to top of enclosing memory object (for multiple inheritance)
- RTTI data
- Signature to function pointer mapping
- Differences in multiple inheritance
- Casted pointers are modified to cast-target type
- Thunks: trampoline methods delegating virtual call to base function in parent with adapted this-pointer on vtable lookup mismatch
- Standard C++ behavior for common ancestors: Duplication of representations → Can result in ambiguity
- Optional behavior
- shared ancestor representation (virtual in extends statement) → Virtual Base Class, at bottom of object in memory
- Additional offset to virtual base in vtable components
- Need for dynamic_cast due to variable offset inside vtable
- Grows quickly out of control → Need for vtable for vtable
- Virtual base pointer
- virtual call offsets for adjusting this-pointer dynamically
- offset to top
- RTTI
- virtual function pointers
- virtual Thunks
- Ambiguity resolution in multiple inheritance
- Linearization through compiler-made Method Resolution Order (MRO)
- Best effort approach
- Core principles
- Inheritance Relation: $C(A,B) \implies C ~\mathrm{inherits}~ A \wedge C ~\mathrm{inherits}~ B$
- Multiplicity Relation: $C(A,B) \implies A \rightarrow B$ ($\rightarrow$ creates total order among multiple parents)
- Leftmost Preorder DFS (LPDFS)
- violation of principle 1
- used in legacy python ($\leq 2.1$)
- Example: $A(B,C);~B(W);~C(W):~ L[A] = ABWC$
- LPDFS with Duplicate Cancellation
- Flow: $ABWCW$ → Cancellation of first $W$
- Example 1: $A(B,C);~B(W);~C(W):~ L[A] = ABCW$
- Example 2: $A(B,C);~B(V,W);~C(W,V):~ L[A] = ABCWV$
- Principle 2 unfulfillable
- Conflict: $B$ requires $V \rightarrow W$ but $C$ requires $W \rightarrow V$
- Reverse Postorder Rightmost DFS (RPRDFS)
- Example 1: $A(B,C);~B(F,D);~C(E,H);~D(G);~E(G);~F(W);~G(W);~H(W):~L[A]=ABFDCEGHW$
- Refinement for conflict resolution: Addition of conflicting edge and re-run
- No guarantee about monotonicity
- Monotonicity Principle: Enforce same order among parents for recurring elements in any child's parents in the tree
- C3 Linearization
- Detection of monotonicity violation
- $L[C] = C \cdot \bigsqcup (L[B_1],...,L[B_n],B_1 \cdot ... \cdot B_n) ~|~ C(B_1,...,B_n)$, $B$ superclasses of $C$
- $\bigsqcup_i(L_i)= (\exists_{\min k} \forall j. ~ c = \mathrm{head}(L_k) \not\in \mathrm{tail}(L_j)) ~?~ (c \cdot (\bigsqcup_i(L_i \setminus c))) ~:~ \mathrm{FAIL}$
- $\mathrm{head}$: first element of a list
- $\mathrm{tail}$: list without its first element
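Python's own MRO uses C3 linearization, so the definitions above can be checked directly:

```python
# Diamond from the examples: A(B,C); B(W); C(W)
class W: pass
class B(W): pass
class C(W): pass
class A(B, C): pass

# L[A] = A · (L[B] ⊔ L[C] ⊔ B·C) = A B C W object
print([cls.__name__ for cls in A.__mro__])  # ['A', 'B', 'C', 'W', 'object']

# A monotonicity conflict makes the merge FAIL; Python raises TypeError:
class V: pass
class X(V, W): pass   # X demands V → W
class Y(W, V): pass   # Y demands W → V
try:
    class Z(X, Y): pass
except TypeError as e:
    print("FAIL:", e)
```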
Mixins
- Formally
- $mixin(c) : C \mapsto (C \mapsto C) = \lambda x . c \triangleright x$
- Mixin-Composition: $mixin \circ mixin$
- Concept to reuse code more efficiently
- native support for super
- Composable
- Hassle-free alternative to multiple inheritance
- Linearization requirement: order is important
- Sequential joining with class
- Approximation using C++
- Realized through templated superclasses
- Not modular enough type system
- Example language: Ruby
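Ruby is the example language, but the same composition can be sketched with Python's multiple inheritance, where the MRO supplies the linearization and `super()` the native super support (class names are invented):

```python
class Greeter:
    def describe(self):
        return "entity"

class LoudMixin:
    def describe(self):
        return super().describe().upper()   # delegates along the MRO

class PoliteMixin:
    def describe(self):
        return "dear " + super().describe()

# Linearization requirement: the order of the mixins matters.
class LoudPolite(LoudMixin, PoliteMixin, Greeter): pass
class PoliteLoud(PoliteMixin, LoudMixin, Greeter): pass

print(LoudPolite().describe())  # DEAR ENTITY
print(PoliteLoud().describe())  # dear ENTITY
```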
Traits
- Alternative to often misused Implementation Inheritance (Liskov Substitution Principle)
- Equivalent to a class without attributes
- Separation of object creation from modeling hierarchies and functionality composition
- Interfaces to model hierarchy
- Traits to assemble functionality
- Classes as frames for entities
- Support for method exclusion and aliasing for conflict resolution
- Composition principles
- Flat ordering
- Same precedence of all Traits
- Manual conflict resolution
- Precedence: Class methods prioritized over Traits
- Flattening
- After asymmetric join: non-overridden Trait methods turn into class methods
- Approximations
- C# Extension methods
- Static first parameter this, drawing in target object
- No flattening
- Only static scope
- Boilerplate empty interface definitions
- Cannot override abstract signatures
- Java Virtual Extension Methods (Interface default)
- additional interface phase in invokevirtual name resolution
- Still no support for overriding abstract methods
- Example language: Squeak
Prototype-based Languages
- Lua
- Everything is a table
- Every function is static
- Magic index lookup function for delegation
- Meta tables
- OOP through explicit self parameter for functions
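The delegation mechanism can be imitated in Python; `index`, `__metatable` and `__index` below are names borrowed from Lua for illustration, not a real Python API:

```python
# An "object" is just a dict; a missing key is delegated to the
# metatable's __index, mirroring Lua's magic index lookup.
def index(obj, key):
    while obj is not None:
        if key in obj:
            return obj[key]
        meta = obj.get("__metatable")
        obj = meta.get("__index") if meta else None
    raise KeyError(key)

prototype = {"greet": lambda self: "hello, " + self["name"]}
instance = {"name": "world", "__metatable": {"__index": prototype}}

# Method call with explicit self parameter, as in Lua's obj:method():
print(index(instance, "greet")(instance))  # hello, world
```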
Aspect Oriented Programming
- Separation of (cross-cutting) concerns
- Security
- Logging
- Error handling
- Validation
- Profiling
- Weaving: fuzzy combination of aspects to form code modules
- Example: AspectJ
- Cross-cutting types
- Static: Adding common external definitions for multiple classes
- Dynamic
- well-defined join points in control flow of program
- Method call & execution
- field get & set
- exception handler execution
- class init
- object init
- Type based
- target(Type)
- within(Type)
- withincode(Method)
- Flow / State based
- Pointcut: Predicate for join point execution
- Pointcut designator: definition of pointcut by programmer
- Advice: method-like construct used to define additional behavior at join points
- before()
- after()
- after() returning()
- after() throwing()
- Binding of parameters: args()
- around(): Wrapping of join point
- need to specify wrapping location with proceed()
- Scope-issues
- Expiration of proceed() scope
- Need for shadow-classes: Cloning of proceed() context
- Determining order of execution
- between different aspects
- need for precedence declaration
- following aspect hierarchy
- undefined
- within same aspect
- after advice: the later definition takes precedence
- otherwise: the first definition takes precedence
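A rough Python approximation of before/after advice using decorators (explicit weaving at the definition site, no pointcut matching; `logging_aspect` is a made-up name, not AspectJ):

```python
import functools

log = []

def logging_aspect(fn):
    @functools.wraps(fn)
    def woven(*args, **kwargs):
        log.append(("before", fn.__name__))            # before() advice
        try:
            result = fn(*args, **kwargs)
            log.append(("after returning", fn.__name__))
            return result
        except Exception:
            log.append(("after throwing", fn.__name__))
            raise
    return woven

@logging_aspect
def divide(a, b):
    return a / b

print(divide(6, 2))  # 3.0
print(log)           # before / after-returning entries for 'divide'
```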
Metaprogramming
- Code-Generation
- Preprocessor
- Homoiconic: Code = Data
- Clojure: Runtime metaprogramming
- Macros as static AST transformations, uninterpreted parameters
- functions as runtime control flow manipulations
- Macro Hygiene: Avoiding variable shadowing from Macro-generated code
- Reflection
- Metaobject Protocol
- LISP: CLOS
- Interface to manipulate systems and code structures
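Python's runtime reflection gives a lightweight flavor of such an interface (this is not CLOS; `Point` and `norm1` are invented examples):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
print(type(p).__name__)   # inspect the class of an object: Point
print(vars(p))            # inspect its structure: {'x': 1, 'y': 2}

# Manipulate code structure at runtime: attach a method to the class.
setattr(Point, "norm1", lambda self: abs(self.x) + abs(self.y))
print(p.norm1())          # 3

# Create a whole class programmatically via the metaclass `type`:
Point3D = type("Point3D", (Point,), {"z": 0})
print(Point3D.__bases__[0].__name__)  # Point
```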
Generators and Exceptions
- Stack traversal
- goto
- setjmp & longjmp: Jump across procedure boundary, storing current context
- Stack unwinding
- Destruction of stack-elements at scope end
- Table of catchable exceptions per method and cleanup table
- Back-traversal of stack on throw in search for handler
- Personality routine: Cleanup checker per function
- Context Switching
- makecontext & swapcontext
- Need to create stack and register set for new context
- Acquisition of current context for initialization: getcontext
- Possibility to register successor context, called in swapcontext
- Coroutines
- Stackless: EcmaScript 6+
- Stackful: Lua
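Stackless coroutines as in EcmaScript 6 can be mirrored with Python generators, which suspend at each yield and are resumed by the caller:

```python
# A generator suspends its (stackless) frame at every yield.
def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a

gen = fib()
first = [next(gen) for _ in range(6)]
print(first)  # [1, 1, 2, 3, 5, 8]

# send() passes a value INTO the suspended coroutine:
def accumulator():
    total = 0
    while True:
        total += yield total

acc = accumulator()
next(acc)            # prime: run to the first yield
print(acc.send(5))   # 5
print(acc.send(3))   # 8
```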
- Continuations
- Haskell
- Context abstraction as a first-class language feature
- Continuation Passing Style (CPS)
- Instead of directly returning, a supplied function is called with computed result
- Included in method signature
- Direct: square :: Int -> Int
- CPS: square :: Int -> ((Int -> r) -> r)
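The signatures above, transcribed to Python: instead of returning directly, the CPS version hands its result to a supplied continuation `k` (`bind` and `pythagoras` are illustrative names mimicking the Cont monad's combinator):

```python
def square(x):
    return x * x                  # direct style: Int -> Int

def square_cps(x):
    return lambda k: k(x * x)     # CPS: Int -> ((Int -> r) -> r)

def add_cps(x, y):
    return lambda k: k(x + y)

# Chaining suspended computations, analogous to the Cont monad's >>=:
def bind(s, f):
    return lambda k: s(lambda x: f(x)(k))

pythagoras = bind(square_cps(3), lambda a:
             bind(square_cps(4), lambda b:
             add_cps(a, b)))
print(pythagoras(lambda r: r))  # 25, by running with identity as final k
print(square(5))                # 25
```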
- Cont Type constructor
- Cont as suspended computation data type
- cont :: ((a -> r) -> r) -> Cont r a (suspended computation from CPS function)
- runCont :: Cont r a -> (a -> r) -> r (run a suspended computation with a given final function)
- Monads
- type class for arbitrary type constructors
- Definition of return method
- Definition of combinator function bind (>>=)
- Do-Notation: syntactic sugar for imperative-like code
- Cont-Monad
return x = cont (\k -> k x)
s >>= f = cont (\k -> runCont s (\x -> runCont (f x) k))
- Tail call optimization
- Reusing of existing stack frame and parameters
- Jump to called function instead of creating a new stack frame