Maximilian Stark (mail@dakror.de), WS2019
As of: 02.02.2020
Programming Languages
Concurrency: Memory Consistency
Memory Model
- Restricts implementation freedom to give guarantees in the presence of races
- Grants implementation freedom for program- and machine-level optimizations
Strict Consistency
- Idealistic model derived from sequential computing intuition
- Every read immediately returns the most recently written value, independent of which actor wrote it
- Physically impossible
- Problematic assumption of global time
Happened-Before Relation
- If event $p$ happened before $q$, then $p \rightarrow q$
- partial
- irreflexive
- transitive
- asymmetric
- $\require{cancel} a \cancel{\rightarrow} b := \neg (a \rightarrow b)$
- $p,q$ are concurrent if $\require{cancel} p \cancel{\rightarrow} q \wedge q \cancel{\rightarrow} p$
Ordering
- logical clock (function) $C$: assign globally unique timestamp $C(p)$ per event
- Clock condition: $\forall~ p,q: ~ p\rightarrow q \implies C(p) < C(q)$
- total order, embedding strict partial order $\rightarrow$
- set of all $C$ equates to set of possible executions in system
Sequential Consistency
- Same execution result when
- processors run in series
- individual processor actions adhere to internal order (execution path)
- Usage of Happened-Before for limited action interleaving
- Weakened by allowing concurrent access to unrelated data through modeling process per variable
- + Concise representation of all possible interleavings
- + Time-independent synchronization
- Realistic model for simple hardware
- Addition of exclusive memory access
- Computation speed-up through caches
- Issues with complex out-of-order hardware
- Inter-process optimization related pitfalls
MESI Protocol
- Caching to avoid RAM-access
- Necessity for update bookkeeping
- Model for differing execution speeds
- Characterization of well-behaved programs
- Applicability of Happened-Before for fixed program paths and total order between variable accesses
- Ensures sequential consistency for processors with cache
- Cache states
- I - invalid, for reuse
- S - shared, identical copies exist
- E - exclusive cache data location
- M - exclusive and modified
- Global cache consistency through messages
- Read: read request at an address
- Read Response: [E, S], response to Read with requested data
- Invalidate: cache line evict broadcast
- Invalidate Acknowledge: confirmation reply to Invalidate
- Read Invalidate: Read + Invalidate (read with intent to modify)
- WriteBack: [M], Read Response, modification broadcast, set sender state to [S], also writes to RAM
- Communication between processors
- > Read Invalidate
- < Read Response / WriteBack, can also come from RAM
- < Invalidate Acknowledge
Out-Of-Order Stores
- Avoid stall-after-write
- Writes stored in temporary store buffer
- Applying of stores (writing of data)
- FIFO (Sparc/x86-TSO)
- unordered (Sparc PSO)
- Continuous local observation of program order
- Overriding of read values by newest store value
TSO (total store order)
- total
- memory order ($\sqsubseteq$)
- embedded program order ($\leq$)
- embedded preceding loads
- loaded values determined by latest locally observed write
- not guaranteed: store $\leq$ load $\implies$ store $\sqsubseteq$ load
TSO Example: x86
- FIFO buffers for strong consistency
- mfence() to enforce SC between write & load (slow)
PSO (partial store order)
- total
- embedded fenced stores in program order
- embedded stores to same address in program order
- embedded preceding loads
- loaded values determined by latest locally observed write
- not guaranteed: memory order of stores related to program order
RMO (relaxed memory order)
- Immediate Acknowledge responses
- Invalidation queue
- Delay of MESI message sending until line-invalidation
- Neglecting of invalidation queue for local loads / stores
- embedded fenced loads in program order
- embedded stores to same address in program order
- embedded load-depending operations ($\rightarrow$)
- same-address store $\rightarrow$ load
- register read $\rightarrow$ latest write
- stores in branched blocks $\rightarrow$ branch conditionals
- loaded values determined by latest write
Explicit Synchronization: Memory barriers
- most low-level way of synchronization
- Write Barrier
- Manually inserted
- Forcing flush of pending stores: sfence()
- Read Barrier
- Mitigate ignorance of local invalidate queue
- Avoidance of out-of-date read
- Forcing invalidate queue flush: lfence()
- Insertion before every read → SC (slow)
- Effective usage: read barrier in process B for every write barrier in process A
- + no thread de-scheduling when blocking
- + coordination of processes with synchronized variables
- - difficult to get right
- - often slower than locks
- - many fences costly if store buffers / invalidate queue are the bottleneck
Dekker Algorithm
- Mutual process exclusion by busy waiting
- Shared variables
- bool[] flags (locks)
- int turn
- Dependence on SC → RMO: Memory barriers for all common variables
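The algorithm can be sketched in Python (illustrative only: CPython's GIL makes execution effectively sequentially consistent, so the memory barriers that RMO would require are omitted; variable names follow the notes):

```python
import threading

# Dekker's algorithm for two threads: mutual exclusion by busy waiting.
flags = [False, False]  # flags[i]: thread i wants to enter the critical section
turn = 0                # whose turn it is to yield when both want in
counter = 0             # shared resource protected by the critical section

def worker(me):
    global turn, counter
    other = 1 - me
    for _ in range(10000):
        flags[me] = True
        while flags[other]:          # busy wait while the other wants in
            if turn == other:
                flags[me] = False    # back off: it is the other's turn
                while turn == other:
                    pass
                flags[me] = True
        counter += 1                 # critical section
        turn = other                 # hand over the turn
        flags[me] = False

threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 20000: mutual exclusion held, no increment was lost
```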
Future Many-Core Systems
- non-uniform memory access (NUMA):
- partitioning of memory amongst CPUs
- lookup-directory
- possible partial processor interconnectedness
- Intel MESIF (forward)
- Reduction of communication overhead: read response bus congestion
- Local bus, global point-to-point links
Atomicity and Locks
- Common atomic instructions
- exchange of memory cell with register
- compare-and-swap of register with memory cell:
r=(k==i); if(r) i=j;
- fetch-and-add on integers in memory
- modify-and-test on bits in memory
- Wait-Free updates: Locking of cache bus for instruction (ASM lock prefix)
- atomic-blocks: made-up keyword for usage of corresponding atomic instructions
- Wait-Free synchronization
- No thread starvation possible
- no blocking
- no failure
- no control flow possible
- usage of conditional instructions
- limited to single atomic instruction
- common building blocks for algorithms that can fail
- Lock-free algorithms
- thread starvation possible
- no blocking
- failure possible
- General recipe
- given compare-and-swap for $n$ bytes ($n=8$ on Intel)
- group invariant-related variables into $n$ bytes
- read bytes atomically
- compute new value
- compare-and-swap (validation against concurrent modification)
- New value must be repeatable / pure
- Used as building blocks for locks
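The general recipe can be sketched as follows. Python exposes no hardware CAS, so `AtomicCell` is a hypothetical stand-in whose internal lock only emulates the atomicity of a single compare-and-swap instruction:

```python
import threading

class AtomicCell:
    """Emulates an n-byte memory cell with compare-and-swap.
    (The internal lock only stands in for the hardware atomicity.)"""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(cell):
    # The recipe: read atomically, compute the new value purely,
    # validate with CAS, retry on concurrent modification.
    while True:
        old = cell.load()
        new = old + 1          # pure / repeatable computation
        if cell.compare_and_swap(old, new):
            return

cell = AtomicCell(0)
threads = [threading.Thread(target=lambda: [increment(cell) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.load())  # 4000
```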
Lock structures
- Locks: Singular entry to CS
- Semaphore: Resource availability counter
- wait(): Wait until able to atomically access and reduce resource counter → access granted
- signal(): Atomically increase resource counter
- Mutex: = Binary Semaphore, protect single resource
- Monitor
- automated Mutex, acquire on enter, release on exit of function
- solution for recursive calls: ignore if current thread already acquired
- keep count of recursive calls to release at right time
- = Java's synchronized method
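A minimal monitor sketch in Python: `threading.RLock` keeps the recursion count described above, so re-entrant calls by the owning thread are simply admitted (`Account` and its methods are invented examples):

```python
import threading

# Monitor pattern: every public method acquires the same mutex on entry
# and releases it on exit; the RLock counts recursive acquisitions so
# the lock is only released once the outermost call returns.
class Account:
    def __init__(self):
        self._monitor = threading.RLock()
        self.balance = 0

    def deposit(self, amount):
        with self._monitor:
            self.balance += amount

    def transfer_bonus(self, amount):
        with self._monitor:
            self.deposit(amount)        # recursive entry: count goes to 2
            self.deposit(amount // 10)  # and back down on return

acc = Account()
acc.transfer_bonus(100)
print(acc.balance)  # 110
```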
- Condition Variables
- improvement over monitors: signaling across thread boundaries to avoid spin-locks
- wait()
- called inside monitor
- briefly releases and blocks
- upon signal(), re-acquires and returns
- signal() / notify(): wake-up call for waiting threads
- signal-and-urgent-wait: signaling thread suspends until signaled completes
- suspension queue has higher priority than new-entering thread queue
- signal-and-continue: signaling thread continues, enter on monitor release
- no separate suspension queue
- signal is called notify
- notify ~= wake up some waiting thread (no guarantee which one)
- notifyAll ~= notify all threads
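A bounded-buffer sketch with Python's `threading.Condition`, which follows signal-and-continue semantics; hence the predicate re-check loop around `wait()`:

```python
import threading, collections

buffer, CAP = collections.deque(), 2
cond = threading.Condition()

def produce(item):
    with cond:
        while len(buffer) == CAP:   # re-check after wake-up
            cond.wait()             # releases monitor, blocks, re-acquires
        buffer.append(item)
        cond.notify_all()           # notifyAll: wake every waiting thread

def consume():
    with cond:
        while not buffer:
            cond.wait()
        item = buffer.popleft()
        cond.notify_all()
        return item

results = []
t = threading.Thread(target=lambda: results.extend(consume() for _ in range(4)))
t.start()
for i in range(4):
    produce(i)
t.join()
print(results)  # [0, 1, 2, 3]
```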
Deadlocks
- Necessary conditions for deadlock (Coffman et al.)
- Mutual exclusion
- Hold and wait: holding resources while waiting for more
- No preemption: resources can't be taken away
- Circular wait: cycle of threads waiting for each other
- Only avoided by proper design
- Lock-free algorithms
- Locking algorithms with partially ordered locks
- Lock sets $L$
- $\lambda(p)\subseteq L$ lock set at program point $p$
- Lock order $\triangleleft \subseteq L \times L$: $l \triangleleft l' \iff l \in \lambda(p)$ at the program point $p$ where $l'$ is acquired; $\prec ~=~ \triangleleft^+$
- Transitive closure $\sigma \subseteq X \times X;~ \sigma^+ = \cup_{i\in\mathbb{N}} \sigma^i ;~ \sigma^0 = \sigma;~\sigma^{i+1} = \{\langle x_1,x_3 \rangle ~|~ \exists x_2 \in X . \langle x_1,x_2 \rangle\in \sigma^i \wedge \langle x_2, x_3 \rangle \in \sigma^i \} \cup \sigma^i$
- Freedom of deadlock: $\nexists a\in L: a \prec a$
- Freedom of deadlock for monitors
- $L = L_{\text{Semaphores}} \cup L_{\text{Monitors}}$
- $(\forall a \in L_S. a \nprec a) \wedge (\forall a \in L_M, b\in L. a \prec b \wedge b \prec a \implies a = b)$
- Static deadlock avoidance guarantee
- Summary of locks / monitors with multiple instances into single $\overline{a}\in L_M$
- Requirement of $\forall \overline{a} . \overline{a} \nprec \overline{a}$
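The partially-ordered-locks idea can be sketched as follows; `lock_order` and `acquire_in_order` are hypothetical helpers that normalize every acquisition to one global order, so no cycle $a \prec \dots \prec a$ can form:

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
lock_order = {id(lock_a): 0, id(lock_b): 1}  # the global order on L

def acquire_in_order(*locks):
    # Always acquire in ascending order, regardless of the request order.
    ordered = sorted(locks, key=lambda l: lock_order[id(l)])
    for l in ordered:
        l.acquire()
    return ordered

def release_all(locks):
    for l in locks:
        l.release()

log = []
def worker(name, first, second):
    held = acquire_in_order(first, second)  # request order is normalized here
    log.append(name)
    release_all(held)

t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))  # reversed request
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(log))  # both threads finish: no circular wait possible
```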
Transactional Memory
- Zombie Transaction: Transactions propagating inconsistent state
- Opacity: Ability for failing transactions to see consistent view of memory
- Single Lock atomicity (SLA)
- Execution of transactions as if guarded by a single globally shared lock
- Weaker progress guarantee than mere transactions
- Correctness requirement too strong in practice
- atomic acts as barrier
- Transactional sequential consistency (TSC)
- strong isolation but permission of concurrent execution of transactions
- prevents reordering of accesses within transaction
- In practice: allowance of race-free reordering
ACID
- Atomicity
- Consistency
- Isolation
- Strong ~: Retaining order of accesses to TM and non-TM
- Weak ~: Memory guarantees only within atomic blocks
- Classic race condition scenario
- Durability
Software Transactional Memory (STM)
- Translation of atomic blocks into transactions
- Storage of transaction descriptor: atomic-block meta-data
- undo-log: roll back on failure
- redo-log: postponed until commit
- read- & write-set: locations accessed
- read- & write-version: timestamps of accesses
- Example: TL2 STM implementation
- provides opacity
- lazy versioning: postponing of all writes using redo-log
- validating conflict detection: access of modified address aborts transaction
- can cause unnecessary aborts due to non-atomicity of individual accesses
- comparison of data-timestamps
- still requirement of locks for writing values → danger of deadlock
- can cause contention on commit
- improvement: collect new data in object, atomically commit pointer exchange
- can cause contention on global clock
- Integration of non-TM resources
- Prohibiting: Rejection by compiler
- Execution: I/O Might only happen on some runs, abort if direct I/O happens (usually buffered)
- Irrevocable Execution: Universal way for non-undoable operations: cause conflict in all other transactions to guarantee termination
- Integration: Re-write code to be transactional
Hardware Transactional Memory (HTM)
- Additional hardware to track read- & write-sets
- Requirement of software backup in scenarios exceeding hardware capabilities
- eager conflict detection using dedicated hardware
- separate transaction instructions to distinguish from STM
- Problematic mixing of TM and non-TM instructions
- Implicit Transactional Memory
- Marking of only beginning and end of transaction
- Transparent usage of regular instructions
- Only difference: caching and transactional processing of memory accesses
- Hardware / OS access abort transaction
- Ensures strong isolation
- Examples
- AMD Advanced Synchronization Facilities (ASF)
- Definition of logical speculative region
- LOCK MOV: explicit data transfer between normal memory and speculative region
- Used to implement larger atomic operations
- Intel TSX in Skylake
- Ability to use regular instructions with transactions
- Tracking of read / write sets using single transaction bit on cache line
- Storage space for backup of complete CPU state
- simple counter to support nested transactions
- spontaneous aborts in case of missing resources possible
- abort of inner transaction aborts whole tree
- Software interfaces provided
- Restricted Transactional Memory (RTM)
- Hardware Lock Elision (HLE)
Restricted Transactional Memory (RTM)
- No progress guarantee
xbegin
- increment counter $C$
- backup registers
- flush store buffer
- subsequent reads / writes set transaction bit $T$ if $C > 0$
- abort on Invalidate message and already set $T$ flag
- abort on observed read on a modified cache line with set $T$ flag
- available through library function _xbegin()
- Requirement of fallback code in case of failure
- Isolation from transaction
- use mutex in fallback code
- verify free mutex inside transaction, otherwise abort
- Lock can be avoided through transaction → Lock Elision
- Atomic execution
xabort
- Clear all $T$ flags
- Clear store buffer
- invalidate all former TM lines
- Set $C = 0$
- Restore CPU from backup
xend
- Decrement $C$
- If $C = 0$: clear all $T$ flags, flush store buffer
xtest: test if currently inside transaction
Hardware Lock Elision (HLE)
- Can be implemented in software using RTM
- Deferring of lock acquisition by default
- Relying on HTM to sort out concurrency conflicts
- In case of persisting conflicts fallback to locks
- Support of legacy lock code by imitating semaphore value change
- Usage of a buffer analogous to the store buffer for elided locks
- Special annotations around lock instructions required
xacquire
- Prefix for semaphore-increment instructions
- Ensures shared / exclusive cache line with $T$
- Issues xbegin
- Stores modified lock value in elided-lock buffer
xrelease
- Prefix for semaphore reset instructions
- Same effects as xend
- Flush elided-locks buffer
- For full elision all wait & signal need to be annotated
Outlook on TM
- Non-blocking message passing (actor model)
- Blocking message passing (CSP, $\pi$-Calculus)
- (immediate) priority ceiling
Actual Programming Language Topics
Dispatching
- Single dispatching
- Matching most specific method
- Searching at runtime for dynamic methods
- Multi-dispatching
- Overloads with different parameter types legal
- Multi-methods: generation of overloads
- Additional overhead at runtime
- Increased ambiguity risk and runtime failure
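A toy multi-dispatch sketch in Python; the `multimethod` decorator is invented for illustration, matches exact argument types only, and shows both the runtime lookup and the runtime-failure risk:

```python
# Registry mapping (name, argument-type tuple) to an implementation.
_registry = {}

def multimethod(*types):
    def register(fn):
        _registry[(fn.__name__, types)] = fn
        def dispatch(*args):
            # Runtime lookup on the dynamic types of ALL arguments.
            key = (fn.__name__, tuple(type(a) for a in args))
            impl = _registry.get(key)
            if impl is None:
                raise TypeError("no matching overload")  # runtime failure
            return impl(*args)
        return dispatch
    return register

@multimethod(int, int)
def combine(a, b): return a + b

@multimethod(str, str)
def combine(a, b): return a + " " + b

print(combine(1, 2))       # 3
print(combine("a", "b"))   # a b
```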
Inheritance Schools
- Implementation inheritance
- Code re-usage
- Child as specialization using library-esque parents
- C++, Lisp
- Interface inheritance
- Behavior contracts
- Providing of methods defined by the parent
- Parent as generic framework
- Java, C#
- Smalltalk inheritance
- Children dominate parents
- established reference to parent
- Java, C#
- Beta inheritance
- Parent dominates children
- Giving up control to children using explicit keyword inner
- Children extend using explicit keyword extension
Inheritance formalized
- Symmetric join $a \sqcup b$
- Asymmetric join $a ~\mathrm{asym}~ b$: prefer first if exists
- Smalltalk inheritance: $a \triangleright b : C \times C \mapsto C = \{super \mapsto b\} ~\mathrm{asym}~ (a ~\mathrm{asym}~ b)$
Inheritance Implementation
- Caching of dispatcher resolution (Hotspot JVM)
- Precomputation: Virtual tables
- Collection of function pointers to virtual method sites
- Components
- Offset to top of enclosing memory object (for multiple inheritance)
- RTTI data
- Signature to function pointer mapping
- Differences in multiple inheritance
- Casted pointers are modified to cast-target type
- Thunks: trampoline methods delegating virtual call to base function in parent with adapted this-pointer on vtable lookup mismatch
- Standard C++ behavior for common ancestors: Duplication of representations → Can result in ambiguity
- Optional behavior
- shared ancestor representation (virtual in extends statement) → Virtual Base Class, at bottom of object in memory
- Additional offset to virtual base in vtable components
- Need for dynamic_cast due to variable offset inside vtable
- Grows quickly out of control → Need for vtable for vtable
- Virtual base pointer
- virtual call offsets for adjusting this-pointer dynamically
- offset to top
- RTTI
- virtual function pointers
- virtual Thunks
- Ambiguity resolution in multiple inheritance
- Linearization through compiler-made Method Resolution Order (MRO)
- Best effort approach
- Core principles
- Inheritance Relation: $C(A,B) \implies C ~\mathrm{inherits}~ A \wedge C ~\mathrm{inherits}~ B$
- Multiplicity Relation: $C(A,B) \implies A \rightarrow B$ ($\rightarrow$ creates total order among multiple parents)
- Leftmost Preorder DFS (LPDFS)
- violation of principle 1
- used in legacy python ($\leq 2.1$)
- Example: $A(B,C);~B(W);~C(W):~ L[A] = ABWC$
- LPDFS with Duplicate Cancellation
- Flow: $ABWCW$ → Cancellation of first $W$
- Example 1: $A(B,C);~B(W);~C(W):~ L[A] = ABCW$
- Example 2: $A(B,C);~B(V,W);~C(W,V):~ L[A] = ABCWV$
- Principle 2 unfulfillable
- Conflict: $B$ requires $V \rightarrow W$ but $C$ requires $W \rightarrow V$
- Reverse Postorder Rightmost DFS (RPRDFS)
- Example 1: $A(B,C);~B(F,D);~C(E,H);~D(G);~E(G);~F(W);~G(W);~H(W):~L[A]=ABFDCEGHW$
- Refinement for conflict resolution: Addition of conflicting edge and re-run
- No guarantee about monotonicity
- Monotonicity Principle: Enforce same order among parents for recurring elements in any child's parents in the tree
- C3 Linearization
- Detection of monotonicity violation
- $L[C] = C \cdot \bigsqcup (L[B_1],...,L[B_n],B_1 \cdot ... \cdot B_n) ~|~ C(B_1,...,B_n)$, $B$ superclasses of $C$
- $\bigsqcup_i(L_i)= (\exists_{\min k} \forall j. ~ c = \mathrm{head}(L_k) \not\in \mathrm{tail}(L_j)) ~?~ (c \cdot (\bigsqcup_i(L_i \setminus c))) ~:~ \mathrm{FAIL}$
- $\mathrm{head}$: first element of a list
- $\mathrm{tail}$: list without its first element
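Python's own MRO uses C3 linearization, so the definitions above can be checked directly:

```python
# Diamond from the examples: A(B,C); B(W); C(W)
class W: pass
class B(W): pass
class C(W): pass
class A(B, C): pass

# L[A] = A · (L[B] ⊔ L[C] ⊔ B·C) = A B C W object
print([cls.__name__ for cls in A.__mro__])  # ['A', 'B', 'C', 'W', 'object']

# A monotonicity conflict makes the merge FAIL; Python raises TypeError:
class V: pass
class X(V, W): pass   # X demands V → W
class Y(W, V): pass   # Y demands W → V
try:
    class Z(X, Y): pass
except TypeError as e:
    print("FAIL:", e)
```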
Mixins
- Formally
- $mixin(c) : C \mapsto (C \mapsto C) = \lambda x . c \triangleright x$
- Mixin-Composition: $mixin \circ mixin$
- Concept to reuse code more efficiently
- native support for super
- Composable
- Hassle-free alternative to multiple inheritance
- Linearization requirement: order is important
- Sequential joining with class
- Approximation using C++
- Realized through templated superclasses
- Not modular enough type system
- Example language: Ruby
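Ruby is the example language, but the same composition can be sketched with Python's multiple inheritance, where the MRO supplies the linearization and `super()` the native super support (class names are invented):

```python
class Greeter:
    def describe(self):
        return "entity"

class LoudMixin:
    def describe(self):
        return super().describe().upper()   # delegates along the MRO

class PoliteMixin:
    def describe(self):
        return "dear " + super().describe()

# Linearization requirement: the order of the mixins matters.
class LoudPolite(LoudMixin, PoliteMixin, Greeter): pass
class PoliteLoud(PoliteMixin, LoudMixin, Greeter): pass

print(LoudPolite().describe())  # DEAR ENTITY
print(PoliteLoud().describe())  # dear ENTITY
```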
Traits
- Alternative to often misused Implementation Inheritance (Liskov Substitution Principle)
- Equivalent to a class without attributes
- Separation of object creation from modeling hierarchies and functionality composition
- Interfaces to model hierarchy
- Traits to assemble functionality
- Classes as frames for entities
- Support for method exclusion and aliasing for conflict resolution
- Composition principles
- Flat ordering
- Same precedence of all Traits
- Manual conflict resolution
- Precedence: Class methods prioritized over Traits
- Flattening
- After asymmetric join: non-overridden Trait methods turn into class methods
- Approximations
- C# Extension methods
- Static first parameter this, drawing in target object
- No flattening
- Only static scope
- Boilerplate empty interface definitions
- Cannot override abstract signatures
- Java Virtual Extension Methods (Interface default)
- additional interface phase in invokevirtual name resolution
- Still no support for overriding abstract methods
- Example language: Squeak
Prototype-based Languages
- Lua
- Everything is a table
- Every function is static
- Magic index lookup function for delegation
- Meta tables
- OOP through explicit self parameter for functions
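The delegation mechanism can be imitated in Python; `index`, `__metatable` and `__index` below are names borrowed from Lua for illustration, not a real Python API:

```python
# An "object" is just a dict; a missing key is delegated to the
# metatable's __index, mirroring Lua's magic index lookup.
def index(obj, key):
    while obj is not None:
        if key in obj:
            return obj[key]
        meta = obj.get("__metatable")
        obj = meta.get("__index") if meta else None
    raise KeyError(key)

prototype = {"greet": lambda self: "hello, " + self["name"]}
instance = {"name": "world", "__metatable": {"__index": prototype}}

# Method call with explicit self parameter, as in Lua's obj:method():
print(index(instance, "greet")(instance))  # hello, world
```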
Aspect Oriented Programming
- Separation of (cross-cutting) concerns
- Security
- Logging
- Error handling
- Validation
- Profiling
- Weaving: fuzzy combination of aspects to form code modules
- Example: AspectJ
- Cross-cutting types
- Static: Adding common external definitions for multiple classes
- Dynamic
- well-defined join points in control flow of program
- Method call & execution
- field get & set
- exception handler execution
- class init
- object init
- Type based
- target(Type)
- within(Type)
- withincode(Method)
- Flow / State based
- Pointcut: Predicate for join point execution
- Pointcut designator: definition of pointcut by programmer
- Advice: method-like construct used to define additional behavior at join points
- before()
- after()
- after() returning()
- after() throwing()
- Binding of parameters: args()
- around(): Wrapping of join point
- need to specify wrapping location with proceed()
- Scope-issues
- Expiration of proceed() scope
- Need for shadow-classes: Cloning of proceed() context
- Determining order of execution
- between different aspects
- need for precedence declaration
- following aspect hierarchy
- undefined
- within same aspect
- after advice: the later definition takes precedence
- otherwise: the first definition takes precedence
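A rough Python approximation of before/after advice using decorators (explicit weaving at the definition site, no pointcut matching; `logging_aspect` is a made-up name, not AspectJ):

```python
import functools

log = []

def logging_aspect(fn):
    @functools.wraps(fn)
    def woven(*args, **kwargs):
        log.append(("before", fn.__name__))            # before() advice
        try:
            result = fn(*args, **kwargs)
            log.append(("after returning", fn.__name__))
            return result
        except Exception:
            log.append(("after throwing", fn.__name__))
            raise
    return woven

@logging_aspect
def divide(a, b):
    return a / b

print(divide(6, 2))  # 3.0
print(log)           # before / after-returning entries for 'divide'
```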
Metaprogramming
- Code-Generation
- Preprocessor
- Homoiconic: Code = Data
- Clojure: Runtime metaprogramming
- Macros as static AST transformations, uninterpreted parameters
- functions as runtime control flow manipulations
- Macro Hygiene: Avoiding variable shadowing from Macro-generated code
- Reflection
- Metaobject Protocol
- LISP: CLOS
- Interface to manipulate systems and code structures
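Python's runtime reflection gives a lightweight flavor of such an interface (this is not CLOS; `Point` and `norm1` are invented examples):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
print(type(p).__name__)   # inspect the class of an object: Point
print(vars(p))            # inspect its structure: {'x': 1, 'y': 2}

# Manipulate code structure at runtime: attach a method to the class.
setattr(Point, "norm1", lambda self: abs(self.x) + abs(self.y))
print(p.norm1())          # 3

# Create a whole class programmatically via the metaclass `type`:
Point3D = type("Point3D", (Point,), {"z": 0})
print(Point3D.__bases__[0].__name__)  # Point
```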
Generators and Exceptions
- Stack traversal
- goto
- setjmp & longjmp: Jump across procedure boundary, storing current context
- Stack unwinding
- Destruction of stack-elements at scope end
- Table of catchable exceptions per method and cleanup table
- Back-traversal of stack on throw in search for handler
- Personality routine: Cleanup checker per function
- Context Switching
- makecontext & swapcontext
- Need to create stack and register set for new context
- Acquisition of current context for initialization: getcontext
- Possibility to register successor context, called in swapcontext
- Coroutines
- Stackless: EcmaScript 6+
- Stackful: Lua
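Stackless coroutines as in EcmaScript 6 can be mirrored with Python generators, which suspend at each yield and are resumed by the caller:

```python
# A generator suspends its (stackless) frame at every yield.
def fib():
    a, b = 0, 1
    while True:
        a, b = b, a + b
        yield a

gen = fib()
first = [next(gen) for _ in range(6)]
print(first)  # [1, 1, 2, 3, 5, 8]

# send() passes a value INTO the suspended coroutine:
def accumulator():
    total = 0
    while True:
        total += yield total

acc = accumulator()
next(acc)            # prime: run to the first yield
print(acc.send(5))   # 5
print(acc.send(3))   # 8
```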
- Continuations
- Haskell
- Context abstraction as a first-class language feature
- Continuation Passing Style (CPS)
- Instead of directly returning, a supplied function is called with computed result
- Included in method signature
- Direct: square :: Int -> Int
- CPS: square :: Int -> ((Int -> r) -> r)
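The signatures above, transcribed to Python: instead of returning directly, the CPS version hands its result to a supplied continuation `k` (`bind` and `pythagoras` are illustrative names mimicking the Cont monad's combinator):

```python
def square(x):
    return x * x                  # direct style: Int -> Int

def square_cps(x):
    return lambda k: k(x * x)     # CPS: Int -> ((Int -> r) -> r)

def add_cps(x, y):
    return lambda k: k(x + y)

# Chaining suspended computations, analogous to the Cont monad's >>=:
def bind(s, f):
    return lambda k: s(lambda x: f(x)(k))

pythagoras = bind(square_cps(3), lambda a:
             bind(square_cps(4), lambda b:
             add_cps(a, b)))
print(pythagoras(lambda r: r))  # 25, by running with identity as final k
print(square(5))                # 25
```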
- Cont Type constructor
- Cont as suspended computation data type
- cont :: ((a -> r) -> r) -> Cont r a (suspended computation from CPS function)
- runCont :: Cont r a -> (a -> r) -> r (run a suspended computation with a given final function)
- Monads
- type class for arbitrary type constructors
- Definition of return method
- Definition of combinator function bind (>>=)
- Do-Notation: syntactic sugar for imperative-like code
- Cont-Monad
return x = cont (\k -> k x)
s >>= f = cont (\k -> runCont s (\x -> runCont (f x) k))
- Tail call optimization
- Reusing of existing stack frame and parameters
- Jump to called function instead of creating a new stack frame