This blog is based on course materials by Dr. Gerald Weber. I find it hard to focus on learning these concepts, so I decide drive myself with some output.

Dimensions of Version Control

VC system

Version Control System has two spaces:

  • Product space, for the code files.
  • Meta space, for the version histories.

    Central Version Control Systems vs. Distributed Version Control Systems:

    DVCS allows work offline and enables different workflows.

    Best Practices

    Coding Style

    Need to decide on a code style among the team. Use style checker if possible/ configure your IDE.
    Side note: prettier is an opinionated code formatter that enforces a consistent style by parsing your code and re-printing it with its own rules. You can have your own configurations.

    Commits

    Single Logical Changes

    So that changes can be better adopted/cherry picked, and easier to review.

    Don’t break the build

    Only commit changes that preserve system integrity.
    Avoid commits that fixes other commits, e.g. “Add changes missing from previous commit”: Rework history instead.
    See the fixup commit argument to fix previous commits:
    1
    2
    3
    $ git add ...                           # Stage a fix
    $ git commit --fixup=a0b1c2d3 # Perform the commit to fix broken a0b1c2d3
    $ git rebase -i --autosquash a0b1c2d3~1 # Now merge fixup commit into broken commit

    Avoid Regressions

    run unit tests before committing
    keep test suite up-to-date
    communicate about manual testing

    Non critical cleanup

    Keep noncritical cleanup/reformatting separate from functional changes.

    Refactorings in single commit

    Similarly, refactoring does not expect to change functions, but influences a lot of code. Communicate about refactoring so that everyone has the up-to-date version.

    Merge global changes early

    All branches might need to consolidate and prepare for merging the global cleanup/refactoring changes. Merge them early to avoid hard conflicts.

    Communicate with other developers

    Regular review and cleanup activities.
    Discuss and agree on a design.
    Check if someone else is working on the same part to avoid conflicts.

    Fix hostpots for conflicts

    If some part of the code often cause merge conflicts, it might indicate important design flaw/issue: not decoupled with other modules enough, too much dependencies, etc.

Merge Commit

Merge should only contain changes from the three-way
merge

A merge commit should only contain changes that fix the commit. Don’t add new things while resolving the conflict.

What files to include in the repository

Usually commit only source files, not generated files.
Do include platform independent build instructions in the repository too, but not IDE specific files or local configs.
No files that can be generated by the build system.
Read more about the .gitignore file. You can set global .gitignore file for your machine and keep the project .gitignore file specific to the project itself. Go to github/gitignore and gitignore.io to download/generate your .gitignore file.

Writing a good commit message

The seven rules of a great Git commit message, from this website

  1. Separate subject from body with a blank line (The block line separating the subject and the body is critical: various tools like ‘log’ and ‘rebase’ can get confused if you run the two together.)
  2. Limit the subject line to 50 characters
  3. Capitalize the subject line
  4. Do not end the subject line with a period
  5. Use the imperative mood in the subject line
    (A properly formed subject line will complete the sentence:
    If applied, this commit will your subject line here)
  6. Wrap the body at 72 characters
  7. Use the body to explain what and why vs. how. The diff will explain the how.

If you use an issue tracker, put references to them at the bottom, like this:

1
2
Resolves: #123
See also: #456, #789

Git Commands

git cherry-pick
git cherry-pick is a powerful command that enables arbitrary Git commits to be picked by reference and appended to the current working HEAD.
git rebase:
Rebase is one of two Git utilities that specializes in integrating changes from one branch onto another. The other change integration utility is git merge . Merge is always a forward moving change record. Alternatively, rebase has powerful history rewriting features.
git blame
The high-level function of git blame is the display of author metadata attached to specific committed lines in a file.
When you are interested in finding the origin for lines 40-60 for file foo, you can use the -L option like so (they mean the same thing — both ask for 21 lines starting at line 40):

1
2
git blame -L 40,60 foo
git blame -L 40,+21 foo

Git Workflows

See a great article discussing git workflows here.

  • Centralised workflow
  • Featured branches/ Topic branches. Discussed here.
  • Forking workflow
  • Gitflow workflow

Guidelines:

  • Short-lived branches
  • Minimize and simplify reverts (it’s beneficial to have a workflow that allows for easy reverts that will not disrupt the flow for other team members.)
  • Match a release schedule

Diffs in Version Control

Idea: represent change as operations on the file.

Diffs/patches are important for developers:

  • to review their own changes
  • to review changes/patches made by other people
  • to apply the patch to other versions of the code

Unified Diff Format

Operation model:

  • only insert and deletion
  • changes are applied in parallel
    See here for an example of unified diff format.
Unified Diff format

Computing Diff

diff[v->w], or diff(v,w) is a diff transforming v into w.
udiff[v->w], or udiff(v,w) is a unified diff transforming v into w.

Line Based Diffs

Operate on a string of lines

  • A “string of lines” means, that the alphabet is the set of all possible individual lines.
  • The actual diff algorithm is therefore a diff of strings over
    any character set C.

    Edit Script

    Finding the smallest edit script between two versions:
    Largest common subsequence (LCS)

The size of the smallest insert/delet edit script is called the
Levenshtein distance.

Edit Graph

Directed Acyclic Graph (DAG)

  • top left corner: origin
  • bottom right corner: sink
  • every path from origin to sink represents an edit script
  • Vertices:
    • column labels after 0 given by source string
    • row labels after 0 given by target string
  • Edges:
    • vertical downward move: edit: insert column label of
      endpoint of the edge.
    • horizontal move right: edit: delete row label of endpoint
    • diagonal move: keep, no edit, no change; requires matching column/row labels on endpoint. Diagonal move is available when the column label and row label at the endpoint are the same.
  • Paths:
    • The path to upper right corner, then down to sink: first delete whole source string, then insert whole target string.
    • every path from origin to bottom right corner represents
      an edit script.

Summary: Computing optimal line-based diffs

Equivalent to:

  • finding the largest common subsequence (LCS)
  • finding the shortest edit path (D-path)
  • finding the minimum cost path
    Greedy Principle
  • its enough to search on the furthest reaching D-paths to
    find optimal solution
  • Runtime O((N + M)D), or with input length n as O(nD)

Merge

Three-way merge

Take parent into account, and do the following three steps:

  • For both Dev A and B, do diff relative to parent.
  • apply both patches in parallel
  • if both patches change the same line -> conflict

If an automatic merge is not possible (because of conflicts),
conflicts can be solved using a merge tool.

  • npm-merge-driver can be used for automatic merging of lockfiles - you can install it globally into your machine:
    1
    $ npx npm-merge-driver install --global
    • Fork provides graphical three-way merge tool.
    • meld is a visual diff and merge tool targeted at developers: haven’t found out about it yet.

Summary

Diffs are fundamental for dealing with different versions
of a file.

  • create diffs
  • apply diffs
  • 3 way merge to incorporate concurrent changes of the
    same file
    Diffs are computed as edit scripts.
  • Our standard diffs are insert/delete scripts on lines.
  • We are usually interested in the smallest edit script; it
    also identifies the largest common subsequence.
  • Can be computed in O(nD) time, in practice often even
    faster. n is the length of input (M+N, M and N are length of start and end versions), and D is the number of differences.
myers diff

Conflicts and Edits

We now look at edit scripts with more operations: move, copy, cut/paste, and replaceAll.

Understanding different conflicts

  • Textual Conflicts: Edits in the same line, detected by merge tools
  • Semantic Conflicts: logical conflicts (e.g. method signature change)
    Both textual and semantic conflicts must be resolved by user.

parallel vs sequential in edit scripts

The comma is an operator, requires that for (s,r) the edit scripts s and r can be merged: edit scripts can be executed on the same source.
For a general edit script we also define sequential execution of edits, denoted with semicolon: (s;r)
In (s;r) the edit r is executed on the result of s:

1
d(2,1), i(4,’hi’) = d(2,1); i(3,’hi’)

Properties of parallel and sequential operators

  • The comma (s,r) is commutative and associative.
    • (d(2,1),i(4,'hi')) = (i(4, 'hi'), d(2,1))
    • (d(2,1), i(4,'hi')), d(7,1) = d(2,1)(i(4,'hi'), d(7,1))
    • Therefore, we use sorted order wherever possible.
  • The semicolon (s;r) is associative.
    • (d(2,1);i(3,'hi'));d(7,1) = d(2,1);(i(3,'hi'); d(7,1))
  • We can compose more complex edit scripts.
    • `(d(2,1), i(3,’hi’),i(5,’.’));d(7,1) = d(2,1), i(3,’hi’)

edits that are combined with the parallel comma operator may be undefined: this models a textual conflict.

More edit operations for edit scripts

  • Operations beyond insert/delete:
    • move, copy/cut + paste, replaceAll
  • Important question:
    • How do merges work with new operations?
      Finding strictly minimal edit script can become more complicated.

Operation replaceAll cannot be reconstructed with certainty from change file.
But: With new operations,

  • Some concrete merges may become easier.
  • We may understand semantic conflicts better.

idxcv scripts

delete d(pos, count), insert i(posB, string),
cut x(pos, count, X), the buffer is X
copy c(pos, count, X)
paste v(posB, X)
cut/paste operations can be combined with the parallel operator:
x(a,b,X),v(d,X) = v(d,X),x(a,b,X)
Idxcv scripts can be applied to a text:
(x(1,1,X), v(6,X))('!hello') = 'hello!'
Question: What is the result of the following expression?

1
x(7,1,X);v(6,X); d(2,1); v(1,X)(’hello u’) =?
1
2
1 2 3 4 5 6 7
h e l l o _ u

After x(7,1,X);v(6,x), no change really happens:
Note that posB of x or c starts right at that position, while for v and i the content is added right behind the given start position

1
2
3
1 2 3 4 5 6 7
h e l l o _ u
Buffer: u

After d(2,1):

1
2
3
1 2 3 4 5 6 7
h l l o _ u
Buffer: u

After v(1,X):

1
2
3
1 2 3 4 5 6 7
h u l l o _ u
Buffer: u

Edit scripts and 3-way merge

We have said that a 3-way merge combines work done on two branches in parallel.

  • The work in the two branches since the common ancestors can be represented as two idxcv scripts, lets name them p and q. The ancestor s is in our model simply a string.
  • Then the result of the 3-way merge is (p,q)(s).
  • We can now look at p and q and see if there is a conflict.
  • If branch p has three individual commits, then it can be written as p = (p1; p2; p3) separated by the semicolon operator. The 3-way merge is now ((p1; p2; p3), q)(s) and this may help us understand at which stage a possible conflict occurred.

Question: should the following edit script result in a conflict? If not, what’s the result?

1
((x(1,5,X), v(10,X)), (d(2,1); i(1,’u’)))(’hello_hi!_’)

No. can be merged without conflict.

1
2
1 2 3 4 5 6 7 8 9 10
h e l l o _ h i ! _

After (x(1,5,X), v(10,X)):

1
2
1 2 3 4 5 6 7 8 9 10
_ h i ! _ h e l l o

In parallel, after ((d(2,1); i(1,’u’))):

1
2
1 2 3 4 5 6 7 8 9 10
h u l l o _ h i ! _

Merge the above two versions together:

1
2
1 2 3 4 5 6 7 8 9 10
_ h i ! _ h u l l o

Logical positions in a string

  • The above merge works because we adopt the view that each character in the string has an identity.
  • This identity is not affected by change of its index
  • Standard model is text as an OO(doubly) linked list.
  • The logical positions that we are interested in our operations are based on a character’s context:
    • its position before a character
    • its position after a character

Logical change with fuzzy patching

Fuzzy patching patterns

fuzzy patching operators: patterns instead of numbers.
x(’the >’, ’<fox’,X), v(’r the >’, X)( ’the lazy fox jumps over the red cat’)
= the fox jumps over the lazy red cat

  • patterns: ad-hoc(as necessary) identifiers of logical positions:
    • prefix: pattern for position after a character: ’the >
    • suffix: pattern for position before a character: ’<fox”
    • fuzzy patching operators: patterns instead of numbers.
    • i(), v() have a single pattern as position, d(), x(), c() have
      two patterns as positions.

      How fuzzy patching simplifies rewrites

  • Often no rewriting needed if we switch from parallel to sequential operation:
    x(’the >’, ’<fox’,X), v(’r the >’, X)=x(’the >’, ’<fox’,X); v(’r the >’, X)
  • But fuzzy patching needs space for a unique pattern.

If the pattern is disturbed, we obviously have to repair it, so we need rewrites for such cases.
x(’Hi>’, ’<, H’,X), v(’Jo,>’, X)(’Hi Jo, Hello, great’)
=
x(’Hi>’, ’<, H’,X); v(’Hi,>’, X)(’Hi Jo, Hello, great’)
=
'Hi, Jo Hello, great'
= i('Hi,>', ' Jo')('Hi, Hello, great')

POSIX ENF convention

We might want to identify the ends of the input in the pattern. We agree to use the same signs as in POSIX regular expressions:

  • The ˆ matches the start of the string.
  • The $ matches the end of the string.
    1
    (x(’ˆthe >’, ’<fox j’,X), v(’r the >’, X))( ’the lazy fox and the quick fox jumped over the red cat’)
    The result is:
    ‘the fox jumped over the lazy fox and the quick red cat’

Textual conflict for logical positions

  • Clear case of textual conflict: Two parallel inserts/pastes at the same logical position, which means:
    • both at the same before-position or at the same after-position.
    • Reason it is a conflict: because of parallelism, not clear which one goes first.
    • This conflict case is the equivalent of the textual conflicts in udiff-based diff tools.
  • By contrast, two parallel inserts between the same characters, but one at the before-position and one at the after-position, create no conflict: result is clear.
    • This case cannot be expressed in udiff tools.

Merge with nested changes

  • If a change in one version fully nests within a cut of a move of a parallel version, there is no conflict.
    ((x(’r the >’, ’<cat’,X); v(’ˆthe >’,X)),d(’pretty >’, ’<red’))(’the fox jumps over the pretty much red cat’)
    = ‘the pretty red fox jumps over the cat’
  • What would be the outcome for ins/del edit script?
    • an insert of ’the pretty much red fox’ plus a textual conflict at the delete position: Has the word “much” been deleted when we insert “the pretty … red” part before the fox?
  • what about d() in parallel with surrounding d()? Isn’t this simply a d()?
    -Problem: We don’t know whether the surrounding d() is an x() for a later v()! If so, then the insert/delete merge would give the wrong result.

Therefore, with the rich operation model, we can perform more automatic merges.

Nested changes explained with commutativity

  • If a change in one version fully nests within a cut of a move of a parallel version, there is no conflict:
  • If they nest properly, they do not interfere. In other words, they commute. Execution in each order yields the same result.
  • The commutativity expresses the fact that the logical position for the inner change is available before and after the change.

Summary

  • Logical positions refer to identities of characters in the text.
  • Logical positions can be given with patterns for fuzzy patching.
  • Rewriting of operations is often not necessary when switching from comma to semicolon.
  • Textual conflicts can be specified in the extended model.
  • Logical positions explain why more merges are possible in our extended operation model, while the udiff(insert/delete) model might cause conflicts in the same situations.

Refactorings and merge

Refactoring with ra()

Scenario: One developer renames a public function in all appearances, and the other developer, in parallel, makes a call to the function with the old function name. The udiff-based 3-way merge tool does not produce a textual conflict for just these changes, but this conflict may be trapped in typed languages at compile time, or generally at unit test time.

We extend the idxcv scripts with a replaceAll operation ra() intended to be unsupervised => idxcvra

Syntax: ra(oldstring, newstring) replaces all occurrences of string oldstring with newstring.

ra() intended to be unsupervised.

Replaces all old string with new string

Merging idxcvra scripts

Example 1

(ra(’copy’, ’scan’), i(0, ’0. copy attachments, ’))(’1. copy letter, 2. copy invoice’)
Because ra() is unsupervised, safe merge result option is
’0. scan attachments, 1. scan letter, 2. scan invoice’

i(0, ’0. copy attachments, ’)) is based on text before ra() and should be considered to intend to refer to the function named m before the rename.

what if the 3-way merge is performed by an i()/d() based merge tool?
The effect of ra() gets translated into two i()/d() operation pairs, for the existing matches only. Overall result: ’0. copy attachments, 1. scan letter, 2. scan invoice’ -> A likely semantic conflict.

Example 2

1
2
((ra('m(', 'p('); i(0, 'm(s){p("out: " + s);}')),
i('<$', 'm("hi");'))('m(s){println(s);}m("hello");')

Because the programmer A writing i('<$', 'm("hi");')) would not know about the parallel branch i(0, 'm(s){p("out: " + s);}')), we have to assume that A is not aware of the refactoring and is trying to call the old function m.
Therefore the result is

1
('m(s){p("out: " + s);}p(s){println(s);}p("hello");'p("hi");)

Summary

  • One classic semantic conflict: conflict between a refactoring (a rename) and a new function call to the function under the old name.
  • If we see replaceAll as a distinct operation ra(), we can avoid the conflict and produce a safe merge result.
  • In determining the merge result we use the semantics of
    ra() as unsupervised replaceAll.

Next Steps

  • Learn more about greedy algorithms
  • Learn more about git in command lines and in graphical tools.