<p>Emma Benjaminson — Mechanical Engineering Graduate Student — <a href="http://sassafras13.github.io/feed.xml">http://sassafras13.github.io/feed.xml</a> (feed generated by Jekyll, 2021-10-18)</p>

<h1 id="do-calculus">Pearl’s Do-Calculus for Structural Causal Models</h1>
<p>2021-10-17 — <a href="http://sassafras13.github.io/DoCalculus">http://sassafras13.github.io/DoCalculus</a></p>
<p>In my last post, I introduced the idea of structural causal models (SCMs), and how you can use them to perform interventions and study counterfactuals. In this post, we are going to build on this idea by studying <strong>Pearl’s augmented SCM</strong> and the associated rules of <strong>do-calculus</strong> that we can use to write the conditional probabilities of an intervention on an SCM. We will end with a discussion about what to do when we encounter <strong>confounding variables</strong> [1].</p>
<h2 id="pearls-augmented-scm">Pearl’s Augmented SCM</h2>
<p>One way to write the conditional probabilities of an SCM that we have intervened on is to study an augmented SCM using a directed acyclic graph (DAG), G+. Before we dive into this, let’s take a quick step back and redefine for ourselves what an ordinary SCM, \(\mathcal{C}\), with a DAG G looks like. We can say that an ordinary SCM is described by a DAG G and some factors (also called <strong>structural assignments</strong>) [1]:</p>
\[X_j = f_j(\text{PA}_j, N_j)\]
<p>We say that \(\text{PA}_j\) are the parents of \(X_j\) in the DAG G, and the \(N_j\) terms are independent noise variables. These factors \(f_j\) are just another way of writing the conditional distribution of \(X_j\) given \(\text{PA}_j\). We can also say that \(\text{PA}_j\) are the causes of \(X_j\) [1].</p>
<p>Now, if we want to intervene on this SCM \(\mathcal{C}\) with DAG G, we would need to replace at least one of the factors, like so [1]:</p>
\[X_k = \tilde{f}(\tilde{\text{PA}}_k, \tilde{N}_k)\]
<p>The new SCM, \(\tilde{\mathcal{C}}\), is called an intervention SCM and the probability distribution corresponding to this SCM is called the <strong>intervention distribution</strong>. To write this new distribution we can say [1]:</p>
\[P^{\tilde{\mathcal{C}}} = P^{\mathcal{C}; do(X_k := \tilde{f}(\tilde{\text{PA}}_k, \tilde{N}_k))}\]
<p>Now that we have understood this, let’s return to the beginning of this section and explain what an augmented DAG G+ looks like. An augmented DAG G+ adds a new parentless node \(I_j\) pointing to the variable \(X_j\), for all \(j \in [p]\). Each \(I_j\) takes values in the union of a special “no intervention” symbol and the domain of \(X_j\), that is [1]:</p>
\[I_j \in \{\text{no intervention}\} \cup \mathcal{X}_j\]
<p>If the nodes \(I_j = \{ \text{no intervention} \}\) then no intervention has occurred on those nodes. Conversely, if \(I_j \in \mathcal{X}_j\) then that means that we have intervened on \(X_j\) and set it equal to the value \(I_j\) [1].</p>
<p>Given this information about the role of \(I_j\), I can now define the augmented SCM \(\mathcal{C}+\) as containing factors [1]:</p>
\[X_j = \bar{f}_j (\text{PA}_j(G+), N_j)\]
<p>I can expand this [1]:</p>
\[X_j = f_j (\text{PA}_j, N_j) \mathbb{I}[I_j = \text{no intervention}] + I_j \mathbb{I}[I_j \neq \text{no intervention}]\]
<p>This says that \(X_j\) follows its original mechanism \(f_j\) when \(I_j\) indicates no intervention, and is set directly to the value \(I_j\) when an intervention has occurred [1].</p>
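As a minimal sketch of this switching behavior (the function names and numeric values here are hypothetical, not from [1]), a structural assignment with an intervention node might look like:

```python
NO_INTERVENTION = object()  # sentinel standing in for the {no intervention} symbol

def assign(f, parents, noise, i_j=NO_INTERVENTION):
    """Augmented structural assignment: X_j = f_j(PA_j, N_j) when I_j is
    'no intervention', and X_j = I_j otherwise."""
    if i_j is NO_INTERVENTION:
        return f(parents, noise)
    return i_j

# Hypothetical mechanism X_j = PA_j + N_j:
f = lambda pa, n: pa + n
assert assign(f, 2, 0.5) == 2.5        # observational regime
assert assign(f, 2, 0.5, i_j=7) == 7   # do(X_j := 7)
```

The indicator functions \(\mathbb{I}[\cdot]\) in the equation above play exactly the role of this `if` statement.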
<p>From this expression we can write the intervention distribution [1]:</p>
\[P_{Y|I_j = x_j}^{\mathcal{C}+} = P_{Y}^{\mathcal{C} ; do(X_j = x_j)}\]
<p>This says that the distribution of Y under an intervention on \(X_j\) (the RHS) is equal to the augmented SCM \(\mathcal{C}+\)’s distribution for Y conditioned on the intervention node \(I_j\) (the LHS). The statement above also leads to the following [1]:</p>
\[P^{\mathcal{C} ; do(X = x)}(Y) = P^{\mathcal{C}}(Y | X = x)\]
<p>This is true if Y is d-separated from \(I_X\) given \(X\) in the augmented DAG G+. We can derive this for ourselves as follows [1]:</p>
\[P^{\mathcal{C} ; do(X = x)}(Y) = P^{\mathcal{C}+}(Y | I_X = x)\]
\[P^{\mathcal{C} ; do(X = x)}(Y) = P^{\mathcal{C}+}(Y | I_X = x, X = x)\]
<p>The second line holds because setting \(I_X = x\) forces \(X = x\). Given that Y is d-separated from \(I_X\), we can then remove \(I_X\) from the conditioning set [1]:</p>
\[P^{\mathcal{C} ; do(X = x)}(Y) = P^{\mathcal{C}+}(Y | X = x)\]
<p>And we know that the marginal distribution over the non-intervened factors is the same for both the original SCM and the augmented one so [1]:</p>
\[P^{\mathcal{C} ; do(X = x)}(Y) = P^{\mathcal{C}}(Y | X = x)\]
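To make this identity concrete, here is a small sketch (the binary SCM, its XOR mechanism, and the noise distributions are hypothetical choices, not from [1]) that verifies by exact enumeration that intervening and conditioning coincide when there is no confounding:

```python
from itertools import product

# Hypothetical binary SCM with no confounding: X = N_X, Y = X XOR N_Y.
p_nx = {0: 0.6, 1: 0.4}   # P(N_X)
p_ny = {0: 0.9, 1: 0.1}   # P(N_Y)

def p_y_given_x(y, x):
    """Observational conditional P(Y = y | X = x), by enumeration."""
    joint = sum(p_nx[nx] * p_ny[ny]
                for nx, ny in product([0, 1], repeat=2)
                if nx == x and (nx ^ ny) == y)
    return joint / p_nx[x]

def p_y_do_x(y, x):
    """Intervention distribution: replace X's assignment with the constant x."""
    return sum(p_ny[ny] for ny in [0, 1] if (x ^ ny) == y)

# Y is d-separated from I_X given X in G+, so the two distributions agree.
for x, y in product([0, 1], repeat=2):
    assert abs(p_y_given_x(y, x) - p_y_do_x(y, x)) < 1e-12
```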
<p>There is another fact we can derive in a similar style [1]:</p>
\[P^{\mathcal{C} ; do(X = x)} (Y) = P^{\mathcal{C}}(Y)\]
<p>This is true if Y is d-separated from \(I_X\) in G+ [1].</p>
<h2 id="pearls-do-calculus">Pearl’s Do-Calculus</h2>
<p>The two rules we wrote in the previous section can be extended into a “calculus” for manipulating intervention conditionals for SCMs. In this section we will write down all three rules explicitly.</p>
<h3 id="insertiondeletion-of-observations">Insertion/Deletion of Observations</h3>
<p>If W is d-separated from Y given Z and X in the graph \(G_Z^+\) (where \(G_Z^+\) is the augmented graph G+ with the incoming edges to Z removed) then [1]:</p>
\[P^{\mathcal{C} ; do(Z=z)}(Y | X, W) = P^{\mathcal{C}; do(Z=z)}(Y|X)\]
<h3 id="actionobservation-exchange">Action/Observation Exchange</h3>
<p>If Y is d-separated from \(I_X\) given X, Z and W in the graph \(G_Z^+\), then [1]:</p>
\[P^{\mathcal{C}; do(Z=z, X=x)}(Y|W) = P^{\mathcal{C}; do(Z=z)}(Y|X=x, W)\]
<p>The simplest case of this rule, where \(Z = W = \emptyset\), is the same as one of the rules we looked at in the previous section, namely [1]:</p>
\[P^{\mathcal{C}; do(X=x)}(Y) = P^{\mathcal{C}}(Y|X=x)\]
<p>This is true if Y is d-separated from \(I_X\) given X in the graph G+.</p>
<h3 id="insertiondeletion-of-actions">Insertion/Deletion of Actions</h3>
<p>Finally, we can say that if Y is d-separated from \(I_X\) given Z and W in the graph \(G_Z^+\), then [1]:</p>
\[P^{\mathcal{C}; do(Z=z, X=x)}(Y|W) = P^{\mathcal{C}; do(Z=z)}(Y|W)\]
<p>And this relates to our other rule from the previous section if we set \(Z = W = \emptyset\), that is [1]:</p>
\[P^{\mathcal{C}; do(X=x)}(Y) = P^{\mathcal{C}}(Y)\]
<p>This is true if Y is d-separated from \(I_X\) in G+.</p>
<h2 id="confounding-variables">Confounding Variables</h2>
<p>The mathematical definition of confounding variables says that if we have an SCM \(\mathcal{C}\) with a directed path from X to Y, then the causal effect from X to Y is confounded if [1]:</p>
\[P^{\mathcal{C}; do(X=x)} (Y) \neq P^{\mathcal{C}}(Y | X=x)\]
<p>Otherwise the causal effect is not confounded. Confounding variables can be used to explain the dependence between X and Y. Let’s consider an example to clarify this idea.</p>
<p>Let’s say we want to understand the patient’s probability of recovering, R, from a disease given some treatment, T. The treatment, T, is causally related to the recovery, R. Both of these random variables are binary, i.e. [1]:</p>
\[T, R \in \{0, 1\}\]
<p>We want to know the effect that the treatment has on the patient’s recovery, specifically [1]:</p>
\[P^{\mathcal{C}; do(T=1)} (R = 1) - P^{\mathcal{C}; do(T=0)}(R=1)\]
<p>We find that this difference is <strong>not</strong> equal to the difference between the conditional probabilities of recovery [1]:</p>
\[\neq P^{\mathcal{C}}(R = 1 | T=1) - P^{\mathcal{C}}(R = 1 | T=0)\]
<p>This indicates that there is a confounding variable present between T and R, which we will call Z. If we look at the original and intervention SCMs for this system, we can see that there are some conditionals in the intervention probability distribution that are equal to the original probability distribution, such as [1]:</p>
<p>\(P^{\mathcal{C}; do(T=1)}(R=1|Z=z) = P^{\mathcal{C}}(R=1 | T=1, Z=z)\)
\(P^{\mathcal{C}; do(T=1)}(Z=z) = P^{\mathcal{C}}(Z=z)\)</p>
<p>These equalities allow us to rewrite the intervention probability distribution in terms of conditional probabilities that we can observe, that is [1]:</p>
<p>\(P^{\mathcal{C}; do(T=1)}(R=1) = \sum_z P^{\mathcal{C}; do(T=1)}(R=1, T=1, Z=z)\)<br />
\(= \sum_z P^{\mathcal{C}; do(T=1)}(R=1 | T=1, Z=z) P^{\mathcal{C}; do(T=1)}(T=1, Z=z)\)<br />
\(= \sum_z P^{\mathcal{C}; do(T=1)}(R=1 | T=1, Z=z)P^{\mathcal{C}; do(T=1)}(Z=z)\)<br />
\(= \sum_z P^{\mathcal{C}}(R=1 | T=1, Z=z) P^{\mathcal{C}}(Z=z)\)</p>
<p>All of the probabilities in the last line of the equation above are things that we can observe without intervening on the SCM. In this situation, we can see that Z is a <strong>valid adjustment set</strong> for T and R, because by taking it into account we can express the causal effect of T on R entirely in terms of observable quantities [1].</p>
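The adjustment argument can be checked numerically. In the sketch below (all probability tables are made-up values for a hypothetical Z → T, Z → R, T → R model), the adjustment over Z recovers the true intervention distribution while the naive conditional does not:

```python
# Hypothetical binary model: confounder Z -> T and Z -> R, plus T -> R.
p_z = {0: 0.7, 1: 0.3}                       # P(Z = z)
p_t1_given_z = {0: 0.2, 1: 0.8}              # P(T = 1 | Z = z)
p_r1_given_tz = {(0, 0): 0.5, (1, 0): 0.7,   # P(R = 1 | T = t, Z = z)
                 (0, 1): 0.2, (1, 1): 0.4}

def p_t(t, z):
    return p_t1_given_z[z] if t == 1 else 1 - p_t1_given_z[z]

def p_r1_given_t(t):
    """Observational conditional P(R = 1 | T = t)."""
    num = sum(p_z[z] * p_t(t, z) * p_r1_given_tz[(t, z)] for z in [0, 1])
    den = sum(p_z[z] * p_t(t, z) for z in [0, 1])
    return num / den

def p_r1_do_t(t):
    """Ground-truth intervention: cut Z -> T, so Z keeps its marginal."""
    return sum(p_z[z] * p_r1_given_tz[(t, z)] for z in [0, 1])

def adjustment(t):
    """sum_z P(R = 1 | T = t, Z = z) P(Z = z) -- observable quantities only."""
    return sum(p_r1_given_tz[(t, z)] * p_z[z] for z in [0, 1])

assert abs(adjustment(1) - p_r1_do_t(1)) < 1e-12   # adjustment recovers do(T = 1)
assert abs(p_r1_given_t(1) - p_r1_do_t(1)) > 1e-3  # naive conditioning is biased
```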
<p>Okay, I think that’s all I want to say today on this topic. Stay tuned for more posts on performing inference with PGMs!</p>
<h2 id="references">References</h2>
<p>[1] Ravikumar, P. “Causal GMs.” 10-708: Probabilistic Graphical Models. 2021. Class notes.</p>

<h1 id="scms">A Brief Introduction to Structural Causal Models</h1>
<p>2021-10-04 — <a href="http://sassafras13.github.io/SCMs">http://sassafras13.github.io/SCMs</a></p>
<p>In this post we are going to talk about a different kind of graphical model than in previous posts, known as a causal model. Previously I have written about probabilistic graphical models, which are <strong>statistical</strong> models that learn from observations about the real world. But now we are going to consider <strong>causal</strong> models, where we learn from observations and <strong>interventions</strong> made during the process that we are modeling [1]. In this post we will introduce some basic ideas related to causal models and experiment with interventions on simple structural causal models.</p>
<h2 id="interventions-and-counterfactuals">Interventions and Counterfactuals</h2>
<p>The major difference between causal and statistical models is that in causal models, we can intervene in the process we are modeling. Specifically, we can set the value of a random variable to a specific value, and see how that impacts the other random variables in the model [1].</p>
<p>Another way of reasoning about causal models is to consider <strong>counterfactuals</strong>, where we observe the outcome of a process given the value of a certain variable, and then we ask what would have been the outcome if we had set the value of a random variable to a different value. This is subtly different than interventions [1].</p>
<p>Note that we can think of different graphical models as fitting into a hierarchy, as follows [1]:</p>
<ul>
<li>Statistical models: capable of handling observations</li>
<li>Causal models: capable of considering observations and interventions</li>
<li>Structural causal models: capable of handling observations, interventions and counterfactuals</li>
<li>Physical models: able to do everything a structural causal model can do, as well as giving insight into the real world.</li>
</ul>
<h2 id="independent-mechanisms">Independent Mechanisms</h2>
<p>A concept that we need to develop for causal models is the idea of <strong>independent mechanisms</strong>. Consider 2 random variables, A and B. In order to determine which variable is the cause, and which is the effect, we would need to intervene on each variable in turn and see if intervening in one affects the other. More specifically, let’s say we have a joint probability distribution [1]:</p>
\[P(A, B) = P(A) P(B | A)\]
<p>In this case, if A is the cause and B is the effect, then \(P(A)\) and \(P(B \mid A)\) are independent mechanisms. In other words, if we changed \(P(A)\), we would not expect that the mechanism \(P(B \mid A)\) would change. But if B were the cause, then changing B would change the mechanism \(P(B \mid A)\) [1].</p>
<p>This idea leads to the <strong>Principle of Independent Mechanisms</strong> which states that a causal generative process is made up of separate models that do not have any influence on one another. If the generative process describes a joint probability distribution, then all the conditional distributions of effects conditioned on causes do not influence the other conditional distributions [1]. This principle gives rise to 3 things that must be true for causal models [1]:</p>
<ul>
<li>We should be able to affect one mechanism (i.e. intervene on one mechanism) without affecting any of the others.</li>
<li>One mechanism should not provide any information about other mechanisms. Similarly, the changes in one mechanism should not tell us anything about how other mechanisms might have changed.</li>
<li>It is possible to write down the conditional distributions (i.e. the mechanisms) as a set of random variables and noise variables, for example [1]:</li>
</ul>
<p>\(C = N_C\)<br />
\(E = f_E(C, N_E)\)</p>
<p>In this case, \(N_C\) and \(N_E\) are noise variables and \(f_E( \cdot)\) is some function. In this situation, \(N_C\) and \(N_E\) must be statistically independent of each other for these two mechanisms to be independent.</p>
<h2 id="simple-structural-causal-models-and-interventions">(Simple) Structural Causal Models and Interventions</h2>
<p>The equations shown above represent a simple causal model with two variables, which can be represented with a <strong>structural causal model</strong> (SCM). This representation includes the equations shown above as well as a <strong>causal graph</strong> that has nodes \(\{C, E\}\) and one directed edge \(C \rightarrow E\) [1].</p>
<p>We can intervene in this causal model by changing the causal mechanism for one of the random variables. There are two sub-categories of interventions based on this idea: <strong>hard</strong> and <strong>soft</strong> interventions. Hard interventions simply refer to setting the value of one of the random variables to a constant - in the example above, that would be like setting \(E = 3\). The SCM formed by this intervention is written as \(P^{\text{do}(E = 3)}\) [1].</p>
<p>Conversely, soft interventions refer to changing the function for one of the random variables, i.e. changing from \(E = f_E(C, N_E)\) to \(E = g(C, \tilde{N}_E)\). In this case, the associated SCM for this intervention is \(P^{\text{do}(E = g(C, \tilde{N}_E))}\) [1].</p>
<p>Let’s think about what we expect to happen if we intervene on C and E, respectively. If we intervene on the effect, E, then it should make sense that the cause mechanism would not change. Therefore we know that [1]:</p>
\[P_C^{\text{do}(E = e)} = P_C \neq P_{C|E=e}\]
<p>This expression says that intervening on the effect is not the same as conditioning the cause on the effect. (In other <em>other</em> words, intervening on the effect does not give the same probability distribution for the cause as we would get if we had information about the effect and computed the probability for the cause based on that knowledge.)</p>
<p>Now let’s intervene on the cause variable, C. If we do this, then we find that the probability of the effect after the intervention <em>is</em> equal to the probability of the effect conditioned on the cause [1]:</p>
\[P_E^{\text{do}(C=c)} = P_{E|C=c} \neq P_E\]
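Both facts can be checked exactly on a toy SCM. In this sketch (the binary XOR mechanism and the noise probabilities are hypothetical choices), intervening on the effect leaves the cause’s marginal untouched, while intervening on the cause matches conditioning on it:

```python
# Hypothetical binary SCM: C = N_C, E = C XOR N_E.
p_nc = {0: 0.5, 1: 0.5}
p_ne = {0: 0.8, 1: 0.2}

def joint(c, e):
    """Observational joint: the event E = e requires N_E = c XOR e."""
    return p_nc[c] * p_ne[c ^ e]

# Intervening on the effect: P_C^{do(E=1)}(C=1) = P(C=1), not P(C=1 | E=1).
p_c1_do_e1 = p_nc[1]
p_c1_given_e1 = joint(1, 1) / (joint(0, 1) + joint(1, 1))
assert abs(p_c1_do_e1 - 0.5) < 1e-12
assert abs(p_c1_given_e1 - 0.8) < 1e-12   # conditioning != intervening

# Intervening on the cause: P_E^{do(C=1)}(E=1) = P(E=1 | C=1).
p_e1_do_c1 = p_ne[1 ^ 1]                  # run the mechanism with C fixed to 1
p_e1_given_c1 = joint(1, 1) / (joint(1, 0) + joint(1, 1))
assert abs(p_e1_do_c1 - p_e1_given_c1) < 1e-12
```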
<p>With that, I will close out this discussion on some of the basics of causal graphical models.</p>
<h2 id="references">References</h2>
<p>[1] Ravikumar, P. “Causal GMs.” 10-708: Probabilistic Graphical Models. 2021. Class notes.</p>

<h1 id="graphs3">Chain Graphs and LWF and AMP Graph Properties</h1>
<p>2021-10-03 — <a href="http://sassafras13.github.io/Graphs3">http://sassafras13.github.io/Graphs3</a></p>
<p>Today we are going to introduce a more general class of graphical models, called <strong>chain graphs</strong>, that encompass both directed and undirected graphical models. They are useful for looking at connections between groups of random variables, as we will see below. We will start by describing some of the basic features of chain graphs, and then talk about conditional independencies and other properties in two contexts: LWF and AMP.</p>
<h2 id="basic-properties-of-chain-graphs">Basic Properties of Chain Graphs</h2>
<p>Chain graphs are allowed to have both directed and undirected edges, as long as they do not have a cycle that contains a directed edge [1]. This rule gives two important properties to chain graphs [1]:</p>
<p><strong>1.</strong> We can divide the graph’s nodes, \(V\), into a disjoint partition (i.e. no overlap between the subsets), \(\{V_j\}_{j=1}^k\), where the subgraph for each subset, \(G[V_j]\), has no directed edges.</p>
<p><strong>2.</strong> For any pair of nodes, (X, Y), there is a directed edge from X to Y only if X is in an earlier subgraph than Y.</p>
<p>In this way, we have groups of nodes (<strong>chain components</strong>) in a chain graph that have flow along directed edges from one group to the next, hence forming a chain. This representation encompasses both DGMs - where the chain components are individual nodes - and UGMs - where the entire graph is one chain component [1].</p>
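A quick sketch of this partition (the graph used here is a made-up example): the chain components are just the connected components of the graph once all directed edges are dropped.

```python
def chain_components(nodes, undirected_edges):
    """Chain components: connected components of the undirected part of the
    graph (directed edges are ignored entirely)."""
    adj = {v: set() for v in nodes}
    for a, b in undirected_edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        component, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                component.add(u)
                stack.extend(adj[u] - seen)
        components.append(component)
    return components

# Hypothetical chain graph: 1 - 2 and 3 - 4 undirected; 1 -> 3, 2 -> 4 directed.
comps = chain_components([1, 2, 3, 4], [(1, 2), (3, 4)])
assert {frozenset(c) for c in comps} == {frozenset({1, 2}), frozenset({3, 4})}
```

With every edge directed the components are singletons (the DGM case); with no directed edges the whole graph is one component (the UGM case).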
<h2 id="conditional-random-fields">Conditional Random Fields</h2>
<p>One type of chain graph is a <strong>conditional random field</strong> (CRF), which can be used to represent a conditional probability distribution. A CRF represents a conditional probability as a product of factors, as follows [1]:</p>
\[P(Y | X) = \frac{1}{Z(X)} \prod_{C \in \mathcal{C}(G) | C \notin \mathcal{C}(\mathcal{X})} \phi_C((Y, X)_C)\]
<p>In this formulation, we assume that our inputs are represented by X and our targets by Y. We want to predict Y given X (this is a discriminative model) and the undirected edges of the graph G include those that connect nodes within Y, and nodes from X to Y, but we do not include edges between nodes in X. This paradigm of CRFs allows us to handle many different kinds of inputs, which may have their own complex relationships among themselves, because we focus instead on how those inputs relate to outputs [1].</p>
<h2 id="lwf-markov-properties-and-c-separation">LWF Markov Properties and C-Separation</h2>
<p>In an <a href="https://sassafras13.github.io/Graphs2/">earlier post</a> we spent some time defining the Markov properties, separation and D-separation for UGMs and DGMs, because these definitions allowed us to find conditional independencies in the associated models. The idea of Markov properties for chain graphs, as well as c-separation, are a more generalized form of these concepts, which encompass the corresponding definitions for both UGMs and DGMs [1].</p>
<p>Let’s start by talking about the Lauritzen, Wermuth and Frydenberg (LWF) Markov properties for chain graphs. According to this interpretation, the pairwise Markov properties for a chain graph G are [1]:</p>
\[\mathcal{I}_p(G) = \{X \perp Y | \text{NON-DESC}_X - \{X, Y\} : (X, Y) \notin E(G), Y \in \text{NON-DESC}_X\}\]
<p>That is, X is conditionally independent of Y given all of the non-descendents of X, excluding X and Y themselves, and where there is no edge directly connecting X and Y. This definition still holds for both UGMs and DGMs. Let’s now define the local Markov properties [1]:</p>
\[\mathcal{I}_l (G) = \{X \perp \text{NON-DESC}_X - \text{BOUNDARY}_X | \text{BOUNDARY}_X \}\]
<p>For DGMs, the boundary of X are its parents, and for UGMs the boundary of X are all of its neighbors [1].</p>
<p>Finally, in order to define global Markov properties, I need to first define c-separation. If we have a disjoint partition of nodes in a graph, \(U = X \cup Y \cup Z\), then X is c-separated from Y given Z if X is separated from Y given Z in a moralized subgraph. Specifically, this moralized subgraph is the induced subgraph over the nodes in U, i.e. \(\mathcal{M}[G[U \cup \text{ANCES}_U^+]]\). The term \(\text{ANCES}_U^+\) is an <strong>upward closure</strong> over the nodes in U. More precisely, \(\text{ANCES}_U^+\) contains all the nodes in U as well as the nodes on U’s boundary [1].</p>
<p>For DGMs, this upward closure is equivalent to the ancestors of the nodes in the partition, so this definition simplifies to D-separation. Similarly, for UGMs the upward closure contains all the other nodes in the graph, so this concept of c-separation also applies to UGMs [1]. In full, then, the global LWF Markov properties for a chain graph can be written as [1]:</p>
\[\mathcal{I}(G) = \{X \perp Y | Z \text{ s.t. } \text{ CSEP }(X, Y | Z) \forall \text{ disjoint } X, Y, Z \in V\}\]
<h2 id="alternative-markov-properties-amp">Alternative Markov Properties (AMP)</h2>
<p>So far the LWF Markov properties have seemed like a natural way to define conditional independencies for the graphs we’ve seen. However, there is another way to parameterize graphical models which, in turn, gives rise to another way to determine conditional independencies, called Alternative Markov Properties (AMP) [1]. To see how this works, let’s consider a really simple 4-node graph that has two chain components as shown in Figure 1.</p>
<p><img src="/images/2021-10-03-Graphs3.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Inspired by [1]</p>
<p>In LWF terms, we can see that [1]:</p>
<p>\(1 \perp 4 \mid \{2, 3\}\) <br />
\(2 \perp 3 \mid \{1, 4\}\)</p>
<p>But if we express the relationship between these 4 random variables another way, then we can look at things a little differently. Specifically, let’s apply a <strong>structural equation modeling</strong> (SEM)*1 approach to describing this graph [1]:</p>
<p>\(X_1 = \epsilon_1\)<br />
\(X_2 = \epsilon_2\)<br />
\(X_3 = b_{31}X_1 + \epsilon_3\)<br />
\(X_4 = b_{42}X_2 + \epsilon_4\)</p>
<p>Here the noise vectors are represented by \(\epsilon_i\), and we can assume that \((\epsilon_1, \epsilon_2) \perp (\epsilon_3, \epsilon_4)\) but that it is possible that \(\epsilon_1\) is related to \(\epsilon_2\) since they are in the same chain component [1]. Now that I have the same 4 random variables captured in this SEM representation, it should make sense that we can write the conditional independencies slightly differently [1]:</p>
<p>\(4 \perp 1 \mid 2\)<br />
\(3 \perp 2 \mid 1\)</p>
<p>The alternative Markov properties also give rise to another notion of separation, called <strong>AMP-Separation</strong>. The definition of AMP-separation for a chain graph, G, is that if we consider 3 disjoint sets X, Y and Z then AMP-separation holds if X is separated from Y given Z in the undirected graph which is [1]:</p>
<p>\(\mathcal{A}[G[U \cup \text{DIR-ANCES}_U] \cup UG[G[U \cup \text{UG-CONNECT}_U]]]\)</p>
<p>In order to understand this expression, we need to define some additional terms. First of all, \(\mathcal{A}[G]\) is termed the <strong>augmentation</strong> of a chain graph, G. This means that all of the flags and double flags in a chain graph have been augmented. A <strong>flag</strong> in a chain graph, G, is a group of 3 nodes that have any of the following sets of edges in G [1]:</p>
<p>\(X \rightarrow Y - Z\) <br />
\(X - Y \leftarrow Z\) <br />
\(X \rightarrow Y \leftarrow Z\)</p>
<p>Similarly, a <strong>double-flag</strong> in a chain graph, G, is a set of 4 nodes that have both of the following sets of edges in G [1]:</p>
<p>\(X \rightarrow Y - Z\) <br />
\(U \rightarrow Z - Y\)</p>
<p>To augment a flag, we add the edge (X-Z), and to augment a double-flag, we add the edges (X, Z), (Y, U), (X, U). Therefore \(\mathcal{A}[G]\) is the fully augmented chain graph [1].</p>
<p>There are 3 more terms that we need to define. First, we have a <strong>directed ancestral set</strong>, \(\text{DIR-ANCES}_U\), that is the set of all nodes, v, that have a directed path from v to a node in U. And secondly, we have an equivalent term for undirected paths, that is an <strong>undirected connected set</strong> \(\text{UG-CONNECT}_U\), which is the set of all nodes v such that there is an undirected path from v to some node in U. Finally, we have the term \(UG[G]\) which is the chain graph, G, with all of the directed edges removed [1].</p>
<p>That’s about all I have to share today. Next time I expect to write more about causality with graphical models. Stay tuned!</p>
<h2 id="footnotes">Footnotes</h2>
<p>*1 Structural equation modeling is a term that is a little broad but is generally used in the social sciences to model a system with some structure (i.e. a graph) that captures causal relationships between variables and includes some statistics to characterize these variables [2].</p>
<h2 id="references">References</h2>
<p>[1] Ravikumar, P. “Chain Graphical Models.” 10-708: Probabilistic Graphical Models. 2021. Class notes.</p>
<p>[2] “Structural equation modeling.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Structural_equation_modeling">https://en.wikipedia.org/wiki/Structural_equation_modeling</a> Visited 03 Oct 2021.</p>

<h1 id="graphs2">Undirected and Directed Graph Properties</h1>
<p>2021-09-20 — <a href="http://sassafras13.github.io/Graphs2">http://sassafras13.github.io/Graphs2</a></p>
<p>Today we are back and talking about the properties of undirected graphical models and directed graphical models. The term <strong>undirected</strong> simply means that the edges connecting the nodes of the graph do not flow in a particular direction. Conversely, <strong>directed</strong> graphs have edges that only allow flow in certain directions.</p>
<h2 id="undirected-graphs-and-global-markov-properties">Undirected Graphs and Global Markov Properties</h2>
<p>Let’s say we have an undirected graph (UG) written as \(G = (V, E)\) where the nodes, \(V\), represent variables in a random vector of variables, \(X\). There are some properties that the random vector, \(X\), can satisfy to ensure conditional independence. We call these “global Markov properties” and we should be able to find them just by looking at the graph structure [1].</p>
<p><img src="/images/2021-09-20-Graphs2-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source [2]</p>
<p>Let’s consider a UG as shown in Figure 1. We can say there is a path from node 1 to 6 that runs as follows: 1 -> 2 -> 3 -> 5 -> 6. This is an <strong>active</strong> path given nodes {4, 7} because nodes 4 and 7 are not contained in this path [1].</p>
<p>Conversely, if we want to write down what nodes separate node 1 from node 6, then we would say that the set Z = {2, 3, 5} separates nodes 1 and 6. This is because there is no active path from 1 to 6 without these nodes. In other words, if we remove the nodes 2, 3 and 5, then nodes 1 and 6 lie in disconnected graph components. This idea of separation for UGs can be written as [1]:</p>
\[\text{SEP}_G(X, Y | Z)\]
<p>More formally then, for a given graph G, I can write a list of conditional independencies like so [1]:</p>
\[\mathcal{I}(G) = \{X \perp \!\!\! \perp Y | Z : \text{SEP}_G(X, Y | Z)\}\]
<p>This list is called the <strong>global Markov properties</strong> of the UG \(G\) [1].</p>
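Separation in a UG is easy to check mechanically: delete the nodes in Z and ask whether X can still reach Y. Here is a sketch (the edge list below is a hypothetical graph loosely matching the path described for Figure 1, not its exact structure):

```python
from collections import deque

def separated(adj, xs, ys, zs):
    """SEP_G(X, Y | Z): no path from X to Y once the nodes in Z are deleted."""
    frontier = deque(set(xs) - set(zs))
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in ys:
            return False
        for v in adj.get(u, ()):
            if v not in zs and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Hypothetical graph containing the path 1 - 2 - 3 - 5 - 6, plus nodes 4 and 7.
edges = [(1, 2), (2, 3), (3, 5), (5, 6), (3, 4), (5, 7)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

assert not separated(adj, {1}, {6}, set())   # an active path exists
assert separated(adj, {1}, {6}, {2, 3, 5})   # Z = {2, 3, 5} separates 1 and 6
```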
<p>There is an interesting idea in graphical models that there is an overarching distribution, \(P\), which can be represented by some graph \(G\). We often say that <strong>the distribution \(P\) factors according to \(G\)</strong>. We can also say that any distribution \(P\) that factors according to \(G\) will satisfy the global Markov properties associated with \(G\), that is \(\mathcal{I}(G) \subseteq \mathcal{I}(P)\), where \(\mathcal{I}(P)\) is the set of all conditional independencies satisfied by \(P\) [1].</p>
<p>We also have 2 other kinds of Markov properties for a UG. First, there exist <strong>pairwise Markov properties</strong> which is the set [1]:</p>
\[\mathcal{I}_p(G) = \{X \perp \!\!\! \perp Y | \mathcal{X} - \{X, Y\} : (X, Y) \notin E(G)\}\]
<p>Where \(\mathcal{X} = X \cup Y \cup Z\). This basically says that \(X\) and \(Y\) are conditionally independent given all the nodes that are not contained in \(X\) and \(Y\), where there are no edges directly connecting \(X\) and \(Y\) (i.e. where they are disconnected subgraphs given all the other nodes in the graph) [1].</p>
<p>In order to introduce the other kind of Markov properties, we first need to define a <strong>Markov blanket</strong> as follows [1]:</p>
\[MB_G(X) = \text{NBRS}(X)\]
<p>Where \(\text{NBRS}(X)\) means the set of neighbors of nodes X. So now I can describe the <strong>local Markov properties</strong> where [1]:</p>
\[\mathcal{I}_l(G) = \{X \perp \!\!\! \perp \mathcal{X} - \{X\} - MB_G(X) | MB_G(X)\}\]
<p>This means that the nodes in X are conditionally independent of all the other nodes in the graph except other nodes in X and the immediate neighbors of X, given the neighbors of X. That is, if we removed all the immediate neighbors of X from the graph, that would make X a disconnected graph from the rest of \(\mathcal{X}\), indicating that it is conditionally independent from everything else in the graph [1].</p>
<h2 id="d-separation-for-dgms">D-Separation for DGMs</h2>
<p>Similar to UGMs, for DGMs we can say that two sets of nodes, \(X\) and \(Y\), are conditionally independent given another set of nodes, \(Z\), if the nodes \(Z\) separate \(X\) and \(Y\). However, the idea of separation for DGMs is a little more complicated than for UGMs, so we need to spend some time defining the DGM-specific form of separation, \(\text{DSEP}_G\). So far, we know that [3]:</p>
\[\mathcal{I}(G) = \{X \perp \!\!\! \perp Y | Z : \text{DSEP}_G(X, Y | Z)\}\]
<p>Consider Figure 2. We say that there is a path from A to F as long as there is a set of edges connecting them (not necessarily all pointing the same way). For example, a valid path from A to F would be A -> B -> C -> E -> F. We would say that A was <strong>d-separated</strong> from F if all the paths from A to F were blocked by some subset \(Z\), which in this case might be \(Z = \{B, C, D, E\}\) [3].</p>
<p><img src="/images/2021-09-20-Graphs2-fig2.png" alt="Fig 1" title="Figure 2" /> <br />
Figure 2 - Source [4]</p>
<p>On the surface this might seem as straightforward as the notion of separation for UGMs. But there are some cases where this will get more complicated. Let’s look at some of them [3]:</p>
<ul>
<li>
<p><strong>Causal trail</strong>: \(X \rightarrow Z \rightarrow Y\) is blocked (i.e. \(X \perp \!\!\! \perp Y\)) iff \(Z\) is observed (i.e. given \(Z\)).</p>
</li>
<li>
<p><strong>Evidential trail</strong>: \(X \leftarrow Z \leftarrow Y\) is blocked iff \(Z\) is observed.</p>
</li>
<li>
<p><strong>Common cause</strong>: \(X \leftarrow Z \rightarrow Y\) is blocked iff Z is observed. One example of this is if X was shoe size and Y was gray hair. They are marginally dependent, but if we conditioned them on age, Z, then they are conditionally independent of each other now that we know how old the person is.</p>
</li>
<li>
<p><strong>Common effect</strong>: \(X \rightarrow Z \leftarrow Y\) is blocked iff <strong>neither</strong> Z nor any of its descendants, \(\text{DESC}_Z\), is observed. This indicates that \(\text{DSEP}(X, Y)\) but not necessarily given \(Z\). One way to think about this is if \(X\) is vomiting and \(Y\) is having a sore throat, and \(Z\) is having a cold. If we know that we have a cold and a sore throat, then it is less likely that we are also vomiting, but it is not zero probability. That is, \(X\) is not independent of \(Y\) given \(Z\). (In other words, I could be hungover and have a cold at the same time.)</p>
</li>
</ul>
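<p>The common effect (explaining-away) case can be checked numerically. Below is a minimal sketch where \(X\) and \(Y\) are independent binary causes and \(Z\) is their common effect; all the probability values are invented for illustration. Marginally, \(X\) and \(Y\) are independent, but conditioning on \(Z\) couples them:</p>

```python
# Hypothetical explaining-away demo: X and Y are independent binary causes,
# and Z = X or Y is their common effect. All probabilities here are invented.
from itertools import product

pX, pY = 0.3, 0.3
joint = {}
for x, y in product([0, 1], repeat=2):
    z = int(x or y)  # deterministic common effect
    joint[(x, y, z)] = (pX if x else 1 - pX) * (pY if y else 1 - pY)

def P(event, given=lambda e: True):
    """P(event | given), by brute-force enumeration of the joint table."""
    den = sum(p for e, p in joint.items() if given(e))
    num = sum(p for e, p in joint.items() if event(e) and given(e))
    return num / den

# Marginally independent: P(X=1 | Y=1) equals P(X=1)
print(P(lambda e: e[0] == 1, lambda e: e[1] == 1))                # ≈ 0.3
# But conditioning on the common effect couples them:
print(P(lambda e: e[0] == 1, lambda e: e[2] == 1))                # ≈ 0.588
print(P(lambda e: e[0] == 1, lambda e: e[2] == 1 and e[1] == 1))  # ≈ 0.3
```

<p>Learning that \(Y = 1\) already explains the observed \(Z = 1\), which drops \(P(X = 1)\) back to its prior — the explaining-away effect described above.</p>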
<p>Given what we know now, a path in a DGM is blocked given some set of nodes, \(Z\), iff one of these two things is true [3]:</p>
<ol>
<li>
<p>There is a <strong>v-structure</strong> of consecutive nodes \(X_{i-1} \rightarrow X_i \leftarrow X_{i+1}\) such that neither \(X_i\) nor any of its descendants is in \(Z\).</p>
</li>
<li>
<p>There is a node in \(Z\) on the path that is not the common child in a v-structure (i.e. that is not \(X_i\)).</p>
</li>
</ol>
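<p>These two rules translate directly into a path-blocking test. The sketch below is a hypothetical implementation (node names and the example DAG are my own): it enumerates all simple paths over the skeleton of a DAG stored as a parent-to-children dict and applies the two rules above to each path:</p>

```python
# A minimal d-separation sketch based on the two blocking rules above.
# The DAG is a dict mapping each node to its list of children; the example
# graph X -> Z <- Y, Z -> W is hypothetical.

def parents(dag, n):
    return [p for p, cs in dag.items() if n in cs]

def descendants(dag, n):
    out, stack = set(), [n]
    while stack:
        for c in dag[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def all_paths(dag, x, y):
    # All simple paths between x and y over the undirected skeleton.
    def walk(path):
        if path[-1] == y:
            yield path
            return
        for n in set(dag[path[-1]]) | set(parents(dag, path[-1])):
            if n not in path:
                yield from walk(path + [n])
    yield from walk([x])

def path_blocked(dag, path, Z):
    for a, m, b in zip(path, path[1:], path[2:]):
        collider = m in dag[a] and m in dag[b]  # a -> m <- b (v-structure)
        if collider:
            # Rule 1: blocked if neither m nor any descendant of m is in Z.
            if m not in Z and not (descendants(dag, m) & Z):
                return True
        elif m in Z:
            # Rule 2: a non-collider node on the path is observed.
            return True
    return False

def d_separated(dag, x, y, Z):
    return all(path_blocked(dag, p, set(Z)) for p in all_paths(dag, x, y))

dag = {"X": ["Z"], "Y": ["Z"], "Z": ["W"], "W": []}
print(d_separated(dag, "X", "Y", set()))   # True: the v-structure blocks
print(d_separated(dag, "X", "Y", {"Z"}))   # False: observing Z unblocks
print(d_separated(dag, "X", "Y", {"W"}))   # False: so does a descendant of Z
```

<p>Note that this brute-force version enumerates every path, which is fine for small examples; production implementations use a reachability ("Bayes ball") algorithm instead.</p>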
<h2 id="references">References</h2>
<p>[1] Ravikumar, P. “UGMs: Markov Properties.” 10-708: Probabilistic Graphical Models. 2021. Class notes.</p>
<p>[2] Terelius, Håkan. (2010). Distributed Multi-Agent Optimization via Dual Decomposition.</p>
<p>[3] Ravikumar, P. “DGMs: Markov Properties.” 10-708: Probabilistic Graphical Models. 2021. Class notes.</p>
<p>[4] “The web as a directed graph.” Computer Science Wiki. https://computersciencewiki.org/index.php/The_web_as_a_directed_graph Accessed 20 Sept 2021.</p>

<p><strong>The Basics for Graphical Models</strong> (2021-09-18) http://sassafras13.github.io/Graphs</p>

<p>I have started to learn about the mathematical fundamentals behind graphical models, and so I am going to write some posts about what I am learning. Today I am going to start with a relatively simple introduction to the idea of graph networks and a review of probability. Many thanks to Daphne Koller and Nir Friedman’s excellent textbook for the education [1].</p>
<h2 id="introduction-to-probabilistic-graphical-models">Introduction to Probabilistic Graphical Models</h2>
<p>Probabilistic graphical models (aka graph networks) are a way of encoding a “declarative representation” of a system [1]. They separate knowledge from reasoning. The representation contains information separately from algorithms that can act over the graph, so by building a graph, we allow ourselves to apply a whole group of algorithms that generally work over all kinds of graphs [1].</p>
<p>PGMs are really good for systems that have a lot of uncertainty. They also have structure that allows them to describe complex joint probability distributions in a compact way [1].</p>
<p><img src="/images/2021-09-18-Graphs-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source [1]</p>
<p>For example, this notation means that “\(X\) is conditionally independent of \(Y\) given \(Z\)”:</p>
\[X \perp \!\!\! \perp Y | Z\]
<p>So for example, we can say that:</p>
\[(congestion \perp \!\!\! \perp season | flu, hayfever)\]
<p>which means that the random variable congestion is independent of the season given knowledge about the random variables flu and hayfever’s values. In other words, if I want to know the probability distribution of having congestion, and I know the values of the flu and hayfever random variables, then knowing about the season is no longer useful to me - i.e. knowing the season doesn’t tell me anything I didn’t already know at this point [1]. In mathematical terms:</p>
\[P(congestion | flu, hayfever, season) = P(congestion | flu, hayfever)\]
<p>Another way to think about graphs is that they represent a set of factors that can be multiplied together to compute a probability distribution. This set of factors is a more compact way of representing the entire probability distribution [1]. As shown in Figure 1, we can write:</p>
\[P(S, F, H, C, M) = P(S) P(F | S) P(H | S) P(C | F, H) P(M | F)\]
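<p>This factorization can be evaluated directly. Here is a minimal sketch for binary versions of the Figure 1 variables; all of the conditional probability table numbers are invented for illustration:</p>

```python
# Computing the joint from the factors of the DAG in Figure 1. All CPT
# numbers below are invented; each variable is binary.
from itertools import product

p_s = 0.4                          # P(S=1)
p_f = {1: 0.3, 0: 0.1}             # P(F=1 | S=s)
p_h = {1: 0.4, 0: 0.1}             # P(H=1 | S=s)
p_c = {(1, 1): 0.9, (1, 0): 0.8,
       (0, 1): 0.7, (0, 0): 0.05}  # P(C=1 | F=f, H=h)
p_m = {1: 0.6, 0: 0.2}             # P(M=1 | F=f)

def bern(p1, value):
    """Turn P(V=1) into P(V=value) for a binary variable."""
    return p1 if value == 1 else 1 - p1

def joint(s, f, h, c, m):
    return (bern(p_s, s) * bern(p_f[s], f) * bern(p_h[s], h)
            * bern(p_c[(f, h)], c) * bern(p_m[f], m))

# Sanity check: the product of local factors defines a proper distribution.
total = sum(joint(*e) for e in product([0, 1], repeat=5))
print(round(total, 10))  # 1.0
```

<p>The compactness claim is visible here too: the five small tables hold 9 free parameters, versus \(2^5 - 1 = 31\) for an unstructured joint over five binary variables.</p>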
<p>Some key things to remember [1]:</p>
<ul>
<li>
<p><strong>Bayesian networks use directed graphs</strong> and <strong>Markov networks use undirected graphs</strong>.</p>
</li>
<li>
<p>The lack of an edge between 2 nodes indicates conditional independence between those two nodes. The presence of an edge does not necessarily indicate dependence, though!</p>
</li>
<li>
<p>A distribution, P, is positive if, \(\forall\) events \(\alpha \in \mathcal{S}\) such that \(\alpha \neq \emptyset\), we have \(P(\alpha) > 0\). Note that by definition the probability of an event can be 0, so the distinction in stating that a distribution is positive means that no non-empty event can have a probability of zero.</p>
</li>
</ul>
<h2 id="basic-probability">Basic Probability</h2>
<p>Events have a space of possible values, or outcomes. For rolling a die, the result can be {1,2,3,4,5,6}. We can assign probabilities to each event [1].</p>
<p>There is a <strong>frequentist vs subjective</strong> debate over how to interpret the values of probabilities. The frequentist view sees probabilities as describing how frequently an event will occur. For example, it tells you how often a roll of the dice will result in rolling a 3. This doesn’t work so well for probabilities that describe, for example, the likelihood of snow today. It will either snow or not today - it is a single event. The subjective perspective instead thinks of probabilities as describing how strongly we believe something is going to happen. For example, if the probability of snow today is 50%, then I think it is equally likely that it will or will not snow [1].</p>
<p>If I want to use my knowledge about the probability of one event to reason about the probability of another, I can use conditional probability. The formal definition of event \(\beta\) happening given knowledge about event \(\alpha\) is [1]:</p>
\[P(\beta | \alpha) = \frac{ P(\alpha \cap \beta) }{ P(\alpha) }\]
<p>Conditional probabilities have a property known as the <strong>chain rule</strong>, which just rearranges the equation above [1]:</p>
\[P( \alpha \cap \beta) = P(\alpha) P(\beta | \alpha)\]
<p>And in general this works over more events [1]:</p>
\[P( \alpha_1 \cap . . . \cap \alpha_k) = P(\alpha_1) P(\alpha_2 | \alpha_1) . . . P(\alpha_k | \alpha_1 \cap . . . \cap \alpha_{k-1})\]
<p>We also have <strong>Bayes Rule</strong> [1]:</p>
\[P(\alpha | \beta) = \frac{P(\beta | \alpha) P(\alpha)}{ P(\beta) }\]
<p>Formally, <strong>random variables describe properties of the outcome of an event</strong>. For example, a student is an event and a random variable describing the student is their grade. It can be thought of as a function that maps the event to its attributes. Random variables can be discrete or continuous [1].</p>
<p>A <strong>marginal distribution is the distribution over events that have random variable X</strong>. P(X) is a marginal distribution over the random variable X. A <strong>joint distribution is a distribution that gives the probabilities that events occur which have the properties described by all the random variables \(X_i\)</strong> [1].</p>
<p>The joint distribution has to be consistent with the marginal distribution, that is [1]:</p>
\[P(x) = \sum_y P(x, y)\]
<p>In fact, the term marginal refers to the fact that when we sum over all events \(y\) to find \(P(x)\), we write the sums in the margins of the probability table [1].</p>
<p>Conditional probability is not the same as marginal probability. Marginal probabilities tell us about our prior knowledge about the system before we know anything else about a specific event. The <strong>conditional distribution represents what we know after learning information about the event</strong> [1].</p>
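<p>Both ideas can be sketched with a tiny joint table (the numbers below are invented):</p>

```python
# A small invented joint distribution P(X, Y) stored as a table. The
# marginal sums out Y; the conditional renormalizes by the marginal.
joint = {("x0", "y0"): 0.1, ("x0", "y1"): 0.2,
         ("x1", "y0"): 0.3, ("x1", "y1"): 0.4}

def marginal(x):
    """P(x) = sum_y P(x, y) -- the row sum written 'in the margin'."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional(y, x):
    """P(y | x) = P(x, y) / P(x)."""
    return joint[(x, y)] / marginal(x)

print(marginal("x1"))           # ≈ 0.7, our prior over X alone
print(conditional("y1", "x1"))  # ≈ 0.571, updated once we learn X = x1
```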
<p>There is a notion of <strong>independence</strong> in probability. In general we expect that</p>
\[P(\alpha|\beta) \neq P(\alpha)\]
<p>In other words, we expect that knowing \(\beta\) changes our probability distribution over \(\alpha\). But in some situations, learning \(\beta\) can have no impact on \(P(\alpha)\), i.e.</p>
\[P(\alpha | \beta) = P(\alpha)\]
<p>Formally, an event \(\alpha\) is independent of event \(\beta\) in P, or</p>
\[P \models (\alpha \perp \!\!\! \perp \beta) \text{ if } P( \alpha | \beta) = P(\alpha) \text{ or if } P(\beta) = 0\]
<p>Another way to say this is that \(P(\alpha \cap \beta) = P(\alpha)P(\beta)\) [1].</p>
<p>In the wild, it is more likely that we will see <strong>conditional independence</strong> than pure independence. That is, we are more likely to see cases where 2 events are independent given knowledge about an additional event. The formal definition for conditional independence is that event \(\alpha\) is independent of event \(\beta\) given event \(\gamma\) in P such that [1]:</p>
\[P \models (\alpha \perp \!\!\! \perp \beta | \gamma) \text{ if } P(\alpha | \beta \cap \gamma) = P(\alpha | \gamma) \text{ or if } P(\beta \cap \gamma) = 0\]
<p>Another way to look at it [1]:</p>
\[P \models (\alpha \perp \!\!\! \perp \beta | \gamma) \text{ iff } P(\alpha \cap \beta | \gamma) = P(\alpha | \gamma) P(\beta | \gamma)\]
<p>That’s all the basics for now. Next time I expect I will start writing about Markov Properties and possibly about different kinds of graphs. Thanks for reading!</p>
<h2 id="references">References</h2>
<p>[1] Koller, D., Friedman, N. Probabilistic Graphical Models: Principles and Techniques. The MIT Press. Cambridge, Massachusetts. 2009.</p>

<p><strong>Google’s Python Style Guide Part 1 - Functionality</strong> (2021-07-11) http://sassafras13.github.io/PythonStyleGuideFunc</p>

<p>I just recently learned that Google published a style guide for how their developers write clean code in Python [1]. I wanted to use a couple of posts to outline some of the things I learned from that style guide. I will write this post to describe some of the functional recommendations given in the style guide, and a follow-up post will detail some of the specific style requirements Google listed. Let’s get started!</p>
<h2 id="googles-python-style-guide---functional-recommendations">Google’s Python Style Guide - Functional Recommendations</h2>
<p>The first half of Google’s style guide focuses on best practices for using different functionalities within Python. I should note that there are more recommendations than I am giving here - I have selected the items that were relevant to aspects of Python that I already use or want to use more frequently. I would highly recommend glancing through the style guide yourself if you want a more complete picture of Google’s recommendations. But for now, here is what I thought was important [1]:</p>
<p><strong>Use a code linter.</strong> A code linter is a tool that looks at code and identifies possible errors, bugs or sections that are poorly written and could contain syntax errors [2]. Google recommends using a Python library like pylint to check your code before deploying it.</p>
<p><strong>Use import statements for packages and modules but not individual classes or functions.</strong> I think this recommendation helps with namespace management - if you are only importing complete packages/modules, then we will always be able to trace back specific classes or functions to those libraries (i.e. we know that module.class is a class that belongs to “module”). This practice also helps prevent collisions (i.e. having multiple functions with the same name).</p>
<p><strong>Import modules by full pathname location.</strong> This is important for helping the code to find modules correctly. Google recommends writing this:</p>
<p><code class="language-plaintext highlighter-rouge">from doctor.who import jodie</code></p>
<p>Instead of writing this:</p>
<p><code class="language-plaintext highlighter-rouge">import jodie</code></p>
<p><strong>Use exceptions carefully.</strong> Usually exceptions are only for breaking out of the flow for specific errors and special cases. Google recommends using built-in exception classes (like KeyError, ValueError, etc. [3]) whenever possible. You should try to avoid using the “Except:” statement on its own because it will catch too many situations that you probably don’t want to have to handle. On a similar note, try to avoid having too much code in a try-except block and make sure to always end with “finally” to make sure that essential actions are always completed (like closing files).</p>
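<p>A small sketch of that advice (the file name, default value, and function are hypothetical): keep the try block narrow, catch specific built-in exception classes, and use finally for cleanup:</p>

```python
# Narrow try block, specific exception classes, and finally for cleanup.
# The config file name and fallback value are hypothetical.
def read_threshold(path="config.txt"):
    f = None
    try:
        f = open(path)
        return float(f.read().strip())  # may raise ValueError
    except FileNotFoundError:
        return 0.5                      # sensible default, not a bare except
    except ValueError:
        raise                           # a malformed config is a real error
    finally:
        if f is not None:
            f.close()                   # always runs, even when we raise
```

<p>In modern code a with-block usually replaces the explicit finally-based close, but the structure above shows the narrow try / specific except / finally flow the guide describes.</p>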
<p><strong>Do not use global variables.</strong> Global variables can be variables that have scopes including an entire module or class. Python does not have a specific datatype for constants like other languages, but you can still stylistically create them [4], for example by writing them as _MY_CONSTANT = 13. The underscore at the beginning of the variable name indicates that the variable is internal to the module or class that is using it.</p>
<p><strong>It is okay to use comprehensions and generators on simple cases, but avoid using them for more complicated situations.</strong> Comprehensions*1 and generators*2 are really useful because they do not require for loops, and they are elegant and easy to read. They also do not require much memory. However, complicated constructions of comprehensions/generators can make your code more opaque. Generally, Google recommends using comprehensions/generators as long as they fit on one line or the individual components can be separated into individual lines.</p>
<p><strong>Use default iterators and operators for data types that support them.</strong> Some data types, like lists and dictionaries, support specific iterator keywords like “in” and “not in.” It’s acceptable to use these iterators because they are simple, readable and efficient, but you want to make sure that you do not change a container when you are iterating over it (since lists and dictionaries are <a href="https://sassafras13.github.io/MutvsImmut/">mutable objects</a> in Python).</p>
<p><strong>Lambda functions are acceptable as one-liners.</strong> Lambda functions define brief functions in an expression, such as [7]:</p>
<p><code class="language-plaintext highlighter-rouge">(lambda x: x + 1)(2) = 2 + 1 = 3</code></p>
<p>They are convenient but hard to read and debug. They also are not explicitly named, which can be a problem. Google recommends that if your lambda function is longer than 60 to 80 characters, then you should just write a proper function instead.</p>
<p><strong>Default argument values can be useful in function definitions.</strong> You can assign default values to specific arguments to a function. You always want to place these parameters last in the list of arguments for a given function. This is a good practice when the normal use case for a function requires default values, but you want to give the user the ability to override those values in special circumstances. One downside to this practice is that the defaults are only evaluated <em>once</em> when the module containing the function is loaded. If the argument’s value is <em>mutable</em>, and it gets modified during runtime, then the default value for the function has been modified <em>for all future uses</em> of that function!*3 So the best practice to avoid this issue is to make sure that you do not use mutable objects as default values for function arguments.</p>
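<p>The mutable-default gotcha described above is easy to demonstrate (the function names are hypothetical):</p>

```python
# The mutable-default-argument gotcha: the default list is created once,
# at function definition time, and shared across every call.
def append_bad(item, items=[]):
    items.append(item)
    return items

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2]  <- surprise: the default was mutated!

# The standard fix: use None as a sentinel and build the list per call.
def append_good(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items

print(append_good(1))  # [1]
print(append_good(2))  # [2]
```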
<p><strong>Use implicit false whenever possible.</strong> All empty values are considered false in a Boolean context, which can really help with improving your code readability. For example, this is how we would write an implicit false:</p>
<p><code class="language-plaintext highlighter-rouge">if foo: …</code></p>
<p>This is the explicit version, which is not as clean:</p>
<p><code class="language-plaintext highlighter-rouge">if foo != [ ]: …</code></p>
<p>Not only is the implicit approach cleaner, it is also less error prone. The only exception is <strong>if you are checking integers</strong>, when you want to be explicit, i.e.:</p>
<p><code class="language-plaintext highlighter-rouge">if foo == 0: …</code></p>
<p>In this case you want to be clear about whether you want to know if the integer variable’s value is zero, or if it is simply empty (in which case you would use “if foo is None”). Also, remember that empty sequences are false - you don’t need to check if they’re empty using “len(sequence)”.</p>
<p><strong>Annotate code with type hints.</strong> This is especially good practice for function definitions. It helps with readability and maintainability of your code. It often looks like this:</p>
<pre><code class="language-(python3)">def myFunc(a: int) -> list[int]:
…
</code></pre>
<p>That is all for today’s post on functional recommendations in Google’s Python style guide. Next time, I will write more specifically about the stylistic recommendations that Google provides for coding in Python. Thanks for reading!</p>
<h2 id="footnotes">Footnotes</h2>
<p>*1 Comprehensions are a tool in Python that let you iterate over certain data types like lists, sets, or generators. They can make your code more elegant, and allow you to generate iterables in one line of code. The syntax for a comprehension looks like this [5]:
<code class="language-plaintext highlighter-rouge">new_list = [expression for member in iterable]</code></p>
<p>*2 Generator functions are useful for iterating over really large datasets. They are called “lazy iterators” because they do not store their internal state in memory. They also use the “yield” statement instead of the “return” statement. This means that they can send a value back to the code that is calling the generator function, but they don’t have to exit after they have returned, as in a regular function. This allows generator functions to remember their state. In this way generators are very memory efficient but allow for iteration similar to comprehensions [6].</p>
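<p>For instance, here is a hypothetical generator that illustrates the yield behavior described above:</p>

```python
# A lazy generator: values are produced one at a time with yield, so the
# whole sequence is never held in memory at once.
def running_total(numbers):
    total = 0
    for n in numbers:
        total += n
        yield total  # state (total, loop position) is remembered here

gen = running_total(range(1, 5))
print(next(gen))  # 1
print(list(gen))  # [3, 6, 10] -- the generator resumed where it left off
```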
<p>*3 This happened to a classmate of mine once, and he said it almost ruined a paper submission for him. This is covered in detail in [8].</p>
<h2 id="references">References</h2>
<p>[1] “Google Python Style Guide.” <a href="https://google.github.io/styleguide/pyguide.html">https://google.github.io/styleguide/pyguide.html</a> Visited 11 Jul 2021.</p>
<p>[2] Mallett, E. E. “Code Lint - What is it? What can help?” DCCoder. 20 Aug 2018. <a href="https://dccoder.com/2018/08/20/code-lint/">https://dccoder.com/2018/08/20/code-lint/</a> Visited 28 Jun 2021.</p>
<p>[3] “Built-in Exceptions.” The Python Standard Library. <a href="https://docs.python.org/3/library/exceptions.html">https://docs.python.org/3/library/exceptions.html</a> Visited 11 Jul 2021.</p>
<p>[4] Hsu, J. “Does Python Have Constants?” Better Programming on Medium. 7 Jan 2020. <a href="https://betterprogramming.pub/does-python-have-constants-3b8249dc8b7b">https://betterprogramming.pub/does-python-have-constants-3b8249dc8b7b</a> Visited 11 Jul 2021.</p>
<p>[5] Timmins, J. “When to Use a List Comprehension in Python.” Real Python. <a href="https://realpython.com/list-comprehension-python/">https://realpython.com/list-comprehension-python/</a> Visited 11 Jul 2021.</p>
<p>[6] Stratis, K. “How to Use Generators and yield in Python.” Real Python. <a href="https://realpython.com/introduction-to-python-generators/">https://realpython.com/introduction-to-python-generators/</a> Visited 11 Jul 2021.</p>
<p>[7] Burgaud, A. “How to Use Python Lambda Functions.” Real Python. <a href="https://realpython.com/python-lambda/">https://realpython.com/python-lambda/</a> Visited 11 Jul 2021.</p>
<p>[8] Reitz, K. “Common Gotchas.” The Hitchhiker’s Guide to Python. <a href="https://docs.python-guide.org/writing/gotchas/">https://docs.python-guide.org/writing/gotchas/</a> Visited 11 Jul 2021.</p>

<p><strong>Week of July 5 Paper Reading</strong> (2021-07-06) http://sassafras13.github.io/WeekJul5Rdg</p>

<p>This week I have been interested in reading papers about how to model time series data using unsupervised methods in machine learning. I will briefly summarize a couple of papers on the topic below.</p>
<h2 id="paper-1-velc-a-new-variational-auto-encoder-based-model-for-time-series-anomaly-detection-by-zhang-et-al">Paper 1: VELC: A New Variational Auto Encoder Based Model for Time Series Anomaly Detection by Zhang et al.</h2>
<p>This paper presents a method for finding anomalies in time series data using variational autoencoders. I did not know what <strong>anomaly detection</strong> really was until I read this paper - it is essentially the practice of looking for rare events in the data that are very different from the rest of the dataset, but are likely to be important, not random noise. Anomaly detection can be really difficult to do in a supervised fashion because the size of the anomaly class will generally be much smaller than the size of the “normal” class. But this paper proposes an unsupervised learning approach that side-steps that problem [1].</p>
<p>The authors introduce a VAE that has an additional re-Encoder and Latent Constraint network (VELC) that helps the model tell the difference between normal and anomalous data based on how well the model can reconstruct the input data. The basic idea here is that the model is trained to encode and decode normal data, and as training progresses it will minimize its reconstruction error (i.e. how different the reconstructed data is from the original input data). Then when the model is given a mix of normal and anomalous test data, the reconstruction error should increase dramatically for the anomalous samples as compared to the normal samples, indicating which samples are anomalous. So if the reconstruction error is small, the input is normal; if the error is large, the input is anomalous [1].</p>
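<p>The decision rule at the heart of this idea — not the paper’s actual implementation, just a sketch — reduces to thresholding a reconstruction error. Here the reconstruct function and the smoothing stand-in are hypothetical placeholders for a trained encoder/decoder:</p>

```python
# Sketch of reconstruction-error anomaly scoring (not the VELC code itself).
# `reconstruct` stands in for a trained encoder/decoder; the threshold and
# toy data are invented.
import numpy as np

def anomaly_flags(windows, reconstruct, threshold):
    errors = np.array([np.mean((w - reconstruct(w)) ** 2) for w in windows])
    return errors > threshold, errors

# Toy stand-in: a "reconstruction" that smooths the window, so windows with
# sharp spikes reconstruct poorly and score as anomalies.
smooth = lambda w: np.convolve(w, np.ones(3) / 3, mode="same")
normal = np.sin(np.linspace(0, 6, 50))
spiky = normal.copy(); spiky[25] += 5.0

flags, errs = anomaly_flags([normal, spiky], smooth, threshold=0.05)
print(flags)  # [False  True]
```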
<p><img src="/images/2021-07-06-WeekJul5Rdg-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source: [1]</p>
<p>A more detailed view of the VELC model is shown in Figure 1. The VAE itself uses an LSTM as the encoder and decoder, because the LSTM is designed to process time-series data. There is a constraint network that learns the latent space in the VAE alongside the encoder and decoder during training. The purpose of the constraint network is to limit the samples pulled from the latent space during testing to only look like samples it saw during training - in other words, the constraint network ensures that the VELC model only pulls normal samples from the latent space of the VAE. The second re-encoder maps the output of the first decoder to a new latent space. The authors argue that the second re-encoder helps to ensure that the model trains more accurately, and computes more accurate anomaly scores, than it would with just a classical VAE structure [1].</p>
<h2 id="paper-2-a-deep-neural-network-for-unsupervised-anomaly-detection-and-diagnosis-in-multivariate-time-series-data-by-zhang-et-al">Paper 2: A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data by Zhang et al.</h2>
<p>This paper introduces a new model, Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED), which extends the capabilities of VELC so that instead of considering a single time series, we can perform anomaly detection across multiple time series at the same time. (The authors refer to this as <strong>multivariate time series</strong> data.) Zhang et al. argue that their model is the first to simultaneously complete 3 tasks [2]:</p>
<ol>
<li>Anomaly detection: as above</li>
<li>Root cause identification: identifying which time series signal(s) in the input is contributing to the anomaly</li>
<li>Anomaly severity: giving the user a metric that estimates how strongly the anomaly deviates from normal data</li>
</ol>
<p><img src="/images/2021-07-06-WeekJul5Rdg-fig2.png" alt="Fig 2" title="Figure 2" /> <br />
Figure 2 - Source: [2]</p>
<p>The graphical abstract for this model is given in Figure 2. The basic idea of this model is similar to the VELC model in that it is also trying to reconstruct an input signal, and using the reconstruction error as an indication of whether or not that input is normal or anomalous. The MSCRED model also uses a variant of an LSTM to handle the time series data, similar to VELC. The difference is that the MSCRED model assumes that there may be useful correlations between different time series signals that we should look for in order to identify anomalies in the full dataset. Let’s take a closer look at the different components of the MSCRED model to understand how it looks for correlations across time series signals as well as within them [2].</p>
<p>The authors assume that the raw data is in the form of <em>n</em> time series that extend for <em>T</em> time; they also assume that the data is normal for time in the interval [0, <em>T</em>] but that the data input to the model after that time can be abnormal. Their real-world example is a power plant that has time series data from different sensors that together can be used to look for anomalies that could be indications of potential failures. But the input to the MSCRED is not the raw time series data. Instead, Zhang et al. compute the pairwise correlations between each time series and save that data in <em>n x n</em> <strong>signature matrices</strong>. It is these signature matrices that then become the input to the model [2].</p>
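<p>A sketch of how such signature matrices might be computed (the window length and normalization here are my assumptions for illustration, not necessarily the paper’s exact formula):</p>

```python
# Hypothetical signature-matrix computation for n time series: each entry
# is the window-averaged inner product of a pair of series. The window
# length and normalization are assumptions for illustration.
import numpy as np

def signature_matrix(series, t, w=10):
    """series: (n, T) array; returns the (n, n) signature matrix at time t."""
    window = series[:, t - w:t]   # last w samples of each series
    return window @ window.T / w  # all pairwise correlations at once

rng = np.random.default_rng(0)
series = rng.standard_normal((4, 100))  # 4 hypothetical sensor channels
M = signature_matrix(series, t=50)
print(M.shape)  # (4, 4)
```

<p>The matrix is symmetric by construction, since entry \((i, j)\) and entry \((j, i)\) are the same inner product.</p>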
<p>The first component of the model is a convolutional encoder that is designed to look for correlations across the time series signals (i.e. for correlations between the entries in the signature matrix, where each entry is the correlation between two signals). This convolutional encoder learns to represent the spatial information in the signature matrices, and passes this on to an attention-based convolutional LSTM model (ConvLSTM). The authors explain that they adapted the original ConvLSTM model [3], which was able to learn the temporal information in a video sequence, but struggled to perform over longer time intervals. To mitigate this issue, they add an attention mechanism to the original ConvLSTM which allows it to selectively remember the relevant hidden states across time steps, increasing the memory of the model. Together, the attention mechanism and ConvLSTM are capable of finding both temporal and spatial patterns in the signature matrix, and they return feature maps indicating which elements in time and space are important to pay attention to. The feature maps are processed by a convolutional decoder and used to reconstruct the signature matrices. We use the residual of the signature matrices (i.e. the difference between the input and the output signature matrices) to compute a reconstruction score and identify which inputs are anomalies and which are normal. These scores help identify the anomalies, diagnose which signals contributed to the anomaly (i.e. root cause analysis) and the scores also contain information about the severity of the anomaly [2].</p>
<h2 id="paper-3-learning-to-simulate-complex-physics-with-graph-networks-by-sanchez-gonzalez-et-al">Paper 3: Learning to Simulate Complex Physics with Graph Networks by Sanchez-Gonzalez et al.</h2>
<p>DeepMind delivers again with a beautiful paper on how graph networks can form the basis of a <strong>learned simulator</strong> that can model the physics of a variety of systems (i.e. from water to sand). This paper introduces a Graph Network-based Simulator (GNS) that uses a graph-based representation to model the physics of a system of particles. The authors argue that the value of building a learned simulator with artificial intelligence is that it can a) be built much faster, b) can run more efficiently (both in time and in memory allocation) and c) the simulator remains efficient when scaled up to larger systems [4].</p>
<p><img src="/images/2021-07-06-WeekJul5Rdg-fig3.png" alt="Fig 3" title="Figure 3" /> <br />
Figure 3 - Source: [4]</p>
<p>As shown in Figure 3, the overall architecture of DeepMind’s solution relies on a learned simulator that regularly updates the model of the system to accurately recreate the dynamics of a group of particles that represent water or sand or other fluid or rigid systems. In Figure 3, the simulator uses some learned dynamics, \(d_{\theta}\), to update the states of all the particles and simulate their trajectories. The learned dynamics, \(d_{\theta}\), are themselves represented by a set of graph networks [4]. Let’s dive into the structure of \(d_{\theta}\) in more detail.</p>
<p>The dynamics are modeled using 3 key components: an encoder, a processor and a decoder. The encoder takes as input the first state of all of the particles in the system, and encodes that information into a graph in the latent space. This graph is then passed to the processor, which learns message-passing functions that connect all the nodes in the graph together (within a certain radius) and generates a series of output latent graphs that represent the progression of the system over time. The decoder extracts the relevant dynamics information (i.e. accelerations) from the final latent graph and passes them to an update mechanism, which in this case is a simple Euler integration function. In essence, this entire GNS is still using integration to solve the dynamics by stepping through sequential time steps - the only complexity comes in to how the dynamics are learned and represented and applied. The authors argue that this model is very general and so can be used to represent many different types of particle systems [4].</p>
<p>This all still felt a little general to me until I read deeper into the methods section to understand how each piece (the encoder, the processor and the decoder) were actually implemented. The encoder takes in the position, last <em>C</em> velocities and static properties of each particle and assigns one node to each particle. The encoder then learns functions to embed the input data into the nodes and edges of this graph. These encoder embedding functions are MLPs. The graphs embedded by the encoder are then passed to the processor which has a stack of <em>M</em> graphs with identical structure. The processor learns edge and node update functions, which are also MLPs. Finally, the decoder has a learned function (also an MLP) that is applied to each node on the final graph from the processor and outputs the second derivatives for each node to be passed to the update mechanism [4].</p>
<p>I also wanted to point out that I loved the way that this paper was written. Its figures do an excellent job of giving a high-level view of the architecture that is simple without losing too much resolution. The paper is written so that the reader spirals around the architecture, adding successively more detail on each pass. In total, I believe the authors went through the architecture three times, each time adding more information on the philosophy and implementation of the model design. This is something I would really like to do in my own writing.</p>
<h2 id="references">References:</h2>
<p>[1] Zhang, C., Li, S., Zhang, H., & Chen, Y. (2019). VELC: A New Variational AutoEncoder Based Model for Time Series Anomaly Detection. https://arxiv.org/abs/1907.01702v2</p>
<p>[2] Zhang, C., Song, D., Chen, Y., Feng, X., Lumezanu, C., Cheng, W., Ni, J., Zong, B., Chen, H., & Chawla, N. V. (2018). A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data. 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, 1409–1416. https://arxiv.org/abs/1811.08055v1</p>
<p>[3] Shi, X., Chen, Z., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C. (2015). Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In NIPS, 802–810.</p>
<p>[4] Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J., & Battaglia, P. W. (2020). Learning to Simulate Complex Physics with Graph Networks.</p>

<h1 id="matrixexps">Getting Some Intuition for Matrix Exponentials</h1>

<p><em>2021-06-11</em> | <a href="http://sassafras13.github.io/MatrixExps">http://sassafras13.github.io/MatrixExps</a></p>

<p>In a <a href="https://sassafras13.github.io/MLSBasics/">recent post</a>, we talked about some fundamental mathematical operations presented in a robotics textbook written by Murray, Li and Sastry. One of these operations was the <strong>matrix exponential</strong>, which was unfamiliar to me. It turns out that matrix exponentials are a really cool idea that appears in many fields of science and engineering. Thanks to a fabulous video by 3Blue1Brown [1], I am going to present some of the basic concepts behind matrix exponentials and explain why they are useful in robotics when we write down the kinematics and dynamics of a robot.</p>
<p>In MLS, the authors explain that matrix exponentials are useful for “map[ping] a twist into the corresponding screw motion” [2]. Recall that a twist describes an infinitesimal motion, while the screw contains the full magnitude of the motion [2]. Another way to say this is that the matrix exponential can encode a rotation as a function of the direction of rotation and the angle of rotation [2]. I will explain this in more detail in this post.</p>
<h2 id="basic-definition-of-a-matrix-exponential">Basic Definition of a Matrix Exponential</h2>
<p>A matrix exponential is related to the simpler concept of raising the number <em>e</em> to a real number exponent. We can write this operation as an infinite sum [1]:</p>
\[e^x = x^0 + x^1 + \frac{1}{2}x^2 + \frac{1}{6}x^3 + … + \frac{1}{n!}x^n + …\]
<p>Notice that this expression is a <a href="https://sassafras13.github.io/TaylorSeries/">Taylor series</a>. The sum of the Taylor series approaches the value of \(e^x\) [1].</p>
<p>A matrix exponential is, in a sense, an extension of this idea, using matrices as input \(x\) instead of real numbers. For example, I can rewrite the expression above with \(x = \left[ \matrix{1 & 2 \cr 3 & 4} \right]\) [1]:</p>
\[e^{\left[ \matrix{1 & 2 \cr 3 & 4} \right]} = \left[ \matrix{1 & 2 \cr 3 & 4} \right]^0 + \left[ \matrix{1 & 2 \cr 3 & 4} \right]^1 + \frac{1}{2}\left[ \matrix{1 & 2 \cr 3 & 4} \right]^2 + \frac{1}{6}\left[ \matrix{1 & 2 \cr 3 & 4} \right]^3 + …\]
<p>This still makes sense because I can raise a matrix to a whole-number power by multiplying the matrix by itself <em>n</em> times. And in general, this infinite series always converges - in this case, to a fixed matrix [1].</p>
<p>The matrix exponential is useful in mathematics when we are trying to solve a system of differential equations. For example, let’s say I want to find expressions for \(x(t)\), \(y(t)\) and \(z(t)\) given the equations below [1]:</p>
<p>\(\frac{dx}{dt} = a \cdot x(t) + b \cdot y(t) + c \cdot z(t)\) <br />
\(\frac{dy}{dt} = d \cdot x(t) + e \cdot y(t) + f \cdot z(t)\) <br />
\(\frac{dz}{dt} = g \cdot x(t) + h \cdot y(t) + i \cdot z(t)\)</p>
<p>I can use a matrix exponential of the coefficient matrix to build the solution to this system [1]:</p>
\[e^{\left[ \matrix{a & b & c \cr d & e & f \cr g & h & i} \right]t}\]
<p>More generally, if I have a system of equations \(X(t)\) and a matrix of coefficients \(M\) then I can solve the following differential equation written in terms of linear algebra to find expressions for all the functions contained in \(X(t)\) [1]:</p>
\[\frac{d}{dt}X(t) = MX(t)\]
<h2 id="remembering-e-and-its-derivative">Remembering e and Its Derivative</h2>
<p>There is another expression that looks remarkably similar to the one shown just above. Specifically, the derivative of \(e^{rt}\) has the same form as the equation that can be used to solve a system of differential equations [1]:</p>
\[\frac{d}{dt}e^{rt} = re^{rt}\]
<p>(Keep in mind that we need to also take into account initial conditions if we want to find the solution to a specific system of equations [1].)</p>
<p>Bringing it all together, it can be shown (check out the last part of [1]) that the derivative of the matrix exponential follows the same form as the derivative of \(e^{rt}\), that is [1]:</p>
\[\frac{d}{dt} e^{Mt} X_0 = M \big( e^{Mt} X_0 \big)\]
<h2 id="showing-that-the-definition-of-a-matrix-exponential-is-correct">Showing that the Definition of a Matrix Exponential is Correct</h2>
<p>Let’s take a simple example of how the matrix exponential is used to encode rotations, which we mentioned earlier is one of the reasons why they are so useful in robot kinematics. When we find a matrix that correctly encodes a given rotation, we will see that it is typically a skew-symmetric matrix which can be used to convert between screws and twists [2].</p>
<p>Consider the matrix, \(\left[ \matrix{0 & -1 \cr 1 & 0} \right]\). This matrix is a solution to the following system of equations [1]:</p>
\[\frac{d}{dt} \left[ \matrix{x(t) \cr y(t)} \right] = \left[ \matrix{0 & -1 \cr 1 & 0} \right] \left[ \matrix{x(t) \cr y(t)} \right]\]
<p>Geometrically, this expression indicates that the rate of change of \(\left[ \matrix{x(t) \cr y(t)} \right]\) is perpendicular to \(\left[ \matrix{x(t) \cr y(t)} \right]\) - that is, tangent to the circle the point traces out - and has the same magnitude (this is shown in Figure 1) [1].</p>
<p><img src="/images/2021-06-11-MatrixExps-fig1.png" alt="Fig 1" title="Figure 1" /> <br />
Figure 1 - Source [1]</p>
<p>But mathematically, why does this make sense? If we compute the Taylor series of \(e^{\left[ \matrix{0 & -1 \cr 1 & 0} \right]t}\), we will find that each term in the matrix becomes an infinite sum with a specific pattern as follows [1]:</p>
\[e^{\left[ \matrix{0 & -1 \cr 1 & 0} \right] t} = \left[ \matrix{ 1 - \frac{t^2}{2!} + \frac{t^4}{4!} - \frac{t^6}{6!} + … & -t + \frac{t^3}{3!} - \frac{t^5}{5!} + \frac{t^7}{7!} - … \cr t - \frac{t^3}{3!} + \frac{t^5}{5!} - \frac{t^7}{7!} + … & 1 - \frac{t^2}{2!} + \frac{t^4}{4!} - \frac{t^6}{6!} + …} \right]\]
<p>And guess what? Those infinite sums are exactly the Taylor series for the sine and cosine functions [1]:</p>
\[e^{\left[ \matrix{0 & -1 \cr 1 & 0} \right] t} = \left[ \matrix{ \cos(t) & -\sin(t) \cr \sin(t) & \cos(t)} \right]\]
<p>This is exactly the matrix for a counterclockwise rotation by the angle \(t\) [1]. How cool is that? So now we have direct mathematical proof that the matrix exponential of \(\left[ \matrix{0 & -1 \cr 1 & 0} \right] t\) is exactly the rotation matrix for the angle \(t\). (Notice that the generator \(\left[ \matrix{0 & -1 \cr 1 & 0} \right]\) is itself the 90 degree rotation matrix: it keeps the velocity perpendicular to the position.) This is why the matrix exponential is so useful in robot kinematics: the skew-symmetric matrix \(\left[ \matrix{0 & -1 \cr 1 & 0} \right]\) encodes the direction of rotation, and exponentiating it produces the full rotation [1].</p>
<h2 id="references">References:</h2>
<p>[1] “How (and why) to raise e to the power of a matrix, DE6.” 3Blue1Brown. 1 Apr 2021. <a href="https://www.youtube.com/watch?v=O85OWBJ2ayo&list=PLZHQObOWTQDNPOjrT6KVlfJuKtYTftqH6&index=6">https://www.youtube.com/watch?v=O85OWBJ2ayo&list=PLZHQObOWTQDNPOjrT6KVlfJuKtYTftqH6&index=6</a> Visited 11 Jun 2021.</p>
<p>[2] Murray, R., Li, Z., Sastry, S. “A Mathematical Introduction to Robotic Manipulation.” CRC Press. 1994.</p>

<h1 id="spatialvsbodyvelocity">Spatial Velocity vs. Body Velocity</h1>

<p><em>2021-06-10</em> | <a href="http://sassafras13.github.io/SpatialvsBodyVelocity">http://sassafras13.github.io/SpatialvsBodyVelocity</a></p>

<p>As I have been learning the mathematics behind robot kinematics, I have been struggling to understand the difference between a spatial velocity and a body velocity. In this post, I am going to try to write a good definition of each of these velocity types.</p>
<h2 id="basic-definitions">Basic Definitions</h2>
<p>According to Murray, Li and Sastry (MLS), the definitions are as follows [1]:</p>
<ul>
<li>
<p><strong>Spatial velocity</strong>: mathematically defined as \(\hat{V}^s = \dot{R}R^T = \dot{g} g^{-1}\). MLS says that the spatial velocity of a rigid motion is the instantaneous velocity of the body as viewed in the spatial frame. It is the “velocity of a (possibly imaginary) point on the rigid body which is traveling through the origin of the spatial frame at time <em>t</em>” [1]. So if you are standing at the origin of a spatial frame and you measure the velocity of a point attached to the rigid body and going through the origin where you are standing, that is the spatial velocity [1].</p>
</li>
<li>
<p><strong>Body velocity</strong>: mathematically defined as \(\hat{V}^b = R^T \dot{R} = g^{-1} \dot{g}\). MLS says that the body velocity is more “straightforward” to understand; it is the “velocity of the origin of the body coordinate frame relative to the spatial frame, as viewed in the current body frame” [1]. MLS points out that the body velocity is not the velocity of the body relative to the body frame, because that is always zero [1]!</p>
</li>
</ul>
<h2 id="body-velocity">Body Velocity</h2>
<p>These definitions were a little confusing to me at first. Let’s take a step back and think about this with some more intuition. Let’s start with the body velocity, because I can relate that easily to the experience of riding in a car. If you imagine that you are riding in a car at a constant velocity, then the only way you know that you are moving is if you compare your frame of reference (the <strong>body frame</strong>, B) with some fixed frame (the <strong>spatial frame</strong>, A). The body velocity is your instantaneous velocity as measured with respect to this fixed frame, A, but observed from your perspective inside the car’s body frame, B. So when your car’s speedometer tells you that you are moving at 60 MPH, that means that you are moving at 60 MPH with respect to the spatial frame, A, as observed from within the body frame, B [2].</p>
<p>We can define your body velocity in this instance as follows. Let’s assume that your position in the car is a point \(q\) and so your body velocity is written as [2]:</p>
\[v_{q_b} = g_{ab}^{-1} v_{q_a}\]
<p>Where \(g_{ab}\) is the transformation that converts a point in the body frame, B, to the spatial frame, A, and \(v_{q_a}\) is the velocity of the point \(q\) in the spatial frame, A. Another way to write \(v_{q_a}\) is \(v_{q_a} = \dot{g}_{ab} q_{b}\) *1 so we can substitute that in:</p>
\[v_{q_b} = g_{ab}^{-1} \dot{g}_{ab} q_{b}\]
<p>Now I can write my body velocity as a function of <strong>some point \(q_b\) in the body frame that has velocity with respect to the spatial frame, A, as viewed in the body frame, B</strong> [2].</p>
\[v_{q_b} = \hat{V}_{ab}^b q_{b}\]
<p>In other words:</p>
\[\hat{V}_{ab}^b = g_{ab}^{-1} \dot{g}_{ab}\]
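<p>To convince myself of this chain of substitutions, here is a small numerical check. The planar transformation \(g(t)\) below (a rotation plus a translation) is an arbitrary example of my own, not from MLS; the test confirms that mapping the spatial velocity of a body-fixed point back into the body frame gives the same answer as applying \(\hat{V}^b = g^{-1}\dot{g}\) directly:</p>

```python
import numpy as np

def g(t):
    """Assumed example: planar rotation by 0.8*t plus translation [t, 0] (homogeneous)."""
    c, s = np.cos(0.8 * t), np.sin(0.8 * t)
    return np.array([[c, -s, t],
                     [s,  c, 0.],
                     [0., 0., 1.]])

q_b = np.array([0.3, -0.2, 1.0])   # a point fixed in the body frame (homogeneous coords)
t, h = 0.5, 1e-6

g_dot = (g(t + h) - g(t - h)) / (2 * h)   # numerical derivative of g
g_inv = np.linalg.inv(g(t))

v_qa = g_dot @ q_b                  # velocity of the point, seen in the spatial frame
v_qb = g_inv @ v_qa                 # same velocity, mapped back into the body frame
V_body_hat = g_inv @ g_dot          # hat form of the body velocity

print(np.allclose(v_qb, V_body_hat @ q_b))  # True
```

<p>The check works for any body-fixed point \(q_b\), which is exactly why we can factor it out and talk about \(\hat{V}^b\) on its own.</p>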
<h2 id="spatial-velocity">Spatial Velocity</h2>
<p>The difference between the spatial velocity and the body velocity is that, instead of observing your velocity in the car in the body frame, B, you are observing it from the origin of the spatial frame, A. That is, the <strong>spatial velocity measures the velocity of a point fixed to the body frame, B, with respect to the spatial frame, A, and observed from the spatial frame, A</strong> [2]. I can write the spatial velocity as follows [2]:</p>
\[v_{q_a} = \dot{g}_{ab} q_b\]
<p>I can rewrite \(q_b\) to be in terms of \(q_a\) as follows [2]:</p>
\[q_b = g_{ab}^{-1} q_a\]
<p>And therefore I can write the spatial velocity as [2]:</p>
\[v_{q_a} = \dot{g}_{ab} g_{ab}^{-1} q_a\]
<p>And so we see that [2]:</p>
\[\hat{V}_{ab}^s = \dot{g}_{ab} g_{ab}^{-1}\]
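<p>Putting the two definitions side by side, we can also verify numerically that they are related by conjugation with \(g\), i.e. \(\hat{V}^s = g \hat{V}^b g^{-1}\), which follows directly from \(\dot{g}g^{-1} = g(g^{-1}\dot{g})g^{-1}\). The example transformation below is an arbitrary choice of mine for illustration:</p>

```python
import numpy as np

def g(t):
    """Assumed example: planar rotation by 0.8*t plus translation [t, 0] (homogeneous)."""
    c, s = np.cos(0.8 * t), np.sin(0.8 * t)
    return np.array([[c, -s, t],
                     [s,  c, 0.],
                     [0., 0., 1.]])

t, h = 0.5, 1e-6
g_dot = (g(t + h) - g(t - h)) / (2 * h)   # numerical derivative of g
g_inv = np.linalg.inv(g(t))

V_body = g_inv @ g_dot       # body velocity hat:    g^{-1} g_dot
V_spatial = g_dot @ g_inv    # spatial velocity hat: g_dot g^{-1}

# The two differ as matrices, but are related by conjugation with g:
print(np.allclose(V_spatial, g(t) @ V_body @ g_inv))  # True
```

<p>Because this transformation includes a translation, the two hat matrices come out different, which makes the point that the same motion looks different depending on where you observe it from.</p>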
<p>I hope this helps a little bit with getting to grips with the differences between a spatial velocity and a body velocity. I think that it all comes down to where you are observing the velocity from.</p>
<h2 id="footnotes">Footnotes:</h2>
<p>*1 This is by the definition of the <a href="https://sassafras13.github.io/MLSBasics/">rigid motion mapping</a>, that is [2]:</p>
\[q_{a} = g_{ab} q_b\]
<p>And the derivative is [2]:</p>
\[\dot{q}_a = \dot{g}_{ab} q_b\]
<h2 id="references">References:</h2>
<p>[1] Murray, R., Li, Z., Sastry, S. “A Mathematical Introduction to Robotic Manipulation.” CRC Press. 1994.</p>
<p>[2] Sastry, S. “EE106A Discussion 7: Velocities and Adjoints.” EE C106A: Introduction to Robotics, Fall 2019 course notes. UC Berkeley. <a href="https://ucb-ee106.github.io/ee106a-fa19/assets/discussions/D7___Velocities_and_Adjoints.pdf">https://ucb-ee106.github.io/ee106a-fa19/assets/discussions/D7___Velocities_and_Adjoints.pdf</a> Visited 10 Jun 2021.</p>

<h1 id="weekmay31rdg">Week of May 31 Paper Reading</h1>

<p><em>2021-06-02</em> | <a href="http://sassafras13.github.io/weekMay31rdg">http://sassafras13.github.io/weekMay31rdg</a></p>

<p>This week we return to reading some papers about DNA nanotechnology that my PI recommended to me a while ago and I hadn’t read before now.</p>
<h2 id="paper-1-rolling-up-gold-nanoparticle-dressed-dna-origami-into-three-dimensional-plasmonic-chiral-nanostructures-by-shen-et-al-1">Paper 1: Rolling Up Gold Nanoparticle-Dressed DNA Origami into Three-Dimensional Plasmonic Chiral Nanostructures by Shen et al. [1]</h2>
<p>Today I learned that there is a field of nanotechnology focused on arranging metallic nanoparticles in very precise structures so that they can affect light in the visible spectrum. The authors of this paper demonstrate one such approach that uses DNA origami to arrange gold nanoparticles into helical 3D structures. The idea is that DNA origami is a bottom-up manufacturing approach that offers more resolution, precision and flexibility than top-down manufacturing methods like lithography. The authors showed that their structure achieved plasmonic resonance for light with a wavelength of approximately 525 nm [1].</p>
<p>One thing the authors did that was particularly interesting to me was to use a multistage assembly process to make their structure. First, they folded a flat rectangle of DNA origami using one annealing process; then they performed a secondary anneal that attached the gold nanoparticles to the sheet; and finally they performed a tertiary assembly process that rolled the sheet into a tube. The one missing detail was the temperature used for the tertiary assembly process. The secondary anneal ended at room temperature, so perhaps the tube was also rolled at room temperature, but I would be curious to know for sure what the authors did [1].</p>
<h2 id="paper-2-dna-origami-meets-polymers-a-powerful-tool-for-the-design-of-defined-nanostructures-by-hannewald-et-al-2">Paper 2: DNA Origami Meets Polymers: A Powerful Tool for the Design of Defined Nanostructures by Hannewald et al. [2]</h2>
<p>This paper is a detailed review of the current state of the art in building polymers using DNA origami. The authors present different strategies for building polymers out of DNA, and for using DNA as a templating material to build polymers out of other materials. They note that this is an emerging area of research - there are not yet a large number of papers in the field - and they do a good job of highlighting some of the challenges that are preventing the field from really taking off [2].</p>
<p>The challenges range from manufacturing itself to characterizing the results of manufacturing. For example, the authors point out that the steric hindrance of the materials in use can often make robust assembly difficult. Another challenge they highlighted is that the yields of the assembly process are often too low for characterization using gel electrophoresis, NMR or DLS [2].</p>
<h2 id="references">References:</h2>
<p>[1] Shen, X., Song, C., Wang, J., Shi, D., Wang, Z., Liu, N., & Ding, B. (2012). Rolling up gold nanoparticle-dressed dna origami into three-dimensional plasmonic chiral nanostructures. Journal of the American Chemical Society, 134(1), 146–149. https://doi.org/10.1021/ja209861x</p>
<p>[2] Hannewald, N., Winterwerber, P., Zechel, S., Ng, D. Y. W., Hager, M. D., Weil, T., & Schubert, U. S. (2021). DNA Origami Meets Polymers: A Powerful Tool for the Design of Defined Nanostructures. In Angewandte Chemie - International Edition (Vol. 60, Issue 12, pp. 6218–6229). John Wiley and Sons Inc. https://doi.org/10.1002/anie.202005907</p>