Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Highlights of “Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study”

In this blog post, we summarize, using code examples, our recent empirical study on challenges in migrating imperative Deep Learning programs to graph execution.


Systems using Machine Learning (ML), especially Deep Learning (DL), are widely used in today’s society. Dynamic models, whose behaviors are determined by input data, are at the heart of these systems. As datasets grow, efficiency becomes crucial to keeping ML systems responsive. DL frameworks must therefore support simple programming paradigms while efficiently performing sophisticated computations on large datasets.

DL frameworks have historically favored deferred execution for efficiency. Although scalable, development in this style is error-prone, time-consuming, and hard to debug. In response, imperative DL frameworks that encourage eager execution have emerged; they are more natural, less error-prone, and easier to debug. However, eagerly executed imperative DL programs are less efficient and less scalable. A hybrid strategy that aims to combine the “best of both worlds” has therefore arisen and has been adopted by popular DL frameworks.

The challenges of this hybrid approach are not fully understood. We therefore studied 250 open-source projects, 470 carefully reviewed code patches, and 446 bug reports in a data-driven investigation of the difficulties in writing reliable yet performant imperative DL programs. We focus on TensorFlow since it is one of the most popular DL frameworks.

In this blog post, we will explain and exemplify:

  • Deferred execution, eager execution, graphs, and hybridization.
  • Why this study is essential.
  • Challenges developers face when using this hybrid approach.
  • Some best practices, recommendations, and anti-patterns for using this hybridization technique.

What Is Deferred Execution?

Popular DL frameworks have historically embraced deferred execution-style APIs, making DNNs straightforward to execute as symbolic graphs that enable various run-time optimizations.

Graph-Based Execution

Graph execution extracts tensor computations from Python and builds an efficient graph before evaluation. Graphs, or tf.Graph objects, are data structures that contain tf.Operation and tf.Tensor objects. While tf.Operation objects represent units of computation, tf.Tensor objects represent units of data. Graphs can be saved, run, and restored without the original Python code.

Having graphs brings a lot of flexibility since you can use your TensorFlow graph in environments where you don’t have your Python interpreter. Additionally, graphs are easily optimized, allowing transformations such as constant folding.
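
To make the idea of "build first, run later" concrete, here is a minimal pure-Python sketch of a deferred expression graph with a constant-folding pass. It is a toy model for illustration only, not how TensorFlow represents or optimizes graphs; the names Const, Placeholder, Mul, fold_constants, and run are hypothetical.

```python
# Toy model of deferred execution with constant folding.
# Illustration only; this is NOT TensorFlow's implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    """Leaf node holding a value known at graph-construction time."""
    value: float


@dataclass(frozen=True)
class Placeholder:
    """Leaf node whose value is supplied only when the graph is run."""
    name: str


@dataclass(frozen=True)
class Mul:
    """Operation node; constructing it performs no arithmetic."""
    left: object
    right: object


def fold_constants(node):
    """Replace every Mul whose operands are both constants with one Const."""
    if isinstance(node, Mul):
        left, right = fold_constants(node.left), fold_constants(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value * right.value)
        return Mul(left, right)
    return node


def run(node, feed):
    """Evaluate the graph, looking up placeholder values in `feed`."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Placeholder):
        return feed[node.name]
    return run(node.left, feed) * run(node.right, feed)


# Build x * (5.0 * 6.0). Nothing is computed at this point.
graph = Mul(Placeholder("x"), Mul(Const(5.0), Const(6.0)))
optimized = fold_constants(graph)     # the inner product is precomputed once
result = run(optimized, {"x": 2.0})   # 60.0
print(optimized)
print(result)
```

Because the whole computation is available as data before it runs, an optimizer can rewrite it (here, folding 5.0 * 6.0 into 30.0) without changing the result. The same property is what makes deferred code harder to step through with ordinary debugging tools.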

Example 1: TensorFlow deferred-execution style code

# Build a graph
a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b
# Launch graph in a session
sess = tf.Session()
# Evaluate the tensor 'c'
print(sess.run(c))

In Example 1, lines 2-4 build the computational graph. Line 4 does not execute until the Session created on line 6 is run on line 8. (tf.Session belongs to the TensorFlow 1.x API; the examples in this post assume that TensorFlow has been imported as tf.) As the snippet shows, this is not native Python-style code, so it can be cumbersome to use and error-prone. This way of programming does not natively support common imperative program constructs, e.g., iteration.

Problems With Graph-Based Execution

As presented briefly above, this deferred execution style has some drawbacks. While scalable, such development tends to produce DL code that is:

  • Error-prone,
  • Non-intuitive, and
  • Difficult to debug.

Additionally, because graph computation executes statements in a non-imperative order, traditional Software Engineering (SE) tools cannot help troubleshoot bugs.

What Is Imperative-Style Code?

The main goal of imperative programming is to detail how a program should execute step-wise. Therefore, more natural, less error-prone, and easier-to-debug imperative DL frameworks encouraging eager execution have emerged as a response to the deferred style execution. Referencing example 1, with eager execution, line 4 would execute and immediately evaluate tensor c without needing a session.

Benefits and Disadvantages of Imperative Style Code

The benefits of using imperative style code:

  • Easier to debug.
  • Less error-prone.
  • More natural to Python developers.

Though ubiquitous, eagerly-executed imperative DL programs are less efficient and scalable than their deferred execution counterparts.

How Can We Get the “Best of Both Worlds” of These Paradigms?

Hybrid techniques have emerged from attempts to combine the two paradigms and achieve the “best of both worlds.” They have been incorporated into popular DL frameworks, e.g., TorchScript for PyTorch, AutoGraph for TensorFlow, and Hybridize for MXNet. These techniques execute imperative DL programs as static graphs at run-time.

For example, in TensorFlow, AutoGraph can potentially enhance performance when appropriate Python functions are decorated (annotated) with @tf.function, which accepts optional yet influential decorator arguments. Decorating functions with such hybridization Application Programming Interfaces (APIs) can increase the performance of imperative DL code without explicit modification.

Example 2: TensorFlow imperative (OO) DL model code

class SequentialModel(tf.keras.Model):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.flatten = tf.keras.layers.Flatten(input_shape=(28, 28))
        num_layers = 100  # Add many small layers.
        self.my_layers = [tf.keras.layers.Dense(64, activation="relu")
                          for n in range(num_layers)]  # Not `self.layers`: Keras reserves that name.
        self.dropout = tf.keras.layers.Dropout(0.2)
        self.dense_2 = tf.keras.layers.Dense(10)

    @tf.function  # Executes the model as a graph (optional arguments omitted).
    def __call__(self, x):
        x = self.flatten(x)
        for layer in self.my_layers:
            x = layer(x)
        x = self.dropout(x)
        x = self.dense_2(x)
        return x

In Example 2, we have TensorFlow imperative (OO) DL code representing a model for classifying images. On line 11, the @tf.function decorator applies AutoGraph to the __call__() method defined on line 12, with the aim of enhancing performance. The execution of __call__() is traced at run-time, generating an equivalent graph. In this case, we observed a speedup of ~9.22 over five runs. The listing shows how this hybridization technique is applied in the hope of a run-time speedup.

What Are the Drawbacks of This Hybridization Technique?

This hybridization comes with drawbacks that must be understood to use it effectively. This technique:

  • Needs non-trivial, specialized metadata.
  • Exhibits limitations and known issues with native program constructs.
  • Requires subtle considerations to:
    • Make code amenable to safe, accurate, and efficient graph execution.
    • Avoid performance bottlenecks and semantically inequivalent results.

These considerations significantly strain developers since they must manually indicate which functions should be changed and make their code compliant with the underlying execution model translation.

Example 3: Imperative TensorFlow Code Using a Counter

class Model(tf.Module):
    def __init__(self):
        self.v = tf.Variable(0)
        self.counter = 0

    @tf.function
    def __call__(self):
        if self.counter == 0:
            self.counter += 1
            self.v.assign_add(1)
        return self.v

m = Model()
for n in range(3):
    print(m().numpy())

What would you expect line 15 of Example 3 to print?

Actual output:

1
2
3

But, we were expecting:

1
1
1

In Example 3, a model uses a counter to safeguard a variable incrementation. When the loop on line 14 first invokes the model, the counter’s initial value from line 4 is captured during tracing, because __call__() is decorated with tf.function, which enables graph execution. Since the condition is evaluated only during that first invocation, variable v is incremented unconditionally on line 10 each time the model is invoked.

Problems like the one in Example 3 are common when migrating to graph execution. They can lead to unexpected results, or to lower performance when the safeguarded operation is expensive. Hybridization is thus not as straightforward as it may appear, and developers need to understand how to work around its drawbacks.
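
The counter pitfall can be reproduced without TensorFlow. The sketch below is a toy model of "trace once, then replay the recorded operations"; it is not how tf.function is implemented, and the names EagerOps, RecordingOps, run_eagerly, and run_traced are hypothetical. The model body is the same in both modes; only the way variable updates are handled differs.

```python
# Toy model of eager execution vs. trace-once-and-replay.
# Illustration only; this is NOT how tf.function is implemented.

class Variable:
    def __init__(self, value):
        self.value = value


class Model:
    def __init__(self):
        self.v = Variable(0)
        self.counter = 0          # plain Python state, invisible to the "graph"


class EagerOps:
    """Executes variable updates immediately."""
    def assign_add(self, var, delta):
        var.value += delta


class RecordingOps:
    """Records variable updates instead of executing them."""
    def __init__(self):
        self.recorded = []

    def assign_add(self, var, delta):
        self.recorded.append((var, delta))


def call(model, ops):
    """Model body: increment v only while the Python counter is still zero."""
    if model.counter == 0:
        model.counter += 1
        ops.assign_add(model.v, 1)
    return model.v


def run_eagerly(model, times):
    # The Python body, including the `if`, runs on every call.
    return [call(model, EagerOps()).value for _ in range(times)]


def run_traced(model, times):
    recorder = RecordingOps()
    result = call(model, recorder)     # the Python body runs exactly once
    outputs = []
    for _ in range(times):
        for var, delta in recorder.recorded:   # replay the recorded updates
            var.value += delta
        outputs.append(result.value)
    return outputs


eager_outputs = run_eagerly(Model(), 3)    # [1, 1, 1]
traced_outputs = run_traced(Model(), 3)    # [1, 2, 3]
print(eager_outputs, traced_outputs)
```

In the traced run, the `if` and the counter update execute only while recording, so the recorded increment is replayed on every later call. This mirrors why the guard in Example 3 stops protecting the variable once the function runs as a graph.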

What Did We Find?

Quantitative Results

Problem category          Commits   Issues   Total
API Misuse                     23       30      53
TensorFlow Bug                  4       18      22
Exposed variable state          1        1       2
Compilation error               1        0       1
Numerical errors                1        0       1
Segmentation fault              1        0       1
Table 1: Discovered top-level problem categories

Table 1 shows how the challenges were divided into (top-level) problem categories, while Figure 1 shows a hierarchical classification of the 280 tf.function-related challenges found in our subjects, at varying levels of detail. Each challenge is identified by its problem category name, followed by its count. Abstract categories are those that merely group other categories.

The top-level categories include performance, misuse of the API, and incompatibility between eager and deferred execution modes, where tf.function is utilized in a situation where graph conversion is not supported. Other problem categories include fixing unresolved issues with TensorFlow’s tf.function and “other,” which involves refactoring, general cleanup, and syntax corrections. “Unknown” refers to instances where we could not determine the problem category without additional domain expertise or developer input. Code changes involving tf.function appearing in tests were categorized as “Test.” Debuggability refers to circumstances in which utilizing tf.function to boost DL code performance may, in turn, make it harder for developers to debug it quickly.

Even though the other categories have smaller counts, they may still have a significant impact. For example, exposed variable state occurs when saving (exposed) program state (variables) is problematic during tf.function conversion at run-time. Numerical errors involve possible numeric overflow. Compilation errors arise when compiling a tf.function with AutoGraph fails. A segmentation fault is when using tf.function causes a program crash. While compilation errors, numerical errors, and segmentation faults may be considered symptoms, we focus on tf.function client usage; these categories represent problems from a client’s perspective.

Figure 1: Discovered problem categories (hierarchical).

In summary, we found that hybridization is prone to API Misuse, can result in performance degradation—opposite of its intention—and has limited application due to execution mode incompatibility. Specifically, our findings are:

  1. Performance was the most prominent problem category encompassing tf.function usage.
  2. Despite the intent to improve performance, tf.function causes performance degradation in ~7% of cases.
  3. Adding @tf.function fixed only 55% of imperative DL code performance problems. The remaining were due to using tf.function.
  4. Performance fixes entailed altering developer-supplied tf.function arguments at a rate of 25%.
  5. API misuse was the second-largest problem category.
  6. API misuse was caused by developers not understanding hybridization APIs at a rate of 37%.
  7. To fix API misuse, tf.function was removed 28% of the time. In 46% of these, hybridization was abandoned due to API confusion, with 62% causing run-time errors.
  8. Execution mode incompatibility was the third-largest problem category.
  9. Incompatibility problems led to run-time errors or unexpected results 81% of the time.
  10. TensorFlow bugs made up ~8% of problems. Of these, 9% involved deadlocks.
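
As a quick sanity check, the share of each Table 1 category among all challenges can be recomputed from the counts. The sketch below assumes that the last number in each Table 1 row is the row total and that the 280 tf.function-related challenges mentioned above are the denominator; both are assumptions made for illustration.

```python
# Shares of the Table 1 categories among all challenges.
# Assumptions: the last number in each Table 1 row is its total,
# and the denominator is the 280 challenges reported in the text.
TOTAL_CHALLENGES = 280

table_1_totals = {
    "API Misuse": 53,
    "TensorFlow Bug": 22,
    "Exposed variable state": 2,
    "Compilation error": 1,
    "Numerical errors": 1,
    "Segmentation fault": 1,
}

shares = {
    name: 100 * count / TOTAL_CHALLENGES
    for name, count in table_1_totals.items()
}

for name, share in shares.items():
    print(f"{name:<24}{share:5.1f}%")
```

Under these assumptions, TensorFlow bugs come out at about 7.9% of the challenges, in line with the ~8% reported in finding 10.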

Issues vs. Commits

Figure 2: Top-level problem category comparison.

Figure 2 illustrates the various sources of problem categories, including commits (blue bars) and GitHub issues (red bars). Performance, the most prominent problem category, was 2/3 more likely to be addressed in commits than in issues. The reason may be that enhancing performance typically requires a code change, which can be benchmarked.

In contrast, issues were 2/3 more likely than commits to be categorized with incompatibility. Incompatibility is more challenging to quantify, often resulting in unexpected behavior or run-time errors. Therefore, developers may be more likely to seek external assistance. Notably, GitHub issues had 80% more TensorFlow Bug problems than commits. Finally, all “Test” and “Unknown” challenges were found in commits.

Qualitative Results

This section highlights the bug patterns with examples, summarizes causes, symptoms, and fixes, and proposes preliminary best practices and anti-patterns.

Commit 02a3f297 in DDPG-tf2

-  @tf.function
+  @tf.function(input_signature=[
+    tf.TensorSpec(shape=(None, self.num_states), dtype=tf.float32),
+    tf.TensorSpec(shape=(None, self.num_actions), dtype=tf.float32),
+    tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
+    tf.TensorSpec(shape=(None, self.num_states), dtype=tf.float32),])
   def update_weights(s, a, r, sn): # ...

The commit listed above addresses an underspecified input signature; input_signature was one of the most used tf.function parameters we observed. Lines 2-6 fix a performance regression by adding an input signature to a weight distribution tf.function, avoiding unnecessary retracing. Such retracing may slow down model training significantly. The sequence of tf.TensorSpecs specifies the intended shapes and data types (dtypes) of the tensors supplied to update_weights().

From this, we proposed the following best practice:

If possible, supply an input signature argument to tf.function with the intended shapes and types of any input tensors to avert retracing. This practice is similar to providing type annotations to variables in dynamic languages to assist with type inference.
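
To illustrate why a declared signature averts retracing, the sketch below models a trace cache in pure Python. It is a toy model, not TensorFlow's caching logic; ToyFunction, shape_of, and the tuple-based signature format are hypothetical. In TensorFlow itself, the signature would be given as tf.TensorSpec objects, as in the commit above.

```python
# Toy model of a trace cache keyed by input shapes.
# Illustration only; this is NOT TensorFlow's implementation.

def shape_of(batch):
    """Shape of a rectangular list of rows, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    return (len(batch), len(batch[0]))


class ToyFunction:
    """Caches one 'graph' per cache key and counts how often it traces."""

    def __init__(self, fn, input_signature=None):
        self.fn = fn
        self.input_signature = input_signature  # e.g. [(None, 2)]; None = any size
        self.graphs = {}
        self.trace_count = 0

    def _cache_key(self, shapes):
        if self.input_signature is None:
            return tuple(shapes)  # every new concrete shape is a new key
        for shape, spec in zip(shapes, self.input_signature):
            compatible = len(shape) == len(spec) and all(
                want is None or want == got for got, want in zip(shape, spec)
            )
            if not compatible:
                raise ValueError(f"shape {shape} does not match {spec}")
        return tuple(self.input_signature)  # one key for all compatible inputs

    def __call__(self, *batches):
        key = self._cache_key([shape_of(b) for b in batches])
        if key not in self.graphs:
            self.trace_count += 1       # stands in for an expensive trace
            self.graphs[key] = self.fn  # stands in for the compiled graph
        return self.graphs[key](*batches)


def row_sums(batch):
    return [sum(row) for row in batch]


# Three batches with the same row width but different batch sizes.
batches = [[[1, 2]], [[1, 2], [3, 4]], [[1, 2], [3, 4], [5, 6]]]

unspecified = ToyFunction(row_sums)
specified = ToyFunction(row_sums, input_signature=[(None, 2)])
for batch in batches:
    unspecified(batch)
    specified(batch)

print(unspecified.trace_count)  # 3: one trace per batch size
print(specified.trace_count)    # 1: the declared signature covers every batch
```

Leaving the batch dimension open (None) lets one trace serve every batch size, which is the role the (None, ...) shapes play in the tf.TensorSpecs of the commit.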

We repeatedly saw developers struggle to use this technology. Some users even distrusted it, stating:

“it does far too much hidden magic”

2020. Added jitted ncon. Pull request #623. google/TensorNetwork. Xanadu. (May 26, 2020). Retrieved 01/10/2022 from https://git.io/J9cMx.

Additionally, we saw developers who know the advantages of using tf.function but decided against it because of “instability:”

“removing the decorator is not ideal, but stability is more important than the [speedup] we [would] get with [it]”

2019. FIX: dense image warp bug. (April 17, 2019). Retrieved 01/13/2022 from

We also observed a GitHub issue in which a developer had a problem with hybridizing inner functions. The developer applied tf.function to lower-level functions, but the further it was “moved inside,” the slower the top-level function became. The fix involved applying tf.function at the top level, making it run ~25-40% faster. The root cause was that nested functions are retraced multiple times: because their scope is not publicly visible, their graphs cannot be cached. Function nesting is a common modularity mechanism in Python, but TensorFlow documentation currently does not mention this as an issue. From this GitHub issue, we propose the following anti-pattern:

Hybridizing nested functions may cause performance degradation. Sacrifice modularity by hybridizing top-level functions or refactoring at the top.
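
One way to picture this anti-pattern is with a toy decorator that "traces" a function on its first call and reuses the trace afterwards. The sketch below is a pure-Python model of that reading of the root cause, not TensorFlow code; toy_hybridize and the function names are hypothetical.

```python
# Toy model: a hybridized nested function loses its trace on every outer call.
# Illustration only; this is NOT TensorFlow's implementation.

trace_log = []


def toy_hybridize(fn):
    """Wrap `fn` so its first call 'traces' it; later calls reuse the trace."""
    traced = False

    def wrapper(*args):
        nonlocal traced
        if not traced:
            trace_log.append(fn.__name__)   # stands in for an expensive trace
            traced = True
        return fn(*args)

    return wrapper


def outer_with_nested(x):
    @toy_hybridize          # re-applied on every outer call: no trace survives
    def inner_nested(y):
        return y * 2

    return inner_nested(x) + 1


def _double(y):
    return y * 2


@toy_hybridize              # applied once, when the module is loaded
def outer_top_level(x):
    return _double(x) + 1


for i in range(3):
    outer_with_nested(i)
    outer_top_level(i)

print(trace_log.count("inner_nested"))     # 3: traced on every call
print(trace_log.count("outer_top_level"))  # 1: traced once, then reused
```

Both variants compute the same values, but only the top-level one keeps its trace between calls, which matches the direction of the fix described in the GitHub issue.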

There are more anti-patterns and best practices in our paper. Please check it out to see them all.


We studied hybridization bugs and challenges in migrating imperative DL code to graph execution.

We found that:

  • Despite the purpose of hybridization, care must be taken to avoid performance degradation.
  • Not all imperative DL code is amenable to graph conversion.
  • Developers must be aware of which code is running in graph mode and which is not.

This work contributes to understanding the development difficulties associated with converting imperative DL code to graph execution via hybridization.

What Does This Mean?

We hope that our results will assist API designers, tool developers, Deep Learning practitioners, and educators in using and teaching how to write reliable yet performant (imperative) Deep Learning programs. Further examining other developer resources, like Stack Overflow, may also help to uncover further challenges developers and data scientists face when using hybridization. Finally, future research is necessary to discover ways to improve the state-of-the-art in hybrid imperative Deep Learning model programming.

More Details

There are more details of this work in the paper! Please check it out if you are interested.
