DEV Community

Nat Z
Basic Datadog concepts for the busy TypeScript developer

At one point in my career, I joined a project that relied heavily on Datadog for monitoring and error tracking. The project was complex, and the Datadog setup that came with it was equally so. At first, I thought I could learn the platform intuitively, but I quickly realized that would not be enough, so I turned to the official docs. Unfortunately, I just as quickly realized that while the documentation is extensive, it lacks practical guidance for a quick ramp-up. This article aims to provide a concise guide for those who find themselves in a similar situation, needing to navigate Datadog's features effectively and analyze errors emerging on a project. It is also worth saying that this article is NOT a comprehensive guide to setting up Datadog, but rather a quick reference for those who need to get up to speed quickly on an existing project.

This article uses TypeScript for its code examples. Although the concepts generally apply to other setups, some caveats are Node.js/TypeScript specific, so bear that in mind if you are using a different language or framework.

Basic concepts

The first few concepts we are often faced with when working with Datadog are instrumentation, logs, APM (Application Performance Monitoring), traces, and spans. I will first try to briefly explain each of these concepts so that we can later understand how they relate to each other.

Instrumentation

The first term I want to explain is instrumentation, as it will get tossed around a lot.

The definition from the official docs states: 'Instrumentation is the process of adding code to your application to capture and report observability data.' In layman's terms, it means wiring a tool like Datadog into your actual application so it can monitor it. Think of it as embedding Datadog in your application in order to collect data about its performance and behavior. Once this is done, Datadog can collect data from your application, which can then be analyzed to gain insights into its performance, identify bottlenecks, and troubleshoot issues.

Services and Resources

A service in Datadog represents an organizational unit, for example a group of URL endpoints (API services), queries in a database service, or jobs executed periodically (cron services). A resource represents a single instrumented web endpoint, database query, or background job.

Logs

Logs are exactly what you think they are: the logs your application generates. Unlike some of the data Datadog collects automatically (CPU usage, memory usage, etc.), logs need to be explicitly sent from your application to Datadog. In other words, you emit a log somewhere in your code, and Datadog picks it up through your log collection setup. For Node.js applications, this typically means using a supported logging library such as Winston.

Bear in mind that logs are highly customizable, so you can log anything you want, but you should be careful not to log sensitive data. Make sure your logging payload is in JSON format, as that enables Datadog to automatically parse log messages and extract log attributes on their platform. This will let you analyze your logs effectively inside their Log Explorer.
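To make that concrete, here is a minimal sketch of building such a JSON payload by hand. The field names (like user_id) are illustrative; in practice you would emit this through a logging library such as Winston rather than constructing it yourself:

```typescript
// A sketch of a JSON-formatted log payload (field names are illustrative).
// In practice you would emit this through a logging library such as Winston;
// it is built by hand here only to show the shape Datadog parses.
function buildLogPayload(
  message: string,
  extra: Record<string, unknown>
): string {
  return JSON.stringify({
    status: "info",
    message,
    timestamp: new Date().toISOString(),
    // Every extra key-value pair becomes a searchable custom attribute.
    ...extra,
  });
}

console.log(buildLogPayload("order processed", { user_id: 12345 }));
```

Because every top-level JSON key is parsed into an attribute, a payload like this is immediately searchable as @user_id:12345 in the Log Explorer.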

The Log Explorer is the main interface for searching, filtering, and analyzing logs within Datadog, and can be accessed via the Logs tab on the main Datadog interface.

Log explorer overview

When you click on any of the log entries in the Log Explorer, you will see the side panel with full log details, including all the attributes associated with that log entry. This is useful for drilling down into specific logs to understand their context and content. Here in the detailed view, alongside fields and attributes, you will also see a tab called “Trace” — make a note of this as it will be important later when we discuss correlating logs and APM traces.

Individual log details panel

When you go to the Log Explorer, you can use the search bar to filter logs based on various attributes. Attributes like env, service, host, source, status, trace_id, and message are so-called 'reserved attributes', and they are automatically ingested by Datadog's collection pipeline.

Searching for reserved attributes in logs

You can also create custom attributes by including additional key-value pairs in your JSON-formatted log payloads; Datadog will parse any JSON field you include in your log message and extract it as a custom attribute. Prefix these attributes with an '@' symbol in your search queries, e.g. @user_id:12345.

Searching for custom attributes

If you forget which form to use, look at the Facet panel on the left side of the Trace/Log Explorer. If a field is listed there, it usually doesn't need the '@'. You can find more info on attribute search here.

APM

APM stands for Application Performance Monitoring, and it is a way to monitor and manage the performance and availability of software applications. It begins with instrumentation, followed by setting up and running the Datadog Agent. Your application connects to the Agent through the Datadog library running within it, and the Agent forwards the data it receives from your app (such as response times, error rates, and throughput) to the Datadog platform. This data is then visualized in the Datadog APM dashboard, where you can analyze it to identify performance bottlenecks, errors, and other issues affecting your application's performance. Just as logs have the Log Explorer, APM data has its own explorer.

APM explorer

In Datadog APM, the main units used to analyze an application’s performance are traces and spans.

The trace definition from the official documentation states: 'A trace is used to track the time spent by an application processing a request and the status of this request.' This definition can be somewhat misleading, though, as a trace represents more than just the time spent processing a request.

A trace (performance trace) represents the entire journey, or lifespan, of a request as it travels through the various services and components of your application. It provides a high-level overview of how the request was processed, including the time taken at each step. We can say that it tracks how the code behind an endpoint performs when that endpoint is hit. This is strictly related to the APM tab in the Datadog console. There may or may not be logs generated during the performance trace, depending on whether the code being executed during the trace generates logs.

To view a list of traces for a service, go to the APM explorer, choose a service from the offered list and go to the Traces tab. By clicking any of the traces you will be able to see the trace details.

APM traces for the store-backend service

A trace is a collection of spans. Each span represents a single unit of work (e.g., a function call or a SQL query) within the transaction as it flows through the various services and components of a distributed system.

A span can be viewed as a single operation within a trace, representing a specific unit of work or a step in the overall process, and as such can provide information about your application performance in a step-by-step manner.
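As a rough mental model (this is an illustrative data shape, not the dd-trace API), a trace can be pictured as a small tree of spans sharing one trace ID:

```typescript
// A simplified mental model (not the dd-trace API): a trace is a tree of
// spans sharing one trace ID, where each span covers one unit of work.
interface SpanSketch {
  traceId: string;
  spanId: string;
  parentId?: string; // absent on the root span
  name: string;      // e.g. "express.request" or "pg.query"
  durationMs: number;
}

const trace: SpanSketch[] = [
  { traceId: "t1", spanId: "s1", name: "express.request", durationMs: 120 },
  { traceId: "t1", spanId: "s2", parentId: "s1", name: "pg.query", durationMs: 40 },
];

// The root span has no parent; every other span nests under some parent.
const root = trace.find((s) => s.parentId === undefined);
console.log(root?.name); // "express.request"
```

The span and operation names here are made up for illustration; the shape of the tree is what matters.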

Both traces and spans have unique identifiers (trace ID and span ID) that are injected as HTTP headers and enable effective tracking and correlation of requests across different services.
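As an illustration, the propagation headers carrying those IDs can be sketched like this (a simplified sketch following Datadog's header naming; in practice the tracer injects them for you):

```typescript
// An illustrative sketch of the correlation headers dd-trace injects into
// outgoing HTTP requests (header names follow Datadog's propagation style;
// in practice the tracer adds them automatically).
function buildDatadogHeaders(
  traceId: string,
  parentSpanId: string
): Record<string, string> {
  return {
    "x-datadog-trace-id": traceId,
    "x-datadog-parent-id": parentSpanId,
  };
}

console.log(buildDatadogHeaders("1234567890", "987654321"));
```

A downstream service reading these headers can attach its own spans to the same trace, which is what makes cross-service correlation possible.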

Spans have tags and attributes. These two terms often get used interchangeably, but the docs describe the distinction as follows:

Span tags provide context related to the span. For instance, host or container tags on the infrastructure the service is running on. You can find the tags associated with a span in the Infrastructure tab of the span details section in the detailed trace view.

Span tags

More info on span tags and attributes can be found here and here.

Span attributes are the content of the span, collected with automatic or manual instrumentation in the application, and they can be found under the Overview tab of the span details section in the detailed trace view.

Span attributes

To search on a specific span attribute among traces, you must prepend an '@' character to the attribute key, while tags do not need one.

The wonderful thing about spans is that you can manually add tags to the spans in your code, enriching them and providing more context for later analysis. This is done through the tracer object provided by the dd-trace library — more on that later.

You can find more information on adding tags via this link.

Error Tracking

Error tracking happens automatically once an application is instrumented. As with other telemetry, it is important to distinguish between error logs and error spans. Error logs are log entries generated by our application inside the code blocks designed to catch errors (try/catch blocks, error-handling middleware, etc.). Error spans are spans that Datadog creates when an error occurs during the execution of our application code.

Error Tracking through APM

When an instrumented application throws an error, the Datadog tracer captures the error details and creates an error span.

To provide information about the error, the span must contain the following attributes: error.stack, error.message, and error.type (or error.kind). Otherwise, you will encounter incomplete spans with missing data, like the following:

Error spans missing message and stack

If a span is missing these attributes, we can add them manually in our code using the tracer object provided by the dd-trace library.

It is important to note that error.stack must contain a valid stack trace, with at least two lines and at least one meaningful frame. If this isn't the case, you can provide a default stack trace manually to the span in your code using the tracer object.
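To illustrate the 'at least two lines' rule, here is a hypothetical helper (not part of dd-trace) that falls back to a minimal synthetic stack when a real one is unavailable:

```typescript
// Hypothetical helper: return a usable stack trace string, falling back to
// a minimal two-line synthetic stack if the error carries none.
function usableStack(err: Error): string {
  if (err.stack && err.stack.split("\n").length >= 2) {
    return err.stack;
  }
  // Fallback: at least two lines, with one (placeholder) frame.
  return `${err.name}: ${err.message}\n    at <unknown frame>`;
}

const bare = new Error("something failed");
bare.stack = undefined; // simulate an error with no stack trace
console.log(usableStack(bare));
```

The resulting string could then be set on the span as error.stack so the error span is not rejected for having a trivial stack.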

Error tracking through Logs

Error logs overview

To enable Error Tracking, logs must include the following attributes: an error.kind or an error.stack, a 'service' attribute naming the service from which the log originates, and a 'status' level of ERROR, CRITICAL, ALERT, or EMERGENCY. By default, integration pipelines attempt to remap default logging-library parameters to those specific attributes and parse stack traces or tracebacks to automatically extract error.message and error.type (error.kind). An example for Node.js can be found here.
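Putting those requirements together, a sketch of an error log payload that satisfies Error Tracking might look like this (the service name is illustrative, and integration pipelines normally handle this remapping for you):

```typescript
// Sketch of an error log payload carrying the attributes Error Tracking
// needs: a status of "error", a service name, and error.kind/error.stack.
function buildErrorLogPayload(err: Error, service: string): string {
  return JSON.stringify({
    status: "error",
    service,
    message: err.message,
    error: {
      kind: err.name,       // maps to error.kind
      message: err.message, // maps to error.message
      stack: err.stack,     // maps to error.stack
    },
  });
}

try {
  throw new TypeError("unexpected payload shape");
} catch (err) {
  if (err instanceof Error) {
    console.log(buildErrorLogPayload(err, "store-backend"));
  }
}
```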

Having all the necessary properties for spans and logs also helps Datadog Error Tracking connect spans and logs. If they are connected correctly, then we can easily switch between a log entry and the trace that led to that log entry by clicking the ‘Trace’ tab in the log details panel inside the Log Explorer.

Error log details

The same goes for error spans inside traces — we can click the log tab in the span details panel to see the corresponding log entries for that error span. This provides a seamless way to navigate between logs and traces when investigating errors in our application.

Error trace span

Correlating Logs and APM Traces

Some of the features explained in this last section were already mentioned above, but I felt it was important to dedicate a separate section to it in order to drive some points home and also to provide some code examples now that we’ve covered the basic concepts.

To correlate the logs in your code with APM traces and spans, you need to format your log payload as a JSON object, because that is the only way the payload will be automatically extracted by Datadog and parsed into searchable attributes.

The full Datadog setup can entail Docker configuration, environment variables, and much more, but for the scope of this article we will stick to configuring the tracing library (dd-trace). This is something you, as a developer, have immediate insight into once you start working on a codebase, and it is likely something you have the power to tweak if you feel your instrumentation setup could be improved to capture the desired data from the application.

We first want to check that the instrumentation is set up correctly; otherwise Datadog will not be able to capture the necessary data. For a TypeScript project, it starts with the file where we declare our tracer provided by the dd-trace library.

It needs to look something like this:

// This line must come before importing any instrumented module, i.e.
// before you import any other module or library you want to use in your app.

import tracer from "dd-trace";

tracer.init({
  logInjection: true,
});

export default tracer;

Here, we have initialized the tracer with some options.

The logInjection?: boolean property in the TracerOptions interface (if you’re curious, you’ll find it inside node_modules/dd-trace/index.d.ts) is a configuration flag that enables trace ID injection into log records. This feature creates a direct correlation between application traces and log entries.

When logInjection is set to true, the Datadog tracer automatically injects trace identifiers (such as the trace ID and span ID) into your application's log records. This enables you to easily jump between trace data and the corresponding log messages that occurred during the same request or operation.

Just a note: logInjection might not be strictly needed if you use a supported library like Winston, but I'd rather be explicit and future-proof.
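For reference, with injection enabled each log record ends up carrying a dd object alongside your own fields. A sketch of the resulting shape (the IDs are illustrative):

```typescript
// Sketch of a log record after trace ID injection (IDs are illustrative).
// The "dd" object is what lets Datadog link this log to its trace.
const injectedRecord = {
  message: "order processed",
  status: "info",
  dd: {
    trace_id: "1234567890123456789",
    span_id: "9876543210987654321",
  },
};

console.log(JSON.stringify(injectedRecord));
```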

Next, we will continue with the file where your server is initiated, most often src/index.ts.

Import tracer from "./tracer" before you import any other module or library in your application. This is very important; otherwise the instrumentation won't work properly.

import tracer from "./tracer";

Now you can configure the tracer to suit your needs with the help of the .use() method, which configures specific plugins that provide automatic instrumentation for popular libraries. This means you can configure how Datadog tracks the usage of libraries like Node's http module (again, if you're really curious, you can find the definition inside the node_modules/dd-trace/index.d.ts file).

TypeScript Examples

Configuring a Specific Plugin

Suppose you want to control which status codes are marked as errors and add custom tags to every request handled by the http module:

// Configure the http plugin to keep certain status codes (anything below
// 400, plus 404) from being marked as errors, and to set a custom tag on
// the client span for each outgoing request.

tracer.use("http", {
  validateStatus: (code: number) => code < 400 || code === 404,
  headers: ["User-Agent", "Referer"], // Capture specific headers as span tags
  client: {
    hooks: {
      // The span parameter may be undefined, so guard before tagging.
      request: (span, req, res) => {
        span?.setTag("customTag", "custom tag value");
      },
    },
  },
});

Disabling Instrumentation

If you find that a specific library (like redis) is generating too many spans, you can disable it.

// Disable Redis auto-instrumentation
tracer.use("redis", {
  enabled: false,
});

You can find more examples in the official documentation for plugin configuration here.

Adding tags

As we’ve seen so far, the current span can be accessed inside the request hook through the span parameter.

Another way to add tags is through the tracer object itself, using the tracer.scope().active() method, which returns the currently active span.

import tracer from "./tracer";

// scope().active() returns the currently active span, or null when called
// outside of a traced context, so always guard before tagging.
const span = tracer.scope().active();

// For our error span, we can make sure we capture the necessary attributes:

span?.setTag("error.message", "Your error message here");

// Or, if you have an actual error object:

try {
  // some code that may throw an error
} catch (error) {
  const activeSpan = tracer.scope().active();
  if (activeSpan && error instanceof Error) {
    // Setting the 'error' tag to an Error object lets dd-trace derive the
    // error.message, error.type, and error.stack attributes from it.
    activeSpan.setTag("error", error);
  }
}

I hope this article has given you a good high-level overview of some of the main Datadog concepts that you can use in your everyday coding life. If you like it, show some love please 🙂🙏
