0:00
Hello. In this lesson,
we will discuss troubleshooting the Apigee platform.
When API requests fail or have higher-than-usual latency,
it is important to quickly determine the location of the problem.
The best place to start is with the analytics reports provided by the Edge UI.
Remember that Apigee Edge sits between clients and backend systems.
So, analytics reports give the first indication of whether the issue is
within the boundaries of Apigee and
which stakeholders should be involved in troubleshooting.
If you believe that the problem lies within an Apigee component,
focus on the critical path.
Remember that the router, message processor,
and Cassandra are the services that support
runtime API traffic processing. So start with those.
Ensure that the services are up and the logs are free of errors and exceptions.
If you are confident that those services are healthy,
work outward from them through upstream and downstream network components.
This can include firewalls and load balancers
between Apigee and both clients and backends.
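To make that first check concrete, here is a minimal health-sweep sketch in Python for a Private Cloud installation. The ports and endpoint paths shown are the typical component defaults (Management Server on 8080, router on 8081, message processor on 8082) and may differ in your topology.

```python
# A minimal health sweep across the Edge critical path, assuming a
# Private Cloud install with default component ports and health paths.
import urllib.request

CHECKS = {
    "management-server": (8080, "/v1/servers/self/up"),
    "router":            (8081, "/v1/servers/self/reachable"),
    "message-processor": (8082, "/v1/servers/self/up"),
}

def healthy(host, port, path, timeout=5):
    """True if the component's health endpoint answers 'true'."""
    try:
        url = f"http://{host}:{port}{path}"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip().lower() == "true"
    except OSError:
        return False  # connection refused or timed out: treat as down

for name, (port, path) in CHECKS.items():
    print(f"{name:20s}", "UP" if healthy("localhost", port, path) else "DOWN/UNREACHABLE")
```

Cassandra has no HTTP health endpoint of this kind; check it with nodetool status on a Cassandra node instead. If everything reports up, move on to the component logs, and only then work outward to the network path.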
High latency can be caused by a number of factors.
Here are some items to consider.
Northbound and southbound connectivity problems,
particularly intermittent connectivity, can
influence how long API requests take to complete.
Latency to backends can impact overall response latency when those backends are
geographically distant, reached over high-latency networks,
or simply slow to process requests.
Large payloads take longer to transport across networks even when round-trip time is low.
Poorly optimized API proxies will take longer to execute,
and you should engage in capacity planning to ensure that you have
enough host resources allocated to handle your expected traffic peak.
If routers or message processors are down or unavailable,
it will impact your overall pool of
capacity and potentially slow down request processing.
Similarly, if Cassandra nodes are down or unavailable,
API policies that rely on Cassandra may run slower.
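One way to separate backend latency from time spent inside Apigee is to time the backend directly from a host close to the message processors and compare the result with the end-to-end latency reported by analytics. A rough sketch; the backend URL is a placeholder:

```python
# Rough, direct timing of a backend call, to compare against the
# end-to-end latency observed through the proxy. BACKEND_URL is a
# placeholder; run this from a host near the message processors.
import time
import urllib.request

BACKEND_URL = "https://backend.example.com/resource"

samples = []
for _ in range(10):
    start = time.monotonic()
    with urllib.request.urlopen(BACKEND_URL, timeout=30) as resp:
        resp.read()  # include payload transfer in the measurement
    samples.append(time.monotonic() - start)

samples.sort()
print(f"min={samples[0]:.3f}s  median={samples[len(samples) // 2]:.3f}s  max={samples[-1]:.3f}s")
```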
The Edge UI offers a trace tool that can be used to see
live API requests and understand where time is being spent during processing.
Here, you can see an API request along with each flow step.
Above the phase details box,
there is a bar that shows the duration of each step.
If you are experiencing unusually high latency,
you will usually see a single step taking
a majority of the time spent processing a request.
This can indicate where the problem is.
For instance, if a backend call is taking most of your processing time,
that would indicate that the problem is somewhere in
the backend network, infrastructure or code.
If a JavaScript callout is taking most of your processing time,
it could indicate poorly optimized code or network calls made from within the script.
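Trace sessions can also be driven from the command line through the management API's debug session resource, which is handy when the UI is unavailable. A sketch, with placeholder organization, environment, proxy, revision, and credentials; the host shown is the public Edge cloud endpoint:

```python
# Start a debug (trace) session via the Edge management API and list the
# captured transactions. ORG, ENV, API, REV, and the credentials are
# placeholders.
import base64
import json
import urllib.request

HOST = "https://api.enterprise.apigee.com"
ORG, ENV, API, REV = "my-org", "test", "my-proxy", "1"
AUTH = base64.b64encode(b"user@example.com:password").decode()

def call(method, path):
    req = urllib.request.Request(HOST + path, method=method)
    req.add_header("Authorization", "Basic " + AUTH)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode()

base = (f"/v1/organizations/{ORG}/environments/{ENV}"
        f"/apis/{API}/revisions/{REV}/debugsessions")
call("POST", base + "?session=latency-check&timeout=300")  # start tracing
# ...send some traffic through the proxy, then fetch what was captured:
print(json.loads(call("GET", base + "/latency-check/data")))
```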
Now, let's discuss how to troubleshoot Apigee platform components.
If you are seeing incorrect or out-of-date analytics data,
you will need to troubleshoot analytics data flow.
Look at the following items if you suspect that analytics data is not being
successfully ingested: connectivity between message processors and Qpidd,
Qpidd availability, queue depth and disk utilization,
availability of the Qpid Server process,
which is responsible for ingesting data that arrives in Qpidd queues,
PostgreSQL availability and disk utilization,
problems capturing or processing customer analytics variables,
connectivity between the Management Server and PostgreSQL,
and finally, too much raw data or disks that are too slow, causing reports to time out.
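A quick way to spot-check this pipeline is to probe the Qpid Server and Postgres Server health endpoints and look at the broker's queue depths. In this sketch, ports 8083 and 8084 are the usual defaults, and qpid-stat comes from the qpid-tools package; all of these may differ in your installation:

```python
# Spot checks for the analytics ingest path: Qpid Server and Postgres
# Server health endpoints (8083 and 8084 are common defaults), plus the
# broker's queue depths via qpid-stat from the qpid-tools package.
import subprocess
import urllib.request

def component_up(port):
    try:
        url = f"http://localhost:{port}/v1/servers/self/up"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode().strip().lower() == "true"
    except OSError:
        return False

print("qpid-server up:    ", component_up(8083))
print("postgres-server up:", component_up(8084))

# Steadily growing queue depths suggest the Qpid Server process is not
# draining the qpidd queues into PostgreSQL.
print(subprocess.run(["qpid-stat", "-q"], capture_output=True, text=True).stdout)
```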
Deployment failures can be caused by a number of problems.
Here are some areas to investigate: API proxy implementation errors,
unusually large API proxy bundles,
which can cause timeouts when deploying proxy code on
message processors, Management Server availability,
connectivity between the Management Server and Zookeeper or Cassandra,
Zookeeper cluster availability or lack of a Zookeeper leader,
missing API proxy dependencies such as virtual hosts,
target servers or caches,
connectivity between the Edge UI and Management Server,
and finally, router and message processor state.
API proxies will experience
deployment errors if routers or message processors are offline,
although the code will automatically deploy to them when they come back online.
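When a deployment fails, the management API's deployments resource shows the revision and per-server state for each environment, which helps identify the router or message processor that did not receive the bundle. A sketch with placeholder names and credentials:

```python
# Query the deployment state of an API proxy through the management API.
# ORG, API, host, and credentials are placeholders.
import base64
import json
import urllib.request

HOST = "https://api.enterprise.apigee.com"
ORG, API = "my-org", "my-proxy"
AUTH = base64.b64encode(b"user@example.com:password").decode()

req = urllib.request.Request(f"{HOST}/v1/organizations/{ORG}/apis/{API}/deployments")
req.add_header("Authorization", "Basic " + AUTH)
with urllib.request.urlopen(req, timeout=30) as resp:
    deployments = json.load(resp)

# The response lists each environment, revision, and per-server status,
# showing which router or message processor is missing the revision.
print(json.dumps(deployments, indent=2))
```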
To troubleshoot access to the Management Server or Edge UI,
consider these items: Edge UI service availability,
connectivity between the Edge UI and Management Server, OpenLDAP availability,
connectivity between the Management Server and OpenLDAP,
Zookeeper or Cassandra, incorrect credentials,
and finally, a locked user account.
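Availability problems and credential problems can be separated with two direct calls: one to the Management Server health endpoint and one authenticated call to the management API. A connection failure points at availability or the network path, while an HTTP 401 points at credentials or a locked account. The host and credentials below are placeholders:

```python
# Separate availability problems from credential problems with two calls.
# The Management Server host and the credentials are placeholders.
import base64
import urllib.error
import urllib.request

HOST = "http://mgmt.example.com:8080"
AUTH = base64.b64encode(b"user@example.com:password").decode()

# 1. Is the Management Server responding at all?
with urllib.request.urlopen(HOST + "/v1/servers/self/up", timeout=5) as resp:
    print("management server up:", resp.read().decode().strip())

# 2. Do the credentials authenticate? An HTTP 401 points at bad
#    credentials or a locked account rather than the network path.
req = urllib.request.Request(HOST + "/v1/organizations")
req.add_header("Authorization", "Basic " + AUTH)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print("authenticated; organizations:", resp.read().decode())
except urllib.error.HTTPError as err:
    print("authentication check failed with HTTP", err.code)
```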
If you are unable to access the Developer Portal or see data on it,
check these areas: Developer Portal frontend or database availability,
connectivity between the Developer Portal and its database,
connectivity between the Developer Portal and
the Management Server, Management Server availability,
the credentials the Developer Portal uses to connect to the management API,
the existence of apps and developer information in the management API,
and finally, changes to or deprecation of APIs and API products.
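You can replay the portal's view of the management API directly, using the same credentials the portal is configured with. If the calls fail, suspect credentials or connectivity; if they succeed but return empty lists, the developer and app data is missing on the Apigee side. All names below are placeholders:

```python
# Replay the portal's view of the management API using the credentials
# the portal is configured with. All names here are placeholders.
import base64
import json
import urllib.request

HOST = "https://api.enterprise.apigee.com"
ORG = "my-org"
AUTH = base64.b64encode(b"portal-user@example.com:password").decode()

def get(path):
    req = urllib.request.Request(HOST + path)
    req.add_header("Authorization", "Basic " + AUTH)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# Failures here implicate the portal's credentials or connectivity;
# empty lists mean the developer/app data is missing on the Apigee side.
print("developers:  ", get(f"/v1/organizations/{ORG}/developers"))
print("api products:", get(f"/v1/organizations/{ORG}/apiproducts"))
```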
If you are unable to start an Apigee service,
focus on these items: component installation problems
or missing configuration files, Zookeeper availability
(Zookeeper is required to start all higher-level Edge services),
Cassandra availability, file or directory permissions,
and finally, disk utilization.
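Before restarting a service, it can help to confirm two of the dependencies called out above. The sketch below probes ZooKeeper with its standard 'ruok' four-letter command (on newer ZooKeeper versions this command must be whitelisted) and reports disk utilization for the default /opt/apigee install root; the host, port, and path are assumptions:

```python
# Precondition checks before (re)starting an Edge service: probe
# ZooKeeper with its standard 'ruok' command and report disk usage.
# Host, port, and install path are common defaults, not guarantees.
import shutil
import socket

def zookeeper_ok(host="localhost", port=2181, timeout=5):
    """Send ZooKeeper's 'ruok' probe; a healthy server replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:
        return False

print("zookeeper ok:", zookeeper_ok())

usage = shutil.disk_usage("/opt/apigee")  # default Edge install root
print(f"disk used: {usage.used / usage.total:.0%}")
```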
For more information on this topic,
refer to our documentation.
If you have any questions,
please post them on our community. Thanks for watching.