Debug Node Registration Issues

This guide helps you debug issues that occur when registering nodes (devices) with a private ESP RainMaker deployment. Node registration is typically performed using the ESP RainMaker Admin CLI, which generates device certificates and bulk-registers nodes via an AWS Batch job.

Before you start, have these ready:

  • The request_id returned after running certs devicecert register
  • The node_id(s) of the affected nodes
  • The admin user ID (email) used to run the registration
  • Approximate time when the registration was triggered

Step 1: Identify Your Symptom

Symptom | Go to
Admin CLI generate command fails with an error | Admin CLI — Certificate Generation Errors
Admin CLI register command fails before submitting the job | Admin CLI — Registration Submission Errors
Registration job submitted but no confirmation email received | Registration Job Submitted — No Email or Status Unknown
getcertstatus shows FAILURE or some nodes failed | Registration Job Failed or Partial Failures
Registration job is stuck in REQUESTED or INPROGRESS for too long | Registration Job Stuck or Timed Out
Nodes registered but not visible on the RainMaker Dashboard | Nodes Not Visible on the RainMaker Dashboard
Node is registered but the device cannot connect to the cloud | Node Registered but Device Cannot Connect
Getting a specific error code (106xxx / 200xxx) | Error Code Reference

Node Registration Overview

Understanding the flow helps you identify at which stage a failure occurred.

Step 1 Admin CLI generates certificates locally → node_certs.csv
Step 2 CLI calls GET /admin/node_certificates/register → gets S3 pre-signed URL + request_id
Step 3 CLI uploads node_certs.csv to S3
Step 4 CLI calls POST /admin/node_certificates/register → triggers AWS Batch job
Step 5 AWS Batch processes each node: creates IoT Thing, registers certificate, attaches policy, writes to DynamoDB nodes_v3 table
Step 6 Admin receives email with job summary
Step 7 Device boots, connects to MQTT, publishes config → node visible on dashboard

Admin CLI — Certificate Generation Errors

These errors occur when running python rainmaker_admin_cli.py certs devicecert generate.

Check 1: Verify the command arguments

Error Message | Cause | Fix
"Maximum of 50,000 nodes generation supported in a single request." | --count exceeds 50,000 | Split into multiple batches with --count ≤ 50000
"<count> must be > 0" | Count is zero or negative | Provide a valid --count value
"'node_id' column not found in file" | --inputfile CSV is missing the node_id column | Ensure the input CSV has a header row with node_id as the column name
"CA key file is not provided" / "CA cert file is not provided" | Only one of --cacertfile / --cakeyfile was given | Provide both --cacertfile and --cakeyfile together
"At least one of the following must be provided: --count, ADDITIONAL_VALUES, --inputfile" | No node count source specified | Provide --count, --inputfile, or configure ADDITIONAL_VALUES in config/binary_config.ini

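The node_id header requirement can be checked before running generate. The following is a minimal sketch, not part of the Admin CLI; the function name has_node_id_column is hypothetical, and it assumes the header must match node_id exactly, as the error message suggests.

```python
import csv
import io


def has_node_id_column(csv_text):
    """Return True if the CSV header row contains a column named exactly node_id."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return "node_id" in header


# Example with inline data; for a real file, pass open(path, newline="").read().
print(has_node_id_column("node_id\nnode001\nnode002\n"))
print(has_node_id_column("id\nnode001\n"))
```
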
Check 2: Verify the output directory

After a successful generate, confirm these files exist in the output directory:

<outdir>/<date>/Mfg-<N>/
    common/
        node_certs.csv ← required for the next `register` step
        ca.crt ← CA certificate
        node_ids.csv ← list of generated node IDs
        endpoint.txt ← MQTT broker hostname
    node_details/
        node-<idx>-<node_id>/
            node.crt ← device certificate
            node.key ← device private key

If any of these files are missing, re-run generate. If the output directory is missing entirely, the tool failed before writing any files — check for Python exceptions in the terminal output.
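
For large batches, checking the layout by hand is tedious. The sketch below walks an Mfg-<N> directory and reports which of the files listed above are absent. It is an illustration only: the function name missing_outputs is hypothetical, and the file list is taken from the layout shown in this section.

```python
from pathlib import Path

# File names taken from the layout above.
COMMON_FILES = ("node_certs.csv", "ca.crt", "node_ids.csv", "endpoint.txt")
NODE_FILES = ("node.crt", "node.key")


def missing_outputs(mfg_dir):
    """Return relative paths of expected files that are absent under an Mfg-<N> directory."""
    root = Path(mfg_dir)
    missing = [f"common/{name}" for name in COMMON_FILES
               if not (root / "common" / name).is_file()]
    node_root = root / "node_details"
    if not node_root.is_dir():
        missing.append("node_details/")
        return missing
    # Every node directory should hold a certificate and a private key.
    for node_dir in sorted(node_root.glob("node-*")):
        for name in NODE_FILES:
            if not (node_dir / name).is_file():
                missing.append(f"node_details/{node_dir.name}/{name}")
    return missing
```

Calling missing_outputs("<outdir>/<date>/Mfg-<N>") with the real path returns an empty list when everything is in place.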

Note

The node_certs.csv in common/ is the input file for the register command. Use the full path when calling register --inputfile.


Admin CLI — Registration Submission Errors

These errors occur when running python rainmaker_admin_cli.py certs devicecert register.

Check 1: Validate the input CSV

Error Message | Cause | Fix
"Input file is invalid. Please provide file containing the certificates" | CSV has no certs column or all cert values are empty | Use the node_certs.csv generated by the generate step
"Column count mismatch in row N" | The CSV has inconsistent column counts | Open the CSV in a text editor and fix the row with index N
"Certificate CN 'X' does not match node_id 'Y'" | The certificate's Common Name does not match the node_id column | Regenerate the certificates; a CN mismatch means the cert and node ID are from different batches
"Invalid CSV file" (error 106026) | CSV format is malformed | Validate the CSV with a CSV linter; check for unescaped quotes or missing commas

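Two of the checks above (missing or empty certs values, inconsistent column counts) can be approximated locally before uploading. This is a rough sketch rather than the validation the CLI or backend actually performs: find_csv_problems is a hypothetical helper, and it numbers data rows from 1 after the header, which may differ from the row index the CLI reports.

```python
import csv
import io


def find_csv_problems(csv_text):
    """Return a list of human-readable problems found in a node_certs.csv-style file."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    if "certs" not in header:
        problems.append("missing 'certs' column")
    for index, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            problems.append(
                f"row {index}: expected {len(header)} columns, found {len(row)}")
        elif "certs" in header and not row[header.index("certs")].strip():
            problems.append(f"row {index}: empty certs value")
    return problems
```

The csv module handles quoted multi-line PEM values, so a certificate that spans several lines inside quotes still counts as one field.
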
Check 2: Validate tags and policies

Error Message | Cause | Fix
"Invalid tags specified by user. Check tags format." | Tags are not in key:value format | Use --tags key1:value1,key2:value2
"Invalid tags specified by user. Check whether the tags are referencing the proper column names." | A tag references a CSV column that doesn't exist | Ensure the column name in --tags key:@column_name exactly matches a column in the CSV
"--node_policies option cannot be used together with --update_nodes." | Conflicting flags | Remove --node_policies when using --update_nodes
"Invalid value for --node_policies" | Unknown policy name | Valid values are mqtt and videostream

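The two tag rules above (key:value format, and @column references that must exist in the CSV) can be checked before submitting. The sketch below is an approximation under those two rules only; check_tags is a hypothetical helper and the CLI may apply further checks not covered here.

```python
def check_tags(tags_arg, csv_columns):
    """Return problems found in a --tags value such as 'key1:value1,key2:@column'."""
    problems = []
    for pair in tags_arg.split(","):
        key, sep, value = pair.partition(":")
        if not sep or not key.strip() or not value.strip():
            problems.append(f"'{pair}' is not in key:value format")
        elif value.startswith("@") and value[1:] not in csv_columns:
            # Column names must match exactly, including case.
            problems.append(f"'{pair}' references unknown column '{value[1:]}'")
    return problems


# Example with hypothetical column names.
print(check_tags("batch:b1,model:@model", ["node_id", "certs", "model"]))
```
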
Check 3: Verify connectivity and authentication

Error Message | Cause | Fix
"Could not connect. Please check your Internet connection." | Admin CLI cannot reach the RainMaker backend | Check your internet connection; verify the server endpoint is correct: account serverconfig
"HTTP Request timed out." | Request took longer than 30 seconds | Retry. If this persists, check if the backend is reachable
"Failed to upload Device Certificates" | S3 pre-signed URL upload failed | The pre-signed URL may have expired (1-hour validity). Re-run register to get a fresh URL
"Request to register device certificate failed" | The POST /admin/node_certificates/register API call failed | Check the exact HTTP error code in the output. Run with verbose logging if available
"Unable to verify SSL certificate." | TLS verification failed | Verify that rmaker_admin_lib/server_cert/server_cert.pem is the correct certificate for your deployment

Tip

When you successfully submit the registration job, the CLI prints a request_id. Save this value — you need it to check the job status later using getcertstatus --requestid <request_id>.


Registration Job Submitted — No Email or Status Unknown

If the registration job was submitted but you haven't received a confirmation email, or the status is unclear, follow these steps.

Step 1: Check the job status using the CLI

Run:

python rainmaker_admin_cli.py certs devicecert getcertstatus --requestid <request_id>

  • success → Job completed. All nodes registered. If nodes are still not visible, see Nodes Not Visible on the RainMaker Dashboard.
  • in_progress → Job is still running. Wait and check again. Large batches can take up to 10 hours.
  • failure → Job failed. See Registration Job Failed or Partial Failures.
  • No output / error → The request_id may be invalid, or the entry expired in DynamoDB (entries are kept for a limited time). Verify the request_id and check DynamoDB directly (Step 2).

Step 2: Check the request record in DynamoDB

Go to AWS Console → DynamoDB → Tables → admin_node_registration_requests.

Query with:

  • Partition key (user_id): the admin user's Cognito user ID
  • Sort key (request_id): the request ID from the CLI

What to look for:

Field | What it tells you
status | Current job state: REQUESTED, INPROGRESS, SUCCESS, FAILURE
total_count | Total nodes in the uploaded CSV
completed_count | Nodes successfully registered so far
failed_count | Nodes that failed registration
request_timestamp | When the job was submitted

If no entry is found with that request_id, the job was never submitted to DynamoDB. The POST /admin/node_certificates/register API call likely failed silently. Re-run the register command.
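
Once you have the record, the counts can be turned into a one-line progress summary. The sketch below assumes the record is available as a plain dictionary with the field names from the table above, and that nodes not yet counted as completed or failed are still pending; summarize_request is a hypothetical helper.

```python
def summarize_request(record):
    """Summarize a registration request record using the fields listed above."""
    total = int(record.get("total_count", 0))
    completed = int(record.get("completed_count", 0))
    failed = int(record.get("failed_count", 0))
    status = record.get("status", "UNKNOWN")
    # Assumption: anything not yet completed or failed has not been processed.
    pending = max(total - completed - failed, 0)
    return (f"{status}: {completed}/{total} registered, "
            f"{failed} failed, {pending} not yet processed")


# Example with made-up numbers.
print(summarize_request({"status": "INPROGRESS", "total_count": 100,
                         "completed_count": 40, "failed_count": 5}))
```

The int() conversion also accepts the Decimal values that DynamoDB client libraries commonly return for numeric attributes.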

Step 3: Check the Lambda log for submission errors

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-CertificateRegister, and run:

fields @timestamp, @message
| sort @timestamp asc
| filter @message like "<request_id>"

Look for:

  • Successful job submission: message containing "Submitted batch job" or the job ID
  • Any error messages indicating why the submission failed
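
The same query shape recurs throughout this guide with a different search string each time. A small helper, sketched below, can build it; insights_query is a hypothetical name, the limit clause is an addition, and the filter is placed before the sort, which returns the same rows as the query above.

```python
def insights_query(needle, newest_first=False, limit=100):
    """Build a CloudWatch Logs Insights query that filters messages by a substring."""
    if '"' in needle or "\\" in needle:
        # Keep the sketch simple: refuse strings that would need escaping.
        raise ValueError("needle must not contain quotes or backslashes")
    order = "desc" if newest_first else "asc"
    return (
        "fields @timestamp, @message\n"
        f'| filter @message like "{needle}"\n'
        f"| sort @timestamp {order}\n"
        f"| limit {limit}"
    )


# Example with a placeholder request ID.
print(insights_query("<request_id>"))
```

The resulting string can be pasted into the Logs Insights console.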

Step 4: Check if the confirmation email was blocked

The confirmation email is sent via AWS SES. If the sender identity is not verified in SES for your deployment, emails may be silently dropped. The CLI checks SES status during register, so review the register output for any SES warning.

Also check AWS Console → SES → Verified identities to confirm the sender email is verified. If it is not, verify it and re-run the registration job.


Registration Job Failed or Partial Failures

Step 1: Get the overall failure summary

Check the status via CLI or DynamoDB as described above. Note the failed_count and completed_count fields in the admin_node_registration_requests table.

Step 2: Find which specific nodes failed

Go to AWS Console → DynamoDB → Tables → node_manufacturing_errors.

Query with partition key (request_id): the request ID.

This table contains one entry per failed node, with fields:

  • node_id — which node failed
  • error — the error message from AWS IoT Core or the batch container
  • request_id — links back to the registration job

Step 3: Check the AWS Batch job logs

The bulk registration runs inside an AWS Batch container. The container logs are the most detailed source of per-node errors.

  1. Go to AWS Console → Batch → Jobs.
  2. Filter by Job queue: thing-certificate-registration.
  3. Find your job by checking the submission time (matches request_timestamp in DynamoDB).
  4. Click the job → click Log stream to open the CloudWatch log stream.

The log stream is under the log group /aws/batch/job. Each node's registration attempt is logged here with the outcome.

What to look for in Batch logs:

Log message | Meaning
"Thing already exists" | A node with this ID is already registered. Use --force flag to allow re-registration
"Certificate is already Provisioned" | The same certificate was registered before. Use --force
"Error in registering certificate" | The certificate PEM is malformed or invalid. Regenerate the certificate for this node
"Invalid Certificate" | Certificate format error. Check for truncated PEM data in the CSV
"Error in creating thing" | AWS IoT Core CreateThing failed. Check IAM role permissions for the Batch job
"Node limit exceeded" | Your deployment's licensed node count is exhausted. Contact Espressif to increase the limit

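If you export the Batch log stream to text, the messages in the table above can be tallied to see which failure dominates. This is a sketch only: triage_log is a hypothetical helper, and the sample log lines are invented for illustration because the exact line format of the container logs is not documented here.

```python
from collections import Counter

# Message strings copied from the table above.
KNOWN_MESSAGES = (
    "Thing already exists",
    "Certificate is already Provisioned",
    "Error in registering certificate",
    "Invalid Certificate",
    "Error in creating thing",
    "Node limit exceeded",
)


def triage_log(lines):
    """Count how often each known failure message appears in the given log lines."""
    counts = Counter()
    for line in lines:
        for message in KNOWN_MESSAGES:
            if message in line:
                counts[message] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = ["node n1: Thing already exists", "node n2: Invalid Certificate",
          "node n3: Thing already exists", "node n4: registered"]
for message, count in triage_log(sample).most_common():
    print(count, message)
```
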
Step 4: Re-register failed nodes

After identifying and fixing the root cause:

  1. Extract the failed node_id values from the node_manufacturing_errors table.
  2. Create a new CSV containing only the failed nodes (with their certificates from node_details/).
  3. Re-run register --inputfile <new_csv> --force to register them without failing on any already-registered nodes.

Tip

The --force flag tells the server to skip duplicate node errors and continue registering remaining nodes. Use it when re-running a partially failed job.
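
One way to build the retry CSV is to filter the original node_certs.csv by the failed node IDs instead of reassembling it from node_details/. The sketch below takes that approach; filter_failed_rows is a hypothetical helper, and it assumes a well-formed CSV with a node_id column.

```python
import csv
import io


def filter_failed_rows(csv_text, failed_node_ids):
    """Return CSV text containing the header and only the rows for the failed node IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames or [],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["node_id"] in failed_node_ids:
            writer.writerow(row)
    return out.getvalue()


# Example with placeholder IDs and certificate values.
source = "node_id,certs\nn1,c1\nn2,c2\nn3,c3\n"
print(filter_failed_rows(source, {"n2"}))
```

For a real file, read it with open(path, newline="") and write the result to the new CSV passed to register --inputfile.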


Registration Job Stuck or Timed Out

The AWS Batch job has a maximum timeout of 10 hours (36000 seconds). For very large batches, the job can run close to this limit.

Step 1: Check the AWS Batch job status

  1. Go to AWS Console → Batch → Jobs.
  2. Filter by job queue thing-certificate-registration.
  3. Find the job matching your request_id (visible in the job name or environment variables).

Job Status | Meaning
SUBMITTED / PENDING | Job is queued, waiting for a compute instance
RUNNABLE | Job is waiting for compute capacity in the environment
STARTING / RUNNING | Job is actively processing
SUCCEEDED | All nodes processed
FAILED | Container exited with a non-zero code or hit the 10-hour timeout

If the job is stuck in PENDING or RUNNABLE for more than 10–15 minutes, the compute environment may not have capacity. Check:

  • AWS Console → Batch → Compute environments → ThingCertificateRegister: verify the environment is ENABLED and VALID.
  • Check if the EC2 Service Limit for the instance type is reached in your region.

Step 2: Check for Batch job timeout

If the job status is FAILED and the batch ran for exactly 10 hours, it hit the timeout. This typically happens with very large batches (tens of thousands of nodes).

Fix:

  • Split the CSV into smaller batches and register each separately.
  • The recommended batch size is 10,000–20,000 nodes per job.
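
Splitting can be scripted. The sketch below cuts a CSV into chunks of at most batch_size data rows, repeating the header in each chunk; split_csv is a hypothetical helper, not a CLI feature.

```python
import csv
import io


def split_csv(csv_text, batch_size):
    """Split CSV text into chunks of at most batch_size data rows, each with the header."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(data), batch_size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(data[start:start + batch_size])
        chunks.append(out.getvalue())
    return chunks
```

For example, split_csv(text, 10000) on a 25,000-node file yields three chunks of 10,000, 10,000, and 5,000 nodes, each of which can be written to its own file and registered separately.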

Step 3: Check CloudWatch for the Batch container logs

Go to CloudWatch → Log groups → /aws/batch/job and find the log stream for the failed job.

Look for:

  • The last completed_count logged before the job was killed — this tells you how many nodes were registered before the timeout.
  • Any specific error that caused the container to exit prematurely (e.g., DynamoDB throttling, IoT API rate limits).

Step 4: Check the DynamoDB request record

Check admin_node_registration_requests for the completed_count at the time of failure. Nodes with a lower index than completed_count are registered. Re-register only the remaining nodes using the --force flag.


Nodes Not Visible on the RainMaker Dashboard

Even after a successful bulk registration, nodes may not be visible on the dashboard until the device connects and sends its configuration. There are two distinct cases.

Case A: Node Registered but Never Appeared on Dashboard

Bulk registration creates the IoT Thing and certificate in AWS IoT Core and writes a record to DynamoDB nodes_v3. However, the full device configuration (name, type, firmware version, parameters) is only stored when the device itself publishes its config after first boot.

Step 1: Verify the node exists in DynamoDB

Go to AWS Console → DynamoDB → Tables → nodes_v3.

Query with partition key node_id.

  • Entry exists → Node is registered in the system. The dashboard should show it (possibly with limited info until the device publishes config). If it doesn't appear, check admin dashboard permissions.
  • No entry found → The bulk registration did not complete for this node. Check node_manufacturing_errors for this node ID and re-register it.

Step 2: Check if the node is in the pending registration table

Go to DynamoDB → Tables → admin_pending_registration_nodes.

Query with:

  • Partition key (user_id): the admin user ID
  • Sort key (node_id): the node ID

If the entry exists here but not in the admin dashboard view, the dashboard may need a refresh, or the node is awaiting the device to send its first config.

Case B: Device Booted but Node Config Not Updating

After the device boots and connects to MQTT, it should publish its configuration to the topic node/<node_id>/config. This triggers the esp-RegisterDevice Lambda, which stores the device config in DynamoDB.

Step 1: Verify the device published its config

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-RegisterDevice, and run:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like "<node_id>"

  • Entries found with no errors → Config was received and stored. Refresh the dashboard.
  • Entries found with errors → Note the error and check the device's config payload format.
  • No entries found → The device did not publish its config, or the MQTT rule esp_node_config is not routing messages to the SQS queue. See Node Registered but Device Cannot Connect.

Step 2: Check the SQS queue for stuck messages

If the device is publishing but the Lambda is not processing:

  1. Go to AWS Console → SQS → esp-deviceRegisterSQS.
  2. Check Messages available and Messages in flight.
  3. If there are messages in the Dead Letter Queue (esp-FailedMessageDLQ), click Send and receive messages → Poll for messages to inspect them.

Failed messages in the DLQ indicate the esp-RegisterDevice Lambda is failing to process them. Check the Lambda logs for errors.

Step 3: Check the Lambda log for config processing errors

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-RegisterDevice, and run:

fields @timestamp, @message
| filter @message like /error/i or @message like /failed/i
| sort @timestamp desc
| limit 50

Look for JSON parse errors or DynamoDB write failures that could cause the node config to not be stored.


Node Registered but Device Cannot Connect

If the node is registered in DynamoDB and AWS IoT Core, but the physical device cannot establish an MQTT connection:

Step 1: Verify the IoT Thing and certificate exist in AWS IoT Core

  1. Go to AWS Console → IoT Core → Manage → All devices → Things.
  2. Search for the node ID.
  3. Click the thing → go to Certificates tab.
  4. Confirm a certificate is attached and its status is Active.

If the certificate status is Inactive or Revoked, the device cannot connect.

Fix: Activate the certificate:

  • Click the certificate → Actions → Activate.

If no certificate is attached, the bulk registration may have created the Thing but failed to attach the certificate. Check node_manufacturing_errors for this node.

Step 2: Verify the IoT policy is attached

On the same certificate page, go to the Policies tab. Confirm the esp-rainmaker-iot-policy (or equivalent policy for your deployment) is attached.

If no policy is attached, the certificate carries no permissions: AWS IoT Core rejects the device's connect, publish, and subscribe requests, typically with an AUTH_ERROR.

Fix: Attach the policy:

  • Click Actions → Attach policy → select esp-rainmaker-iot-policy.

Step 3: Verify the device is using the correct certificate and key

The certificate (node.crt) and private key (node.key) must be flashed to the device from the same batch as the one registered with the cloud. If the device firmware uses different certificate files, it will not be able to authenticate.

Check that the NVS binary (bin/node-<idx>-<node_id>.bin) was flashed to the correct device.

Step 4: Check node connection logs

Go to CloudWatch → Logs Insights, select /aws/lambda/esp-ConnectionNode, and filter by <node_id>:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like "<node_id>"

Look for AUTH_ERROR or FORBIDDEN_ACCESS disconnect reasons, which indicate a certificate or policy issue.

See Debugging Node Connection Issues for a full guide on MQTT connection problems.

Step 5: Verify the device is connecting to the correct MQTT endpoint

The device must connect to the MQTT endpoint of your private RainMaker deployment, not the default Espressif endpoint. Confirm the endpoint.txt file generated during the generate step was used when building the firmware's NVS partition.

Run:

cat <outdir>/<date>/Mfg-<N>/common/endpoint.txt

Compare this with the MQTT host your device is configured to use.
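
The comparison is easy to get wrong when one side carries a scheme, port, or different letter case. The sketch below normalizes both values before comparing; same_endpoint is a hypothetical helper, the list of schemes it strips is an assumption, and the hostnames in the example are placeholders.

```python
def same_endpoint(endpoint_txt, device_host):
    """Compare the endpoint.txt contents with the device's configured MQTT host."""
    def normalize(value):
        value = value.strip().lower()
        # Assumed scheme prefixes; extend if your firmware config uses others.
        for prefix in ("mqtts://", "ssl://", "https://"):
            if value.startswith(prefix):
                value = value[len(prefix):]
                break
        # Drop any port and trailing slash or dot.
        return value.split(":")[0].rstrip("/.")
    return normalize(endpoint_txt) == normalize(device_host)


# Placeholder hostnames, for illustration only.
print(same_endpoint("example-ats.iot.us-east-1.amazonaws.com\n",
                    "mqtts://EXAMPLE-ats.iot.us-east-1.amazonaws.com:8883"))
```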


Error Code Reference

Bulk Node Creation Errors (106xxx)

Error Code | Message | Likely Cause and Fix
106001 | Node count should be > 0 and ≤ 10000 | Use --count between 1 and 10,000 per request
106004 | Request ID is not valid | The request_id passed to getcertstatus is wrong or expired
106007 | URL requested is expired | The S3 pre-signed URL timed out (1-hour validity). Re-run register
106008 | Error fetching pre-signed URL | Backend error. Retry the registration command
106009 | File name is missing | Provide --inputfile with a valid CSV path
106010 | Error submitting thing registration job | AWS Batch job submission failed. Check if the Batch compute environment is healthy
106011 | File md5 is missing | The CLI could not compute the MD5 of the CSV. Verify the file is readable
106016 | No registration request in progress | No active job for this request_id. The job may have already completed or the ID is wrong
106020 | Total registered nodes exceeds limit | Deployment's licensed node limit reached. Contact Espressif to increase the quota
106026 | Invalid CSV file | The uploaded CSV is malformed. Validate the file format
106031 | CSV must have columns: certs, node_id or CN | Ensure the CSV has node_id and certs columns
106033 | Node ID does not match certificate CN | Certificate was generated for a different node ID. Regenerate certificates
106036 | Invalid node policy | Valid values: mqtt, videostream
106037 | node_policies cannot be used with update_nodes | Remove --node_policies when using --update_nodes

Self-Claim / Device Registration Errors (200xxx)

Error Code | Message | Likely Cause and Fix
200001 | MAC Address is missing | mac_addr not provided to /claim/node
200009 | Claim does not exist | Node was not pre-claimed or the MAC address lookup failed
200019 | Error in creating thing | AWS IoT Core CreateThing call failed. Check IAM permissions for the claim Lambda
200020 | Certificate is already Provisioned | This certificate is already registered. Use --force to re-register
200021 | Error in registering certificate | Certificate PEM is invalid or the IoT API returned an error
200022 | Invalid Certificate | The certificate data is malformed or expired
200036 | Invalid node policy | Valid policies: mqtt, videostream

CloudWatch Log Groups Reference

Log Group | When to Use
/aws/lambda/esp-CertificateRegister | Check registration job submission, pre-signed URL generation, job trigger errors
/aws/lambda/esp-NodeIdGeneration | Check node ID generation status when using cloud-based ID generation
/aws/lambda/esp-RegisterDevice | Check if node config MQTT message was received and stored
/aws/lambda/esp-RegisterNode | Check HTTPS-based node config registration
/aws/lambda/esp-createAndRegisterThing | Check self-claim device registration errors
/aws/lambda/esp-ConnectionNode | Check device MQTT connect/disconnect events
/aws/batch/job | Check detailed per-node logs from the bulk registration Batch container

DynamoDB Tables Reference

Table | When to Check | Key to Query
admin_node_registration_requests | Check bulk job status, progress counts | user_id (partition), request_id (sort)
node_manufacturing_errors | Find which specific nodes failed in a batch job | request_id (partition), node_id (sort)
nodes_v3 | Verify a node is registered in the system | node_id (partition)
admin_pending_registration_nodes | Check nodes registered by admin but not yet claimed by a user | user_id (partition), node_id (sort)
