Architecture

This section outlines the architecture of the PIARA platform, detailing its core internal components, the technologies used for data storage and processing, and its key external dependencies and integrations. The architecture is designed for scalability, flexibility, and efficient handling of large volumes of data.

The diagram below provides a high-level visual overview of how these components interact.

Figure: High-Level Architecture. PIARA core components, data stores (PostgreSQL, Elasticsearch, ArangoDB, Redis, S3, Kafka), workers, connectors, pipelines, and external integrations.

Core Components

These are the main functional blocks built into the PIARA platform:

  • API: PIARA exposes a RESTful API for programmatic interaction with its data and functionalities. This serves as the primary interface for integrations, scripts, and potentially the platform's own user interface.

  • Workers (also known as Feeders): These are versatile background components that form the backbone of PIARA's data processing and external interaction capabilities. They consume tasks from the Task Queue (Kafka) and are responsible for a wide range of duties, including:

    • Data Processing: Sending and receiving data from various sources. This includes interacting with TAXII endpoints, connecting to non-TAXII third-party services (acting as specific "connectors" for those sources), and synchronizing data between different PIARA instances.

    • General Task Execution: Performing other miscellaneous background jobs required by the platform.

    • Enrichment: Enhancing intelligence data with AI, using models from OpenAI, Ollama, or Google Gemini.

    • Translation: Translating text content within intelligence data using Systran or Bing Translate.

  • Task Queue (Kafka): Acts as the central nervous system for asynchronous operations. It manages the distribution of tasks (e.g., data ingestion requests, enrichment jobs, synchronization tasks) to available Workers.
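The worker and task queue relationship described above can be sketched as a simple dispatch loop. This is only an illustration under stated assumptions: the standard-library `queue.Queue` stands in for Kafka, and the task types, payload fields, and handler names are hypothetical, not PIARA's actual task schema.

```python
import queue

# Hypothetical handlers; the real task schema is not documented in this section.
def handle_ingest(payload):
    return f"ingested {payload['source']}"

def handle_enrich(payload):
    return f"enriched {payload['object_id']}"

def handle_translate(payload):
    return f"translated {payload['object_id']} to {payload['target_lang']}"

# Map each (assumed) task type to the function that processes it.
HANDLERS = {
    "ingest": handle_ingest,
    "enrich": handle_enrich,
    "translate": handle_translate,
}

def run_worker(task_queue):
    """Drain the queue, dispatching each task to its handler by type."""
    results = []
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to process
        handler = HANDLERS.get(task["type"])
        if handler is None:
            results.append(f"skipped unknown task type {task['type']!r}")
        else:
            results.append(handler(task["payload"]))
    return results

# An in-memory queue stands in for the Kafka-backed Task Queue.
tasks = queue.Queue()
tasks.put({"type": "ingest", "payload": {"source": "taxii-feed"}})
tasks.put({"type": "enrich", "payload": {"object_id": "indicator--1"}})
results = run_worker(tasks)
```

In a real deployment the loop would block on a Kafka consumer rather than exit when the queue is empty; the dispatch-by-type structure is the part this sketch is meant to show.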

Data Storage & Persistence

PIARA utilizes a multi-datastore strategy, leveraging different technologies best suited for specific data types and access patterns:

  • Relational & Configuration Data (PostgreSQL): Stores essential relational data, including system configuration, user accounts, roles, permissions, and controlled vocabularies (taxonomies, lists). In small setups, it might also hold primary STIX data.

  • Primary Intelligence Data (Elasticsearch): The main engine for storing, indexing, and searching the bulk of STIX intelligence objects (Indicators, Observables, Reports, Malware, Actors, etc.) and their textual content at scale. Optimized for fast search and complex queries.

  • Graph Data (ArangoDB): Specifically used to store and query the complex relationships between STIX objects. Enables powerful graph traversals and discovery of non-obvious connections within the intelligence data. Graph storage is currently in beta.

  • File/Object Storage (S3-Compatible): Stores files and images attached to STIX objects. Compatible with AWS S3, MinIO, and other S3-compatible object storage solutions.

  • Cache (Redis): An in-memory data store used for caching frequently accessed data, managing user sessions, and facilitating real-time operations, reducing load on primary databases and improving performance.
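How a cache reduces load on the primary databases can be illustrated with a cache-aside lookup. This is a minimal sketch, not PIARA's implementation: plain dictionaries stand in for both Redis and the primary store, and the class name and key format are hypothetical.

```python
class CachedLookup:
    """Cache-aside read path: check the cache first, fall back to the primary store."""

    def __init__(self, primary_store):
        self.primary_store = primary_store  # stand-in for PostgreSQL or Elasticsearch
        self.cache = {}                     # stand-in for Redis
        self.primary_reads = 0              # counts how often the primary store is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # cache hit: primary store is not touched
        self.primary_reads += 1
        value = self.primary_store.get(key)
        if value is not None:
            self.cache[key] = value         # populate the cache for later reads
        return value

# Hypothetical key and value, for illustration only.
lookup = CachedLookup({"vocab:example": ["term-a", "term-b"]})
first = lookup.get("vocab:example")
second = lookup.get("vocab:example")
```

After the two reads, `lookup.primary_reads` is 1: the second read is served from the cache. A production cache would also need expiry and invalidation, which this sketch omits.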

Storage Architecture Options

PIARA offers two distinct operational models, primarily differing in their approach to storing relationship data and their suitability for different scales and analytical use cases:

1. Graph-Enhanced Storage (ArangoDB + Elasticsearch + PostgreSQL)

  • Target Environment: Deployments prioritizing powerful graph analysis capabilities for datasets up to a moderate scale (e.g., typically suitable for up to tens of millions of objects). Ideal for analysts who need deep link analysis and visual graph traversal.

  • Architecture: This model actively uses all three databases. The core STIX intelligence data (objects and relationships) is replicated across ArangoDB, Elasticsearch, and PostgreSQL, which allows each query to use the database best suited to it. PostgreSQL also manages auxiliary data (configuration, users, etc.).

  • Benefits: Offers maximum flexibility in querying the intelligence data:

    • Specialized graph traversal and analysis via ArangoDB.

    • Powerful full-text search, filtering, and aggregations via Elasticsearch.

  • Considerations: Requires provisioning and managing resources for three database systems actively storing the core intelligence data. Performance characteristics and scalability, especially at larger data volumes (e.g., tens of millions of objects), may vary.
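The replication described for this model amounts to writing each object to every store. The sketch below shows that fan-out with in-memory stand-ins; the `InMemoryStore` class, the `replicate` function, and the placeholder object ID are assumptions for illustration, and the section does not say how PIARA actually orders or coordinates these writes.

```python
class InMemoryStore:
    """Stand-in for one database; keeps documents in a dict keyed by object ID."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def upsert(self, obj):
        self.docs[obj["id"]] = obj

def replicate(obj, stores):
    """Write one object to every store and return the names of the stores written."""
    written = []
    for store in stores:
        store.upsert(obj)
        written.append(store.name)
    return written

stores = [
    InMemoryStore("arangodb"),
    InMemoryStore("elasticsearch"),
    InMemoryStore("postgresql"),
]
# Placeholder object; a real STIX ID uses a UUID after the "--".
indicator = {"type": "indicator", "id": "indicator--example-1", "name": "Example indicator"}
written = replicate(indicator, stores)
```

Each write here multiplies by three, which is one way to read the resource consideration above: every stored object costs capacity in all three systems.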

2. Elasticsearch-Optimized Storage (Recommended for Maximum Scalability)

  • Target Environment: Deployments requiring maximum ingestion performance and query scalability for very large data volumes (e.g., hundreds of millions or billions of objects), where standard search capabilities are sufficient.

  • Architecture: This model optimizes for scale by using Elasticsearch as the primary engine for storing and querying almost all intelligence data, including relationships:

    • Elasticsearch: Stores STIX objects, their content, and their relationships. It handles all search, discovery, and analysis queries using its native capabilities.

    • ArangoDB: Is not utilized in this configuration.

    • PostgreSQL: Plays a limited supporting role, storing auxiliary data like configuration, users, roles, and vocabularies (as detailed in the 'Data Storage & Persistence' section).

  • Benefits: Maximizes ingestion rates and query performance for extremely large datasets by relying solely on Elasticsearch's horizontally scalable architecture.

  • Trade-offs: Lacks the specialized graph traversal capabilities provided by ArangoDB. Relationship analysis relies on Elasticsearch's standard search and aggregation features.
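The trade-off can be made concrete with a small sketch. When relationships are stored as flat documents, as they might be in a search index, a multi-hop question is answered by repeating a one-hop lookup for each node on the frontier. The documents, the `neighbors` function, and the query counting below are illustrative assumptions, not PIARA's actual index layout or query code.

```python
# Relationship documents stored flat; IDs are shortened placeholders.
RELATIONSHIPS = [
    {"source_ref": "indicator--a", "relationship_type": "indicates", "target_ref": "malware--b"},
    {"source_ref": "malware--b", "relationship_type": "uses", "target_ref": "infrastructure--c"},
    {"source_ref": "threat-actor--d", "relationship_type": "uses", "target_ref": "malware--b"},
]

def neighbors(object_id):
    """One lookup: every object directly related to object_id, in either direction."""
    found = set()
    for rel in RELATIONSHIPS:
        if rel["source_ref"] == object_id:
            found.add(rel["target_ref"])
        elif rel["target_ref"] == object_id:
            found.add(rel["source_ref"])
    return found

def within_hops(start, max_hops):
    """Collect objects reachable within max_hops, counting the lookups issued."""
    seen = {start}
    frontier = {start}
    lookups = 0
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            lookups += 1
            next_frontier |= neighbors(node) - seen
        seen |= next_frontier
        frontier = next_frontier
    return seen - {start}, lookups

reachable, lookup_count = within_hops("indicator--a", 2)
```

The number of lookups grows with the size of each frontier, whereas a graph database can typically express the same traversal as a single query.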

Choosing the appropriate storage architecture depends on the expected data volume, query load, and operational requirements of the specific PIARA deployment. The Elasticsearch-Optimized model is generally recommended for typical operational use.
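As a rough illustration of this decision, the guidance above can be written as a small helper. The function name and the numeric cutoff are assumptions: the text only says "tens of millions of objects" for the graph-enhanced model, so the 50 million figure below is a placeholder, not a documented limit.

```python
# Assumed cutoff standing in for "tens of millions of objects".
GRAPH_MODEL_MAX_OBJECTS = 50_000_000

def choose_storage_model(expected_objects, needs_graph_traversal):
    """Suggest a storage model from expected volume and the need for graph analysis."""
    if needs_graph_traversal and expected_objects <= GRAPH_MODEL_MAX_OBJECTS:
        return "graph-enhanced"
    # Default, in line with the general recommendation above.
    return "elasticsearch-optimized"

small_with_graph = choose_storage_model(5_000_000, needs_graph_traversal=True)
large_with_graph = choose_storage_model(800_000_000, needs_graph_traversal=True)
small_no_graph = choose_storage_model(5_000_000, needs_graph_traversal=False)
```

Query load and operational requirements also matter and are not modeled here; the helper only captures the two factors that the model descriptions quantify.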
