A crawler is a program that connects to a data store (source or target), works through a prioritized list of classifiers to determine the schema of your data, and then creates metadata tables in the AWS Glue Data Catalog.
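As a concrete illustration, here is a minimal boto3 sketch of defining a crawler against an S3 path. The crawler name, IAM role ARN, catalog database, and bucket path are placeholders, not values from this guide:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names: the crawler, IAM role, catalog database, and S3 path are assumptions.
glue.create_crawler(
    Name="example-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # role the crawler assumes
    DatabaseName="example_db",                               # Data Catalog database for the tables
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
)
```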
Data Discovery Tools: Crawlers are an essential component of the AWS Glue service. They act like automated detectives, scanning various data sources to understand the structure and content of your data.
Populating the Data Catalog: Crawlers extract metadata (data about data), which you can read back from the catalog as shown in the sketch after this list, including:
Schema (column names, data types)
File formats (CSV, JSON, Parquet, etc.)
Partitions
Other relevant properties
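Once a crawler has run, those properties can be inspected through the catalog API. A small sketch, assuming the placeholder database from above and a hypothetical table named raw:

```python
import boto3

glue = boto3.client("glue")

# "example_db" and "raw" are the hypothetical database/table names used earlier.
table = glue.get_table(DatabaseName="example_db", Name="raw")["Table"]

print([(c["Name"], c["Type"]) for c in table["StorageDescriptor"]["Columns"]])  # schema
print(table["StorageDescriptor"].get("InputFormat"))                            # file format hint
print([k["Name"] for k in table.get("PartitionKeys", [])])                      # partition keys
print(table.get("Parameters", {}))                                              # other properties
```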
Updating Metadata: Crawlers can be run on a schedule or on-demand to keep the AWS Glue Data Catalog up-to-date as your data evolves.
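A sketch of both modes with boto3, reusing the placeholder crawler name; the cron expression is just an illustrative daily schedule:

```python
import boto3

glue = boto3.client("glue")

# On-demand run.
glue.start_crawler(Name="example-crawler")

# Or attach a schedule (Glue cron syntax) so the catalog stays current,
# e.g. every day at 02:00 UTC.
glue.update_crawler(Name="example-crawler", Schedule="cron(0 2 * * ? *)")
```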
Key Benefits of AWS Glue Crawlers:
Automation: Crawlers save time and effort by automating the discovery of data structures and the updating of your data catalog, which is especially valuable for large datasets or evolving data sources.
Metadata Centralization: Crawlers populate the AWS Glue Data Catalog, which serves as a central repository for metadata. This lets various AWS services (like Athena, Redshift Spectrum, or Glue ETL jobs) understand and work with your data seamlessly; see the Athena sketch after this list.
Schema Inference: Glue Crawlers can often infer the schema of your data automatically, reducing manual configuration needed when creating tables in the catalog.
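To illustrate the centralization point above, Athena can query a crawler-created table directly through the Data Catalog. A minimal boto3 sketch, assuming the placeholder database and table names from the earlier sketches and an S3 results location you would supply:

```python
import boto3

athena = boto3.client("athena")

# Database and table names are the hypothetical ones from the crawler sketches;
# the output location must be an S3 path in your account.
athena.start_query_execution(
    QueryString="SELECT * FROM raw LIMIT 10",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```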
Types of Data Sources Supported:
Amazon S3: Crawl data stored in S3 Buckets in various formats.
Databases (Amazon RDS, Amazon DynamoDB, JDBC sources, etc.): Discover tables and their schemas in supported relational and NoSQL data stores.
Custom Classifiers: You can extend a crawler’s capabilities with custom classifiers to recognize new file formats or extract more specialized metadata.
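A sketch of registering a custom Grok classifier and attaching it to the placeholder crawler; the classifier name, classification label, and log pattern are assumptions for illustration:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical Grok classifier for a custom application-log format.
glue.create_classifier(
    GrokClassifier={
        "Classification": "app_logs",
        "Name": "app-log-classifier",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)

# Classifiers listed on the crawler are tried before the built-in ones.
glue.update_crawler(Name="example-crawler", Classifiers=["app-log-classifier"])
```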