RosettaDB Commands
Available Commands
- config: Manage RosettaDB configuration settings.
- init: Initialize a new RosettaDB project with required configuration files.
- validate: Validate database connections.
- drivers: List and manage supported database drivers.
- extract: Extract schema metadata from a source database.
- compile: Compile DBML models into target DDL statements.
- apply: Apply generated DDL to the target database.
- diff: Compare and display differences between the DBML model and the database.
- test: Run data quality and validation tests against your database.
- dbt: Generate dbt models for analytics workflows.
- generate: Generate Spark code for data transfers (Python or Scala).
- query: Explore and query your data using AI-driven capabilities.
config
By default, Rosetta expects the YAML config file to be named main.conf and looks for it in the current folder. You can point to a different configuration file with the --config, -c command line argument (see Command Line Arguments below for more details).
The following configuration options are available in the main.conf file:
connections:
  # The name of the connection
  - name: bigquery_prod
    # The name of the default database to use
    databaseName: bigquery-public-data
    # The name of the default schema to use
    schemaName: breathe
    # The type of the database
    dbType: bigquery
    # The connection uri for the database
    url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType= [AuthValue];[Property1]=[Value1];[Property2]=[Value2];...
    # The name of the database user
    userName: user
    # The password of the database user
    password: password
    # The name of tables to include which is optional
    tables:
      - table_one
      - table_two
You can also use environment variables in the YAML config file. For example:
connections:
  - name: snowflake_weather_prod
    databaseName: SNOWFLAKE_SAMPLE_DATA
    schemaName: WEATHER
    dbType: snowflake
    url: jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?<connection_params>
    userName: ${USER}
    password: ${PASSWORD}
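As a rough illustration of what ${NAME} placeholders in a config file mean, the sketch below resolves them against a mapping of environment variables. It is not Rosetta's own code: the function name resolve_placeholders and the choice to raise an error for an unset variable are assumptions made for this example.

```python
import os
import re

# Matches ${NAME} placeholders such as ${USER} or ${PASSWORD}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(text, env=None):
    """Replace every ${NAME} in text with the value of NAME from env."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            # Assumption: an unset variable is treated as an error in this sketch.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, text)


print(resolve_placeholders("userName: ${USER}", {"USER": "bob"}))  # userName: bob
```

Passing an explicit mapping instead of reading os.environ directly keeps the sketch easy to try without touching the real environment.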
Using External Translator and Custom Attributes
RosettaDB supports custom translators and translation attributes, allowing users to define or extend database-specific configurations via external CSV files.
- External Translator: Users can specify a custom CSV file for translations by setting the EXTERNAL_TRANSLATION_FILE environment variable. This file allows adjustments in how database schemas are interpreted.
- Translation Attributes: Additional attributes like ordinalPosition, autoincrement, nullable, and primaryKey can be defined in a separate attributes CSV file. Set the EXTERNAL_TRANSLATION_ATTRIBUTE_FILE environment variable to the file’s location to apply these attributes.
- Indices: Rosetta supports index definitions in databases like Google Cloud Spanner, configured directly in model.yaml files to manage primary and secondary keys effectively.
For detailed setup instructions and examples, refer here.
Safety Operation
In model.yaml you can find the attribute safeMode, which is disabled (false) by default. If you want to prevent any DROP operation during the apply command, set safeMode: true.
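One way to picture the effect of safeMode is as a filter over the DDL statements that apply would otherwise run. The sketch below is only an illustration under that assumption: the function split_by_safe_mode is hypothetical, and how Rosetta itself detects or reports DROP operations is not described here.

```python
import re

# Any statement containing the keyword DROP counts as a DROP operation here,
# which covers both DROP TABLE and ALTER TABLE ... DROP COLUMN.
_DROP = re.compile(r"\bDROP\b", re.IGNORECASE)


def split_by_safe_mode(statements, safe_mode):
    """Return (allowed, blocked) DDL statements for the given safeMode value."""
    if not safe_mode:
        return list(statements), []
    allowed = [s for s in statements if not _DROP.search(s)]
    blocked = [s for s in statements if _DROP.search(s)]
    return allowed, blocked


ddl = [
    "ALTER TABLE actor ADD COLUMN first_name VARCHAR(45)",
    "ALTER TABLE actor DROP COLUMN last_name",
    "DROP TABLE address",
]
allowed, blocked = split_by_safe_mode(ddl, safe_mode=True)
print(allowed)  # only the ADD COLUMN statement
print(blocked)  # the two statements that contain DROP
```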
Operation level
In model.yaml you can find the attribute operationLevel, which is set to schema by default. If you want the apply command to apply changes at the database level instead of to the specific schema in your model, set operationLevel: database.
Fallback Type
In model.yaml you can define the attribute fallbackType for columns that are of custom types, not supported for translations, or not included in the translation matrix.
If a given column type cannot be translated, the fallbackType is used for the translation. fallbackType is optional.
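The fallback rule can be sketched as a lookup with a default. The translation table below is a made-up two-entry excerpt, and translate_type is a hypothetical helper, not part of Rosetta; the point is only the order of the checks.

```python
# Hypothetical excerpt of a translation matrix: source type -> target type.
TRANSLATIONS = {"INT64": "INTEGER", "STRING": "VARCHAR"}


def translate_type(type_name, fallback_type=None):
    """Translate a column type, using fallback_type when no mapping exists."""
    if type_name in TRANSLATIONS:
        return TRANSLATIONS[type_name]
    if fallback_type is not None:
        # The column declared a fallbackType, so use it instead of failing.
        return fallback_type
    raise ValueError(f"no translation for type {type_name!r} and no fallbackType set")


print(translate_type("INT64"))                 # INTEGER
print(translate_type("GEOGRAPHY", "VARCHAR"))  # VARCHAR
```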
Google Cloud Spanner JDBC Fix
Note: If you face one of the following errors with Google Cloud Spanner JDBC
java.sql.SQLException: No suitable driver
or
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
you can fix it by running the following command where your driver is located:
zip -d google-cloud-spanner-jdbc-2.6.2-single-jar-with-dependencies.jar 'META-INF/*.SF' 'META-INF/*.RSA' 'META-INF/*SF'
init
This command generates a default configuration file with example connections for bigquery and snowflake, and the model directory. If PROJECT_NAME is specified, they are created inside a project directory with that name; otherwise they are created in the current directory.
rosetta init [PROJECT_NAME]
Parameter | Description |
---|---|
(Optional) PROJECT_NAME | Project name (directory) where the configuration file and model directory will be created. |
Example:
# example with 2 connections
connections:
  - name: snowflake_weather_prod
    databaseName: SNOWFLAKE_SAMPLE_DATA
    schemaName: WEATHER
    dbType: snowflake
    url: jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?<connection_params>
    userName: bob
    password: bobPassword
  - name: bigquery_prod
    databaseName: bigquery-public-data
    schemaName: breathe
    dbType: bigquery
    url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType= [AuthValue];[Property1]=[Value1];[Property2]=[Value2];...
    userName: user
    password: password
    tables:
      - bigquery_table
validate
This command validates the configuration and tests if rosetta can connect to the configured source.
rosetta [-c, --config CONFIG_FILE] validate [-h, --help] [-s, --source CONNECTION_NAME]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection name to extract schema from. |
drivers
This command lists the drivers defined in a drivers.yaml file. By choosing a driver, you can download it to the ROSETTA_DRIVERS directory, where it is automatically ready to use.
rosetta drivers [-h, --help] [-f, --file] [--list] <indexToDownload> [-dl, --download]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-f, --file DRIVERS_FILE | YAML drivers file path. If none is supplied it will use drivers.yaml in the current directory and then fall back to the default one. |
--list | Used to list all available drivers. |
-dl, --download | Used to download selected driver by index. |
indexToDownload | Chooses which driver to download depending on the index of the driver. |
Example (drivers.yaml)
- name: MySQL 8.0.30
  link: https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.30.zip
- name: Postgresql 42.3.7
  link: https://jdbc.postgresql.org/download/postgresql-42.3.7.jar
extract
This is the command that extracts the schema from a database and generates declarative DBML models that can be used for conversion to alternate database targets.
rosetta [-c, --config CONFIG_FILE] extract [-h, --help] [-s, --source CONNECTION_NAME] [-t, --convert-to CONNECTION_NAME]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection name to extract schema from. |
-t, --convert-to CONNECTION_NAME (Optional) | The target connection name to which the source DBML is converted. |
Example:
---
safeMode: false
databaseType: bigquery
operationLevel: database
tables:
  - name: "profiles"
    type: "TABLE"
    schema: "breathe"
    columns:
      - name: "id"
        typeName: "INT64"
        jdbcDataType: "4"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 10
        scale: 0
        precision: 10
        primaryKey: false
        nullable: false
        autoincrement: true
      - name: "name"
        typeName: "STRING"
        jdbcDataType: "12"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 255
        scale: 0
        precision: 255
        primaryKey: false
        nullable: false
        autoincrement: false
compile
The compile command generates DDL (Data Definition Language) statements for a target database based on the DBML (Database Markup Language) extracted from a source database by the previous (extract) command. It builds schemas and tables in the target database using the extracted database schema.
rosetta [-c, --config CONFIG_FILE] compile [-h, --help] [-t, --target CONNECTION_NAME] [-s, --source CONNECTION_NAME] [-d, --with-drop]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME (Optional) | The source connection name where models are generated. |
-t, --target CONNECTION_NAME | The target connection name to which the source DBML is converted. |
-d, --with-drop | Add statements to drop tables when generating DDL. |
Example:
CREATE SCHEMA breathe;
CREATE TABLE breathe.profiles(id INTEGER not null AUTO_INCREMENT, name STRING not null);
Example Command:
Assuming main.conf is present in your working directory and configured for both source and target connections, a basic usage example is as follows:
rosetta compile -s source_db_connection -t target_db_connection
This command:
- Connects to source_db_connection to retrieve DBML data.
- Converts the DBML into DDL specific to target_db_connection.
Additional Notes
- The --with-drop option should be used with caution, as it will delete existing tables in the target database.
- Ensure that the target connection name is correctly set in main.conf or passed directly as a parameter.
apply
The apply command compares the current database state with the model defined in your Rosetta project. It generates the necessary DDL to align the database with the model and applies the changes to the database. If the git_auto_commit setting in main.conf is set to true, Rosetta will also automatically commit and push the updated model to the associated Git repository.
rosetta [-c, --config CONFIG_FILE] apply [-h, --help] [-s, --source CONNECTION_NAME] [-m, --model MODEL_FILE]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. |
-m, --model MODEL_FILE (Optional) | The model file to use for apply. Default is model.yaml |
Example:
Actual Database (Current State)
---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1
Expected Database (Target State)
---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1
      - name: "first_name"
        typeName: "VARCHAR"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 45
        scale: 0
        precision: 45
        nullable: false
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '!='
              value: 'Michael'
              expected: 1
The apply command detects that the first_name column is missing in the actual database. It generates a DDL statement to alter the actor table and add the first_name column.
Outputs:
- A snapshot of the updated database schema is saved in the source directory.
- The executed DDL is logged for reference.
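The detection step in this example can be sketched as a comparison of column names between the two models. The code below is a simplified illustration, not Rosetta's implementation: missing_column_ddl is a hypothetical helper, the models are reduced to the few fields it needs, and the exact DDL text Rosetta emits may differ.

```python
def missing_column_ddl(actual, expected):
    """Return ALTER TABLE statements for columns present only in `expected`."""
    actual_tables = {t["name"]: t for t in actual["tables"]}
    statements = []
    for table in expected["tables"]:
        current = actual_tables.get(table["name"])
        if current is None:
            continue  # creating whole tables is out of scope for this sketch
        existing = {c["name"] for c in current["columns"]}
        for column in table["columns"]:
            if column["name"] in existing:
                continue
            not_null = "" if column["nullable"] else " NOT NULL"
            statements.append(
                f"ALTER TABLE {table['name']} ADD COLUMN {column['name']} "
                f"{column['typeName']}({column['precision']}){not_null};"
            )
    return statements


# Reduced versions of the two models from the example above.
actor_id = {"name": "actor_id", "typeName": "SMALLINT UNSIGNED", "precision": 5, "nullable": False}
first_name = {"name": "first_name", "typeName": "VARCHAR", "precision": 45, "nullable": False}
actual = {"tables": [{"name": "actor", "columns": [actor_id]}]}
expected = {"tables": [{"name": "actor", "columns": [actor_id, first_name]}]}

print(missing_column_ddl(actual, expected))
# ['ALTER TABLE actor ADD COLUMN first_name VARCHAR(45) NOT NULL;']
```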
diff
The diff command shows the differences between the current local model and the state of the database. This can help identify any tables that have been added or removed, or columns that have been modified in the database schema. It’s a valuable tool for tracking schema changes and maintaining consistency between development and production environments.
rosetta [-c, --config CONFIG_FILE] diff [-h, --help] [-s, --source CONNECTION_NAME]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. |
-m, --model MODEL_FILE (Optional) | The model file to use for diff. Default is model.yaml |
Example Output:
When there are differences between the local model and the targeted database schema, diff provides a detailed report, highlighting table and column changes. Below is a sample output from the diff command:
There are changes between local model and targeted source
Table Changed: Table 'actor' columns changed
Column Changed: Column 'actor_id' in table 'actor' changed 'Precision'. New value: '1', old value: '5'
Column Changed: Column 'actor_id' in table 'actor' changed 'Autoincrement'. New value: 'true', old value: 'false'
Column Changed: Column 'actor_id' in table 'actor' changed 'Primary key'. New value: 'false', old value: 'true'
Column Changed: Column 'actor_id' in table 'actor' changed 'Nullable'. New value: 'true', old value: 'false'
Table Added: Table 'address'
Example Command
To use the diff command with the default configuration file and model file, you might run:
rosetta diff -s source_db_connection
In this example:
- The command compares the source_db_connection schema with the specified local model.
- Differences are displayed, such as table and column changes.
Additional Notes
- Usage of --model: When using a model file other than model.yaml, specify it with the --model parameter.
- Table and Column Change Detection: The output categorizes schema differences into table changes, column modifications, and new or removed tables.
- Precision in Changes: Each change specifies old and new values, helping identify unintended modifications or updates needed in the target database.
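To make the column-level report concrete, the sketch below compares two versions of a column attribute by attribute and formats each difference like the sample output. It is an illustration only: column_changes and the LABELS mapping are hypothetical, and which side counts as old or new is left to the caller.

```python
# Model attribute -> label used in the report lines (assumed mapping).
LABELS = {
    "precision": "Precision",
    "autoincrement": "Autoincrement",
    "primaryKey": "Primary key",
    "nullable": "Nullable",
}


def _show(value):
    # Render booleans the way the YAML model writes them (true/false).
    return str(value).lower() if isinstance(value, bool) else str(value)


def column_changes(table, old, new):
    """Return one report line per attribute that differs between old and new."""
    lines = []
    for key, label in LABELS.items():
        if old.get(key) != new.get(key):
            lines.append(
                f"Column Changed: Column '{old['name']}' in table '{table}' "
                f"changed '{label}'. New value: '{_show(new.get(key))}', "
                f"old value: '{_show(old.get(key))}'"
            )
    return lines


old = {"name": "actor_id", "precision": 5, "autoincrement": False, "primaryKey": True, "nullable": False}
new = {"name": "actor_id", "precision": 1, "autoincrement": True, "primaryKey": False, "nullable": True}
for line in column_changes("actor", old, new):
    print(line)
```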
test
This command runs tests on columns using assertions. Each assertion is translated into a query, executed, and its result is compared with an expected value. Currently supported assertions are: equals (=), not equals (!=), less than (<), more than (>), less than or equals (<=), more than or equals (>=), contains (in), is null, is not null, like, between. Examples are shown below:
rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME]
rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection is used to specify which models and connections to use. |
-t, --target CONNECTION_NAME (Optional) | The target connection to use for testing the data. The source test results need to match the values from the target connection. |
Note: Value for BigQuery Array columns should be comma separated value ('a,b,c,d,e').
Example:
---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1
      - name: "first_name"
        typeName: "VARCHAR"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 45
        scale: 0
        precision: 45
        nullable: false
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '!='
              value: 'Michael'
              expected: 1
When running the tests against a target connection, you don't have to specify the expected value.
---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
      - name: "first_name"
        typeName: "VARCHAR"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 45
        scale: 0
        precision: 45
        nullable: false
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '!='
              value: 'Michael'
If you need to override the column expression used in the test query (e.g. for geospatial data), you can use columnDef.
---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1
      - name: "wkt"
        typeName: "GEOMETRY"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 1000000000
        scale: 0
        precision: 1000000000
        columnProperties: []
        nullable: true
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '>'
              value: 434747
              expected: 4
              columnDef: 'ST_AREA(wkt, 1)'
Output example:
Running tests for mysql. Found: 2
1 of 2, RUNNING test ('=') on column: 'actor_id'
1 of 2, FINISHED test on column: 'actor_id' (expected: '1' - actual: '1') ......................... [PASS in 0.288s]
2 of 2, RUNNING test ('!=') on column: 'first_name'
2 of 2, FINISHED test on column: 'first_name' (expected: '1' - actual: '219') ..................... [FAIL in 0.091s]
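The translate, execute, and compare cycle described above can be sketched with an in-memory SQLite database. This is a guess at the general shape rather than Rosetta's actual query: the sketch assumes each assertion becomes a row count, and run_assertion together with its sample data is hypothetical.

```python
import sqlite3

# Only simple comparison operators are covered in this sketch.
ALLOWED_OPERATORS = {"=", "!=", "<", ">", "<=", ">="}


def run_assertion(conn, table, column, operator, value, expected, column_def=None):
    """Count rows matching the assertion and compare the count with expected."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    # column_def, when given, replaces the plain column name in the query.
    target = column_def or column
    # Identifiers come from the trusted model file; the value is bound safely.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {target} {operator} ?"
    actual = conn.execute(sql, (value,)).fetchone()[0]
    return actual == expected, actual


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO actor VALUES (?, ?)",
    [(16, "Fred"), (17, "Michael"), (18, "Helen")],
)

print(run_assertion(conn, "actor", "actor_id", "=", 16, 1))            # (True, 1)
print(run_assertion(conn, "actor", "first_name", "!=", "Michael", 1))  # (False, 2)
```

As in the output example, the second assertion fails because more rows match than the expected value allows.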
dbt
The dbt command generates dbt models based on the DBML (Database Markup Language) extracted from a source database. This DBML should have been generated by the previous (extract) command, providing a foundation for creating structured data transformations within dbt.
rosetta [-c, --config CONFIG_FILE] dbt [-h, --help] [-s, --source CONNECTION_NAME]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection name where models are generated. |
Example Command:
Here’s a basic example command that uses the dbt function:
rosetta dbt -s source_db_connection
This command will:
- Use source_db_connection to locate the DBML generated from the extract command.
- Generate corresponding dbt models that reflect the structure of the source database.
Additional Notes
- Integration with dbt: The generated dbt models allow for scalable and reusable SQL transformations, helping align your data structure with your analytics or ETL workflows.
- Configuration: Ensure that the configuration file (main.conf or specified config) contains accurate connection details for the source database, as it serves as the base for generating dbt models.
generate
This command generates a Spark Python file or a Spark Scala file. It first extracts the schema from the source database and reads the connection properties from the source connection, then creates a Python or Scala file that translates the schema and is ready to transfer data from source to target.
rosetta [-c, --config CONFIG_FILE] generate [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME] [--pyspark] [--scala]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection name to extract schema from. |
-t, --target CONNECTION_NAME | The target connection name where the data will be transferred. |
--pyspark | Generates the Spark SQL file. |
--scala | Generates the Scala SQL file. |
Example Command:
Here’s a basic example command that uses the generate command:
rosetta generate -s source_db_connection -t target_db_connection --pyspark
This command will:
- Connect to the specified source and target databases using the connection details provided.
- Extract the schema from the source.
- Generate a PySpark or Scala script, depending on the selected flag (--pyspark or --scala), which is ready to transfer data from source to target.
Additional Notes
- JDBC Drivers: Ensure you have the correct JDBC drivers for both the source and target databases. These drivers should be specified in spark.driver.extraClassPath.
- Database Configuration: Modify source_jdbc_url, target_jdbc_url, and other connection parameters as per your environment setup.
- Mode Options: The mode("overwrite") option used with .save() will overwrite any existing data in the target table. Change it as needed (e.g., append, ignore, error).
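For orientation, a JDBC-to-JDBC transfer in PySpark usually has the shape sketched below. This is not the script Rosetta generates: jdbc_options and transfer_table are hypothetical helpers, and the Spark session is passed in so the sketch itself does not import pyspark.

```python
def jdbc_options(url, table, user, password):
    """Build the option dictionary for Spark's JDBC data source."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def transfer_table(spark, source, target, mode="overwrite"):
    """Copy one table from source to target through Spark's JDBC data source."""
    frame = spark.read.format("jdbc").options(**source).load()
    # mode may be "overwrite", "append", "ignore", or "error".
    frame.write.format("jdbc").options(**target).mode(mode).save()


# Usage sketch (requires pyspark and the JDBC driver jars; paths are placeholders):
#   from pyspark.sql import SparkSession
#   spark = (SparkSession.builder
#            .config("spark.driver.extraClassPath", "/path/to/drivers/*")
#            .getOrCreate())
#   source = jdbc_options(source_jdbc_url, "actor", "root", "sakila")
#   target = jdbc_options(target_jdbc_url, "actor", "postgres", "sakila")
#   transfer_table(spark, source, target, mode="append")

print(jdbc_options("jdbc:mysql://localhost:3306/sakila", "actor", "root", "sakila"))
```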
query
The query command lets you query your databases in natural language, transforming your request into a SQL SELECT statement. It uses AI and LLMs, specifically OpenAI models, to interpret the request and generate the corresponding SQL query. To use this command, provide your OpenAI API key and specify the OpenAI model to use. The output is written to a CSV file. By default, at most 200 rows are returned; you can override this value or remove the limit completely. The default OpenAI model is gpt-3.5-turbo.
rosetta [-c, --config CONFIG_FILE] query [-h, --help] [-s, --source CONNECTION_NAME] [-q, --query "Natural language QUERY"] [--output "Output DIRECTORY or FILE"]
Parameter | Description |
---|---|
-h, --help | Show the help message and exit. |
-c, --config CONFIG_FILE | YAML config file. If none is supplied it will use main.conf in the current directory if it exists. |
-s, --source CONNECTION_NAME | The source connection is used to specify which models and connection to use. |
-q, --query "Natural language QUERY" | Specifies the natural language query to be transformed into an SQL SELECT statement. |
-l, --limit Response Row limit (Optional) | Limits the number of rows in the generated CSV file. If not specified, the default limit is set to 200 rows. |
--no-limit (Optional) | Specifies that there should be no limit on the number of rows in the generated CSV file. |
Example (Setting the key and model) :
(Config file)
openai_api_key: "sk-abcdefghijklmno1234567890"
openai_model: "gpt-4"
connections:
  - name: mysql
    databaseName: sakila
    schemaName:
    dbType: mysql
    url: jdbc:mysql://root:sakila@localhost:3306/sakila
    userName: root
    password: sakila
  - name: pg
    databaseName: postgres
    schemaName: public
    dbType: postgres
    url: jdbc:postgresql://localhost:5432/postgres?user=postgres&password=sakila
    userName: postgres
    password: sakila
Example (Query)
rosetta query -s mysql -q "Show me the top 10 customers by revenue."
CSV Output Example
customer_name,total_revenue,location,email
John Doe,50000,New York,johndoe@example.com
Jane Smith,45000,Los Angeles,janesmith@example.com
David Johnson,40000,Chicago,davidjohnson@example.com
Emily Brown,35000,San Francisco,emilybrown@example.com
Michael Lee,30000,Miami,michaellee@example.com
Sarah Taylor,25000,Seattle,sarahtaylor@example.com
Robert Clark,20000,Boston,robertclark@example.com
Lisa Martinez,15000,Denver,lisamartinez@example.com
Christopher Anderson,10000,Austin,christopheranderson@example.com
Amanda Wilson,5000,Atlanta,amandawilson@example.com
Note: If a request does not produce a SELECT statement, the query is generated but not executed; it is returned to the user to execute on their own.
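The two rules described for this command, executing only SELECT statements and capping the number of rows written to CSV, can be sketched as follows. The helper run_generated_sql is hypothetical and uses an in-memory SQLite database; how Rosetta itself checks the statement or applies the limit is an assumption here.

```python
import csv
import io
import sqlite3

DEFAULT_LIMIT = 200  # default row limit stated in the command description


def run_generated_sql(conn, sql, limit=DEFAULT_LIMIT):
    """Run sql only if it is a SELECT; return CSV text, or None if skipped."""
    if not sql.lstrip().upper().startswith("SELECT"):
        return None  # non-SELECT statements are handed back, not executed
    cursor = conn.execute(sql)
    # limit=None plays the role of --no-limit in this sketch.
    rows = cursor.fetchall() if limit is None else cursor.fetchmany(limit)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(rows)
    return out.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("John Doe", 50000), ("Jane Smith", 45000), ("David Johnson", 40000)],
)

print(run_generated_sql(conn, "SELECT name FROM customer ORDER BY revenue DESC", limit=2))
print(run_generated_sql(conn, "DELETE FROM customer"))  # None
```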