RosettaDB Commands

Available Commands

config: Manage RosettaDB configuration settings.
init: Initialize a new RosettaDB project with required configuration files.
validate: Validate database connections.
drivers: List and manage supported database drivers.
extract: Extract schema metadata from a source database.
compile: Compile DBML models into target DDL statements.
apply: Apply generated DDL to the target database.
diff: Compare and display differences between the DBML model and the database.
test: Run data quality and validation tests against your database.
dbt: Generate dbt models for analytics workflows.
generate: Generate Spark code for data transfers (Python or Scala).
query: Explore and query your data using AI-driven capabilities.

config

Rosetta by default expects the YAML config file to be named main.conf and looks for it by default in the current folder. The configuration file can be overridden by using the --config, -c command line argument (Command Line Arguments below for more details).

Here is the list of available configurations in the main.conf file:

connections:
  # The name of the connection
  - name: bigquery_prod

    # The name of the default database to use
    databaseName: bigquery-public-data

    # The name of the default schema to use
    schemaName: breathe

    # The type of the database
    dbType: bigquery

    # The connection uri for the database
    url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType= [AuthValue];[Property1]=[Value1];[Property2]=[Value2];...

    # The name of the database user
    userName: user

    # The password of the database user
    password: password

    # The name of tables to include which is optional
    tables:
      - table_one
      - table_two

In the YAML config file you can also use environment variables. An example usage of environment variables in config file:

connections:
  - name: snowflake_weather_prod
    databaseName: SNOWFLAKE_SAMPLE_DATA
    schemaName: WEATHER
    dbType: snowflake
    url: jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?<connection_params>
    userName: ${USER}
    password: ${PASSWORD}

Using External Translator and Custom Attributes

RosettaDB supports custom translators and translation attributes, allowing users to define or extend database-specific configurations via external CSV files. - External Translator: Users can specify a custom CSV file for translations by setting the EXTERNAL_TRANSLATION_FILE environment variable. This file allows adjustments in how database schemas are interpreted. - Translation Attributes: Additional attributes like ordinalPosition, autoincrement, nullable, and primaryKey can be defined in a separate attributes CSV file. Set the EXTERNAL_TRANSLATION_ATTRIBUTE_FILE environment variable to the file’s location to apply these attributes. - Indices: Rosetta supports index definitions in databases like Google Cloud Spanner, configured directly in model.yaml files to manage primary and secondary keys effectively.

For detailed setup instructions and examples, refer here.

Safety Operation

In model.yaml you can find the attribute safeMode which is by default disabled (false). If you want to prevent any DROP operation during apply command, set safeMode: true.

Operation level

In model.yaml you can find the attribute operationLevel which is by default set to schema. If you want to apply changes on to database level in your model instead of the specific schema in apply command, set operationLevel: schema.

Fallback Type

In model.yaml you can define the attribute fallbackType for columns that are of custom types, not supported for translations or not included in the translation matrix. If a given column type cannot be translated then the fallbackType will be used for the translation. fallbackType is optional.

Google Cloud Spanner JDBC Fix

Note: If you face one of the following errors with Google Cloud Spanner JDBC

java.sql.SQLException: No suitable driver

or

java.lang.SecurityException: Invalid signature file digest for Manifest main attributes

you can fix it by running the following command where your driver is located:

zip -d google-cloud-spanner-jdbc-2.6.2-single-jar-with-dependencies.jar 'META-INF/.SF' 'META-INF/.RSA' 'META-INF/*SF'

init

This command will generate a project (directory) if specified, a default configuration file located in the current directory with example connections for bigquery and snowflake, and the model directory.

rosetta init [PROJECT_NAME]

Parameter	Description
(Optional) PROJECT_NAME	Project name (directory) where the configuration file and model directory will be created.

Example:

#example with 2 connections
connections:
  - name: snowflake_weather_prod
    databaseName: SNOWFLAKE_SAMPLE_DATA
    schemaName: WEATHER
    dbType: snowflake
    url: jdbc:snowflake://<account_identifier>.snowflakecomputing.com/?<connection_params>
    userName: bob
    password: bobPassword
  - name: bigquery_prod
    databaseName: bigquery-public-data
    schemaName: breathe
    dbType: bigquery
    url: jdbc:bigquery://[Host]:[Port];ProjectId=[Project];OAuthType= [AuthValue];[Property1]=[Value1];[Property2]=[Value2];...
    userName: user
    password: password
    tables:
      - bigquery_table

validate

This command validates the configuration and tests if rosetta can connect to the configured source.

rosetta [-c, --config CONFIG_FILE] validate [-h, --help] [-s, --source CONNECTION_NAME]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection name to extract schema from.

drivers

This command can list drivers that are listed in a drivers.yaml file and by choosing a driver you can download it to the ROSETTA_DRIVERS directory which will be automatically ready to use.

rosetta drivers [-h, --help] [-f, --file] [--list] <indexToDownload> [-dl, --download]

Parameter	Description
-h, --help	Show the help message and exit.
-f, --file DRIVERS_FILE	YAML drivers file path. If none is supplied it will use drivers.yaml in the current directory and then fallback to our default one.
--list	Used to list all available drivers.
-dl, --download	Used to download selected driver by index.
indexToDownload	Chooses which driver to download depending on the index of the driver.

Example (drivers.yaml)

- name: MySQL 8.0.30
  link: https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.30.zip
- name: Postgresql 42.3.7
  link: https://jdbc.postgresql.org/download/postgresql-42.3.7.jar

extract

This is the command that extracts the schema from a database and generates declarative DBML models that can be used for conversion to alternate database targets.

rosetta [-c, --config CONFIG_FILE] extract [-h, --help] [-s, --source CONNECTION_NAME] [-t, --convert-to CONNECTION_NAME]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection name to extract schema from.
-t, --convert-to CONNECTION_NAME (Optional)	The target connection name in which source DBML converts to.

Example:

---
safeMode: false
databaseType: bigquery
operationLevel: database
tables:
- name: "profiles"
  type: "TABLE"
  schema: "breathe"
  columns:
  - name: "id"
    typeName: "INT64"
    jdbcDataType: "4"
    ordinalPosition: 0
    primaryKeySequenceId: 1
    columnDisplaySize: 10
    scale: 0
    precision: 10
    primaryKey: false
    nullable: false
    autoincrement: true
  - name: "name"
    typeName: "STRING"
    jdbcDataType: "12"
    ordinalPosition: 0
    primaryKeySequenceId: 0
    columnDisplaySize: 255
    scale: 0
    precision: 255
    primaryKey: false
    nullable: false
    autoincrement: false

compile

The compile command generates DDL (Data Definition Language) statements for a target database based on the DBML (Database Markup Language) extracted from a source database by the previous (extract) command. It builds schemas and tables in the target database using the extracted database schema.

rosetta [-c, --config CONFIG_FILE] compile [-h, --help] [-t, --target CONNECTION_NAME] [-s, --source CONNECTION_NAME]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME (Optional)	The source connection name where models are generated.
-t, --target CONNECTION_NAME	The target connection name in which source DBML converts to.
-d, --with-drop	Add query to drop tables when generating ddl.

Example:

CREATE SCHEMA breathe;
CREATE TABLE breathe.profiles(id INTEGER not null AUTO_INCREMENT, name STRING not null);

Example Command:

Assuming main.conf is present in your working directory and configured for both source and target connections, a basic usage example is as follows:

rosetta compile -s source_db_connection -t target_db_connection

This command:

Connects to source_db_connection to retrieve DBML data.
Converts the DBML into DDL specific to target_db_connection.

Additional Notes

The --with-drop option should be used with caution, as it will delete existing tables in the target database.
Ensure that the target connection name is correctly set in main.conf or passed directly as a parameter.

apply

The apply command compares the current database state with the model defined in your Rosetta project. It generates the necessary DDL to align the database with the model and applies the changes to the database. If the git_auto_commit setting in main.conf is set to true, Rosetta will also automatically commit and push the updated model to the associated Git repository.

rosetta [-c, --config CONFIG_FILE] apply [-h, --help] [-s, --source CONNECTION_NAME]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection is used to specify which models and connection to use.
-m, --model MODEL_FILE (Optional)	The model file to use for apply. Default is `model.yaml`

Example:

Actual Database (Current State)

---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1

Expected Database (Target State)

---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1
      - name: "first_name"
        typeName: "VARCHAR"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 45
        scale: 0
        precision: 45
        nullable: false
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '!='
              value: 'Michael'
              expected: 1

The apply command detects that the first_name column is missing in the actual database. It generates a DDL statement to alter the actor table and add the first_name column.

Outputs: - A snapshot of the updated database schema is saved in the source directory. - The executed DDL is logged for reference.

diff

The diff command shows the differences between the current local model and the state of the database. This can help identify any tables that have been added or removed, or columns that have been modified in the database schema. It’s a valuable tool for tracking schema changes and maintaining consistency between development and production environments.

rosetta [-c, --config CONFIG_FILE] diff [-h, --help] [-s, --source CONNECTION_NAME]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection is used to specify which models and connection to use.
-m, --model MODEL_FILE (Optional)	The model file to use for apply. Default is `model.yaml`

Example Output:

When there are differences between the local model and the targeted database schema, diff provides a detailed report, highlighting table and column changes. Below is a sample output from the diff command:

There are changes between local model and targeted source
Table Changed: Table 'actor' columns changed
Column Changed: Column 'actor_id' in table 'actor' changed 'Precision'. New value: '1', old value: '5'
Column Changed: Column 'actor_id' in table 'actor' changed 'Autoincrement'. New value: 'true', old value: 'false'
Column Changed: Column 'actor_id' in table 'actor' changed 'Primary key'. New value: 'false', old value: 'true'
Column Changed: Column 'actor_id' in table 'actor' changed 'Nullable'. New value: 'true', old value: 'false'
Table Added: Table 'address'

Example Command

To use the diff command with the default configuration file and model file, you might run:

rosetta -s source_db_connection

In this example:

The command compares the source_db_connection schema with the specified local model.
Differences are displayed, such as table and column changes.

Additional Notes

Usage of --model: When using a specific model file other than model.yaml, specify it with the --model parameter.
Table and Column Change Detection: The output categorizes schema differences into table changes, column modifications, and new or removed tables.
Precision in Changes: Each change specifies old and new values, helping identify unintended modifications or updates needed in the target database.

test

This command runs tests for columns using assertions. Then they are translated into query commands, executed, and compared with an expected value. Currently supported assertions are: equals(=), not equals(!=), less than(<), more than(>), less than or equals(<=), more than or equals(>=), contains(in), is null, is not null, like, between. Examples are shown below:

rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME]

rosetta [-c, --config CONFIG_FILE] test [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection is used to specify which models and connections to use.
-t, --target CONNECTION_NAME (Optional)	The target connection is used to specify the target connection to use for testing the data. The source tests needs to match the values from the tarrget connection.

Note: Value for BigQuery Array columns should be comma separated value ('a,b,c,d,e').

Example:

---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1
      - name: "first_name"
        typeName: "VARCHAR"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 45
        scale: 0
        precision: 45
        nullable: false
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '!='
              value: 'Michael'
              expected: 1

When running the tests against a target connection, you don't have to specify the expected value.

---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
      - name: "first_name"
        typeName: "VARCHAR"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 45
        scale: 0
        precision: 45
        nullable: false
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '!='
              value: 'Michael'

If you need to overwrite the test column query (e.x. for Geospatial data), you can use columnDef.

---
safeMode: false
databaseType: "mysql"
operationLevel: database
tables:
  - name: "actor"
    type: "TABLE"
    columns:
      - name: "actor_id"
        typeName: "SMALLINT UNSIGNED"
        ordinalPosition: 0
        primaryKeySequenceId: 1
        columnDisplaySize: 5
        scale: 0
        precision: 5
        nullable: false
        primaryKey: true
        autoincrement: false
        tests:
          assertion:
            - operator: '='
              value: 16
              expected: 1
      - name: "wkt"
        typeName: "GEOMETRY"
        ordinalPosition: 0
        primaryKeySequenceId: 0
        columnDisplaySize: 1000000000
        scale: 0
        precision: 1000000000
        columnProperties: []
        nullable: true
        primaryKey: false
        autoincrement: false
        tests:
          assertion:
            - operator: '>'
              value: 434747
              expected: 4
              columnDef: 'ST_AREA(wkt, 1)'

Output example:

Running tests for mysql. Found: 2

1 of 2, RUNNING test ('=') on column: 'actor_id'                                                    
1 of 2, FINISHED test on column: 'actor_id' (expected: '1' - actual: '1')  ......................... [PASS in 0.288s]
2 of 2, RUNNING test ('!=') on column: 'first_name'                                                 
2 of 2, FINISHED test on column: 'first_name' (expected: '1' - actual: '219')  ..................... [FAIL in 0.091s]

dbt

The dbt command generates dbt models based on the DBML (Database Markup Language) extracted from a source database. This DBML should have been generated by the previous (extract) command, providing a foundation for creating structured data transformations within dbt.

rosetta [-c, --config CONFIG_FILE] dbt [-h, --help] [-s, --source CONNECTION_NAME]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection name where models are generated.

Example Command:

Here’s a basic example command that uses the dbt function:

rosetta dbt -s source_db_connection

This command will:

Use source_db_connection to locate the DBML generated from the extract command.
Generate corresponding dbt models that reflect the structure of the source database.

Additional Notes

Integration with dbt: The generated dbt models allow for scalable and reusable SQL transformations, helping align your data structure with your analytics or ETL workflows.
Configuration: Ensure that the configuration file (main.conf or specified config) contains accurate connection details for the source database, as it serves as the base for generating dbt models.

generate

This command will generate Spark Python (file) or Spark Scala (file), firstly it extracts a schema from a source database and gets connection properties from the source connection, then it creates a python (file) or scala (file) that translates schemas, which is ready to transfer data from source to target.

rosetta [-c, --config CONFIG_FILE] generate [-h, --help] [-s, --source CONNECTION_NAME] [-t, --target CONNECTION_NAME] [--pyspark] [--scala]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection name to extract schema from.
-t, --target CONNECTION_NAME	The target connection name where the data will be transfered.
--pyspark	Generates the Spark SQL file.
--scala	Generates the Scala SQL file.

Example Command:

Here’s a basic example command that uses the generate function:

rosetta generate -s source_db_connection -t target_db_connection --pyspark

This command will:

Connect to the specified source and target databases using the connection details provided.
Extract the schema from the source.
Generate a PySpark or Scala script, depending on the selected flag (--pyspark or --scala), which is ready to transfer data from source to target.

Additional Notes

JDBC Drivers: Ensure you have the correct JDBC drivers for both the source and target databases. These drivers should be specified in the spark.driver.extraClassPath.
Database Configuration: Modify the source_jdbc_url ,target_jdbc_url, and other connection parameters as per your environment setup.
Mode Options: The mode("overwrite") option in .save() will overwrite any existing data in the target table. Change it as needed (e.g., append, ignore, error).

query

The query command allows you to use natural language commands to query your databases, transforming these commands into SQL SELECT statements. By leveraging the capabilities of AI and LLMs, specifically OpenAI models, it interprets user queries and generates the corresponding SQL queries. For effective use of this command, users need to provide their OpenAI API Key and specify the OpenAI model to be utilized. The output will be written to a CSV file. The max number of rows that will be returned is 200. You can overwrite this value, or remove completely the limit. The default openai model that is used is gpt-3.5-turbo.

rosetta [-c, --config CONFIG_FILE] query [-h, --help] [-s, --source CONNECTION_NAME] [-q, --query "Natural language QUERY"] [--output "Output DIRECTORY or FILE"]

Parameter	Description
-h, --help	Show the help message and exit.
-c, --config CONFIG_FILE	YAML config file. If none is supplied it will use main.conf in the current directory if it exists.
-s, --source CONNECTION_NAME	The source connection is used to specify which models and connection to use.
-q --query "Natural language QUERY"	pecifies the natural language query to be transformed into an SQL SELECT statement.
-l --limit Response Row limit (Optional)	Limits the number of rows in the generated CSV file. If not specified, the default limit is set to 200 rows.
--no-limit (Optional)	Specifies that there should be no limit on the number of rows in the generated CSV file.

Example (Setting the key and model) :

(Config file)

openai_api_key: "sk-abcdefghijklmno1234567890"
openai_model: "gpt-4"
connections:
  - name: mysql
    databaseName: sakila
    schemaName:
    dbType: mysql
    url: jdbc:mysql://root:sakila@localhost:3306/sakila
    userName: root
    password: sakila
  - name: pg
    databaseName: postgres
    schemaName: public
    dbType: postgres
    url: jdbc:postgresql://localhost:5432/postgres?user=postgres&password=sakila
    userName: postgres
    password: sakila

Example (Query)

   rosetta query -s mysql -q "Show me the top 10 customers by revenue."

CSV Output Example

customer_name,total_revenue,location,email
John Doe,50000,New York,johndoe@example.com
Jane Smith,45000,Los Angeles,janesmith@example.com
David Johnson,40000,Chicago,davidjohnson@example.com
Emily Brown,35000,San Francisco,emilybrown@example.com
Michael Lee,30000,Miami,michaellee@example.com
Sarah Taylor,25000,Seattle,sarahtaylor@example.com
Robert Clark,20000,Boston,robertclark@example.com
Lisa Martinez,15000,Denver,lisamartinez@example.com
Christopher Anderson,10000,Austin,christopheranderson@example.com
Amanda Wilson,5000,Atlanta,amandawilson@example.com

Note: When giving a request that will not generate a SELECT statement the query will be generated but will not be executed rather be given to the user to execute on their own.