FAQ + Did You Know?¶
This section gathers a few ‘Frequently Asked Questions’ and ‘Did You Know’ tricks that are useful to learn when using flamy.
Don’t hesitate to come back to this page often to refresh your memory or learn new things.
Frequently Asked Questions¶
Be the first to ask a question!
https://groups.google.com/forum/#!forum/flamy
(But before you do, please make sure this section doesn’t already answer it)
Did you know?¶
Folder architecture¶
When flamy scans the model folder, it looks recursively for folders ending in .db.
This means that you can group your schemas in subfolders if you want.
The only constraint is that the folders corresponding to the tables
must be directly inside the schema (.db) folder.
The table folder may then contain CREATE.hql, POPULATE.hql, VIEW.hql and META.properties files.
You can safely add other types of files in these directories:
they will be ignored by flamy, but we plan to extend the set
of files recognized by flamy in the future.
For instance, this folder structure is allowed:
model
├── schema0.db
│   └── table0
│       ├── CREATE.hql
│       ├── comments.txt
│       └── work_in_progress.hql
├── project_A
│   └── schemaA1.db
│       └── tableA1a
│           └── CREATE.hql
└── project_B
    └── schemaB1.db
        └── tableA21
            └── CREATE.hql
The configuration parameter flamy.model.dir.paths allows you to specify multiple folders,
if you want to separate your projects even more.
Schema properties¶
If you want to create a schema with specific properties (location, comment), you can
add a CREATE_SCHEMA.hql file inside the schema (.db) folder, containing the CREATE statement of your schema.
Flamy will safely ignore the location when dry-running locally.
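As an illustration, such a file could look like the sketch below. The schema name my_schema, the comment and the HDFS path are hypothetical placeholders, and the example assumes the schema name matches the name of its .db folder. It only uses standard Hive CREATE SCHEMA syntax.

```sql
-- Hypothetical contents of model/my_schema.db/CREATE_SCHEMA.hql
-- The schema name, comment and location are placeholders, not flamy defaults.
CREATE SCHEMA my_schema
COMMENT 'Tables for the example project'
LOCATION '/path/to/warehouse/my_schema.db' ;
```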
What about views?¶
Flamy supports views. To create a view with flamy, all you have to do is write a VIEW.hql
file containing the CREATE VIEW statement, instead of a CREATE.hql file.
Views are treated as tables when possible, which means that the show tables and describe tables
commands will correctly list them,
and the push tables command will correctly push them, in the right order.
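A minimal sketch of such a file is shown below. The schema, view, table and column names are hypothetical and only serve to illustrate a standard Hive CREATE VIEW statement placed in a view’s folder.

```sql
-- Hypothetical contents of model/my_schema.db/recent_orders/VIEW.hql
-- my_schema.orders is assumed to be a table defined elsewhere in the model.
CREATE VIEW my_schema.recent_orders AS
SELECT order_id, customer_id, order_date
FROM my_schema.orders
WHERE order_date >= '2017-01-01' ;
```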
Multiple POPULATEs on the same table?¶
Flamy allows you to write multiple queries separated by semicolons (;)
in the same POPULATE.hql file.
In that case, the queries will always be run together and sequentially.
But you can also have multiple POPULATE files, by using a suffix of the form _suffix.
In that case, when possible, flamy will execute all the POPULATE files of a given table in parallel.
For instance, a common pattern in Hive for a table aggregating data from two sources
is to partition it by source and to have one Hive query per source.
In that case, you could write a POPULATE_sourceA.hql
and a POPULATE_sourceB.hql file to keep the two logics separated
and be able to execute both queries in parallel.
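The two-source pattern could be sketched as follows, with one file per source. All schema, table and column names are hypothetical, and the target table is assumed to be partitioned by a string column named source.

```sql
-- Hypothetical contents of model/my_schema.db/events/POPULATE_sourceA.hql
INSERT OVERWRITE TABLE my_schema.events PARTITION (source = 'A')
SELECT event_id, event_time, payload
FROM source_a.raw_events ;

-- Hypothetical contents of model/my_schema.db/events/POPULATE_sourceB.hql
INSERT OVERWRITE TABLE my_schema.events PARTITION (source = 'B')
SELECT event_id, event_time, payload
FROM source_b.raw_events ;
```

Because each query overwrites only its own static partition, the two files do not overwrite each other’s data, which is what makes this pattern a natural fit for parallel execution.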
Presets files¶
For any environment myEnv you have configured (including the ‘model’ environment),
you can set the configuration parameter flamy.env.<ENV>.hive.presets.path
(where <ENV> is the name of the environment) to point to a .hql
file that may contain several commands, which will be performed
before every query session on this environment.
For instance, if your cluster prevents dynamic partitioning by default, you can add
this line to your presets file to enable it for all your queries:
SET hive.exec.dynamic.partition.mode = nonstrict ;
This file is also required to handle custom UDFs, as explained in the next section.
Custom UDFs¶
One of Hive’s main advantages is that it is quite easy to create and use custom UDFs.
If you have custom UDFs, when using the check long or the run --dry commands locally,
you have to make sure that flamy has access to the custom UDF jar and that the functions
are correctly defined in the model presets.
This is how to proceed:
- Set the flamy.udf.classpath configuration parameter to point to the jar(s) containing your custom UDFs.
- Create a PRESETS_model.hql file and set flamy.env.model.hive.presets.path to point to it.
- In the presets file, add one line to create each function you want to use:
CREATE TEMPORARY FUNCTION my_function AS "com.example.hive.udf.GenericUDFMyFunction" ;
What about non-Hive (e.g. Spark) jobs?¶
We all agree that SQL is great at performing some tasks, and very poor at others,
which is why the most complex jobs in our workflow are written as pure-Scala Spark jobs.
To handle these Spark dependencies between two tables, add a file called META.properties
in the destination table folder and list the source tables of your Spark job like this:
dependencies = schema.source_table_1, schema.source_table_2
When displaying the dependency graph with show graph, flamy will now add blue arrows in the graph
to represent these external dependencies.
Unfortunately, for now, flamy is not capable of handling Spark jobs. We usually used a regular scheduler
to populate all the tables required by the Spark job with one flamy run command, then started
the Spark job, and finally populated all the downstream tables with another flamy run command.
Better handling of Spark jobs is one of the new features we would like to develop. However, since Spark is much more permissive than SQL syntax, we know that some features, such as automatic dependency discovery or the dry-run, will be difficult to extend to Spark.
For jobs at the interface between the Hive cluster and other services, we used our regular scheduler, and flamy was no help here. However, some of its features, such as the graph and the dry-run, could be a source of inspiration for designing similar features in a scheduler.