traildb-0.6+dfsg1/0000700000175000017500000000000013107212256013273 5ustar czchenczchentraildb-0.6+dfsg1/Makefile.am0000600000175000017500000000340613106440271015333 0ustar czchenczchenlib_LTLIBRARIES = libtraildb.la ACLOCAL_AMFLAGS = -I m4 libtraildb_la_CFLAGS = -std=c99 \ -DJUDYERROR=judyerror_macro_missing_fix_this \ -O3 \ -fvisibility=hidden \ -g \ -Wall \ -Wextra \ -Wconversion \ -Wcast-qual \ -Wformat-security \ -Wmissing-declarations \ -Wmissing-prototypes \ -Wnested-externs \ -Wpointer-arith \ -Wshadow \ -Wstrict-prototypes AM_CPPFLAGS = -Isrc/xxhash -Isrc/dsfmt AM_CFLAGS=-O3 -g -fvisibility=hidden libtraildb_la_LIBADD = src/xxhash/xxhash.lo src/dsfmt/dSFMT.lo libtraildb_la_SOURCES = \ src/tdb.c \ src/tdb_cons.c \ src/tdb_uuid.c \ src/tdb_decode.c \ src/tdb_encode.c \ src/tdb_encode_model.c \ src/tdb_queue.c \ src/tdb_huffman.c \ src/tdb_cons_package.c \ src/tdb_package.c \ src/arena.c \ src/judy_str_map.c \ src/judy_128_map.c EXTRA_libtraildb_la_SOURCES = src/xxhash/xxhash.c src/dsfmt/dSFMT.c include_HEADERS = \ src/traildb.h \ src/tdb_error.h \ src/tdb_types.h \ src/tdb_limits.h bin_PROGRAMS = util/traildb_bench tdbcli/tdb util_traildb_bench_SOURCES = util/traildb_bench.c util_traildb_bench_CFLAGS = ${libtraildb_la_CFLAGS} -Isrc/ util_traildb_bench_LDADD = libtraildb.la tdbcli_tdb_CFLAGS = -Isrc/ \ -DJUDYERROR=judyerror_macro_missing_fix_this \ -O3 \ -g \ -Wall tdbcli_tdb_SOURCES = tdbcli/main.c tdbcli/op_dump.c tdbcli/op_make.c tdbcli/jsmn/jsmn.c tdbcli_tdb_LDADD = libtraildb.la traildb-0.6+dfsg1/CHANGELOG.md0000600000175000017500000000635013106440271015111 0ustar czchenczchen ## 0.6 (2017-05-15) ### New features - Select a subset of trails with `tdb cli` using the `--uuids` flag. - Optimized filters that match all events or no events. These can be used to [create a (materialized) view over a subset of trails](http://traildb.io/docs/technical_overview/#whitelist-or-blacklist-trails-a-view-over-a-subset-of-trails). - `tdb merge` supports merging of TrailDBs with mismatching sets of fields. The result is a union of all fields in the source TrailDBs. - `TDB_OPT_CONS_NO_BIGRAMS` option for [tdb_cons_set_opt](http://traildb.io/docs/api/#tdb_cons_set_opt) to disable bigram-based size optimization. This option can sometimes greatly speed up TrailDB creation at the cost of increased filesize. The flag can also be passed to `tdb` CLI as `--no-bigrams`. - Trail-level options: [tdb_set_trail_opt](http://traildb.io/docs/api/#tdb_set_trail_opt). This is especially useful for creating granular views using `TDB_OPT_EVENT_FILTER`. See [Setting Options](http://traildb.io/docs/api/#setting-options). - Time-range term: [query events within a given time-range](http://traildb.io/docs/api/#tdb_event_filter_add_time_range). This simplifies time-series type analyses of trails. Also expanded the filter inspection API to add functions for [counting the number of terms in a clause](http://traildb.io/docs/api/#tdb_event_filter_num_terms), [inspecting the type of a term](http://traildb.io/docs/api/#tdb_event_filter_get_term_type), and [returning the start and end times of a time-range term](http://traildb.io/docs/api/#tdb_event_filter_get_time_range). - Multi-cursors: [join trails over multiple TrailDBs efficiently](http://traildb.io/docs/api/#join-trails-with-multi-cursors). This is a convenient way to stich together e.g. time-sharded TrailDBs or merge together user profiles stored under separate UUIDs. - Item index for `tdb` CLI. 
This can speed up `--filter` expressions that access infrequent items significantly. - Added [tdb_event_filter_num_clauses](http://traildb.io/docs/api/#tdb_event_filter_num_clauses) and [tdb_event_filter_get_item](http://traildb.io/docs/api/#tdb_event_filter_get_item) for reading items and clauses in an existing filter. - `TDB_OPT_EVENT_FILTER` option for `tdb_set_opt` which can be used to create [views](http://traildb.io/docs/technical_overview/#return-a-subset-of-events-with-event-filters) or [materialized views](http://traildb.io/docs/technical_overview/#create-traildb-extracts-materialized-views). - `--filter` flag for the `tdb` CLI: Define event filters on the command line for easy grepping of events. Also added `--verbose` flag for troubleshooting filters. - `tdb merge` for `tdb` CLI. This operation is used to merge multipled tdbs into a single tdb. - Added a `brew` package for OS X. - Added installation instructions for FreeBSD. ### Bugfixes - Make opening single-file tdbs thread-safe. - Fix handling of empty values in `tdb_cons_append`. - Fix handling of disk full situations in `tdb_cons_append`. - Fix semantics of how `TDB_OPT_EVENT_FILTER` filters are applied to cursors. Now the changes are applied at every call to `tdb_get_trail`, not at the creation of the cursor. ## 0.5 (2016-05-24) Initial open-source release. traildb-0.6+dfsg1/src/0000700000175000017500000000000013106440271014061 5ustar czchenczchentraildb-0.6+dfsg1/src/dsfmt/0000700000175000017500000000000013106440271015176 5ustar czchenczchentraildb-0.6+dfsg1/src/dsfmt/dSFMT.c0000600000175000017500000004657513106440271016302 0ustar czchenczchen/** * @file dSFMT.c * @brief double precision SIMD-oriented Fast Mersenne Twister (dSFMT) * based on IEEE 754 format. * * @author Mutsuo Saito (Hiroshima University) * @author Makoto Matsumoto (Hiroshima University) * * Copyright (C) 2007,2008 Mutsuo Saito, Makoto Matsumoto and Hiroshima * University. All rights reserved. * * The new BSD License is applied to this software, see LICENSE.txt */ #include #include #include #include "dSFMT-params.h" #include "dSFMT-common.h" #if defined(__cplusplus) extern "C" { #endif /** dsfmt internal state vector */ dsfmt_t dsfmt_global_data; /** dsfmt mexp for check */ static const int dsfmt_mexp = DSFMT_MEXP; /*---------------- STATIC FUNCTIONS ----------------*/ inline static uint32_t ini_func1(uint32_t x); inline static uint32_t ini_func2(uint32_t x); inline static void gen_rand_array_c1o2(dsfmt_t *dsfmt, w128_t *array, int size); inline static void gen_rand_array_c0o1(dsfmt_t *dsfmt, w128_t *array, int size); inline static void gen_rand_array_o0c1(dsfmt_t *dsfmt, w128_t *array, int size); inline static void gen_rand_array_o0o1(dsfmt_t *dsfmt, w128_t *array, int size); inline static int idxof(int i); static void initial_mask(dsfmt_t *dsfmt); static void period_certification(dsfmt_t *dsfmt); #if defined(HAVE_SSE2) /** 1 in 64bit for sse2 */ static const union X128I_T sse2_int_one = {{1, 1}}; /** 2.0 double for sse2 */ static const union X128D_T sse2_double_two = {{2.0, 2.0}}; /** -1.0 double for sse2 */ static const union X128D_T sse2_double_m_one = {{-1.0, -1.0}}; #endif /** * This function simulate a 32-bit array index overlapped to 64-bit * array of LITTLE ENDIAN in BIG ENDIAN machine. 
*/ #if defined(DSFMT_BIG_ENDIAN) inline static int idxof(int i) { return i ^ 1; } #else inline static int idxof(int i) { return i; } #endif #if defined(HAVE_SSE2) /** * This function converts the double precision floating point numbers which * distribute uniformly in the range [1, 2) to those which distribute uniformly * in the range [0, 1). * @param w 128bit stracture of double precision floating point numbers (I/O) */ inline static void convert_c0o1(w128_t *w) { w->sd = _mm_add_pd(w->sd, sse2_double_m_one.d128); } /** * This function converts the double precision floating point numbers which * distribute uniformly in the range [1, 2) to those which distribute uniformly * in the range (0, 1]. * @param w 128bit stracture of double precision floating point numbers (I/O) */ inline static void convert_o0c1(w128_t *w) { w->sd = _mm_sub_pd(sse2_double_two.d128, w->sd); } /** * This function converts the double precision floating point numbers which * distribute uniformly in the range [1, 2) to those which distribute uniformly * in the range (0, 1). * @param w 128bit stracture of double precision floating point numbers (I/O) */ inline static void convert_o0o1(w128_t *w) { w->si = _mm_or_si128(w->si, sse2_int_one.i128); w->sd = _mm_add_pd(w->sd, sse2_double_m_one.d128); } #else /* standard C and altivec */ /** * This function converts the double precision floating point numbers which * distribute uniformly in the range [1, 2) to those which distribute uniformly * in the range [0, 1). * @param w 128bit stracture of double precision floating point numbers (I/O) */ inline static void convert_c0o1(w128_t *w) { w->d[0] -= 1.0; w->d[1] -= 1.0; } /** * This function converts the double precision floating point numbers which * distribute uniformly in the range [1, 2) to those which distribute uniformly * in the range (0, 1]. * @param w 128bit stracture of double precision floating point numbers (I/O) */ inline static void convert_o0c1(w128_t *w) { w->d[0] = 2.0 - w->d[0]; w->d[1] = 2.0 - w->d[1]; } /** * This function converts the double precision floating point numbers which * distribute uniformly in the range [1, 2) to those which distribute uniformly * in the range (0, 1). * @param w 128bit stracture of double precision floating point numbers (I/O) */ inline static void convert_o0o1(w128_t *w) { w->u[0] |= 1; w->u[1] |= 1; w->d[0] -= 1.0; w->d[1] -= 1.0; } #endif /** * This function fills the user-specified array with double precision * floating point pseudorandom numbers of the IEEE 754 format. * @param dsfmt dsfmt state vector. * @param array an 128-bit array to be filled by pseudorandom numbers. * @param size number of 128-bit pseudorandom numbers to be generated. 
*/ inline static void gen_rand_array_c1o2(dsfmt_t *dsfmt, w128_t *array, int size) { int i, j; w128_t lung; lung = dsfmt->status[DSFMT_N]; do_recursion(&array[0], &dsfmt->status[0], &dsfmt->status[DSFMT_POS1], &lung); for (i = 1; i < DSFMT_N - DSFMT_POS1; i++) { do_recursion(&array[i], &dsfmt->status[i], &dsfmt->status[i + DSFMT_POS1], &lung); } for (; i < DSFMT_N; i++) { do_recursion(&array[i], &dsfmt->status[i], &array[i + DSFMT_POS1 - DSFMT_N], &lung); } for (; i < size - DSFMT_N; i++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); } for (j = 0; j < 2 * DSFMT_N - size; j++) { dsfmt->status[j] = array[j + size - DSFMT_N]; } for (; i < size; i++, j++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); dsfmt->status[j] = array[i]; } dsfmt->status[DSFMT_N] = lung; } /** * This function fills the user-specified array with double precision * floating point pseudorandom numbers of the IEEE 754 format. * @param dsfmt dsfmt state vector. * @param array an 128-bit array to be filled by pseudorandom numbers. * @param size number of 128-bit pseudorandom numbers to be generated. */ inline static void gen_rand_array_c0o1(dsfmt_t *dsfmt, w128_t *array, int size) { int i, j; w128_t lung; lung = dsfmt->status[DSFMT_N]; do_recursion(&array[0], &dsfmt->status[0], &dsfmt->status[DSFMT_POS1], &lung); for (i = 1; i < DSFMT_N - DSFMT_POS1; i++) { do_recursion(&array[i], &dsfmt->status[i], &dsfmt->status[i + DSFMT_POS1], &lung); } for (; i < DSFMT_N; i++) { do_recursion(&array[i], &dsfmt->status[i], &array[i + DSFMT_POS1 - DSFMT_N], &lung); } for (; i < size - DSFMT_N; i++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); convert_c0o1(&array[i - DSFMT_N]); } for (j = 0; j < 2 * DSFMT_N - size; j++) { dsfmt->status[j] = array[j + size - DSFMT_N]; } for (; i < size; i++, j++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); dsfmt->status[j] = array[i]; convert_c0o1(&array[i - DSFMT_N]); } for (i = size - DSFMT_N; i < size; i++) { convert_c0o1(&array[i]); } dsfmt->status[DSFMT_N] = lung; } /** * This function fills the user-specified array with double precision * floating point pseudorandom numbers of the IEEE 754 format. * @param dsfmt dsfmt state vector. * @param array an 128-bit array to be filled by pseudorandom numbers. * @param size number of 128-bit pseudorandom numbers to be generated. 
*/ inline static void gen_rand_array_o0o1(dsfmt_t *dsfmt, w128_t *array, int size) { int i, j; w128_t lung; lung = dsfmt->status[DSFMT_N]; do_recursion(&array[0], &dsfmt->status[0], &dsfmt->status[DSFMT_POS1], &lung); for (i = 1; i < DSFMT_N - DSFMT_POS1; i++) { do_recursion(&array[i], &dsfmt->status[i], &dsfmt->status[i + DSFMT_POS1], &lung); } for (; i < DSFMT_N; i++) { do_recursion(&array[i], &dsfmt->status[i], &array[i + DSFMT_POS1 - DSFMT_N], &lung); } for (; i < size - DSFMT_N; i++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); convert_o0o1(&array[i - DSFMT_N]); } for (j = 0; j < 2 * DSFMT_N - size; j++) { dsfmt->status[j] = array[j + size - DSFMT_N]; } for (; i < size; i++, j++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); dsfmt->status[j] = array[i]; convert_o0o1(&array[i - DSFMT_N]); } for (i = size - DSFMT_N; i < size; i++) { convert_o0o1(&array[i]); } dsfmt->status[DSFMT_N] = lung; } /** * This function fills the user-specified array with double precision * floating point pseudorandom numbers of the IEEE 754 format. * @param dsfmt dsfmt state vector. * @param array an 128-bit array to be filled by pseudorandom numbers. * @param size number of 128-bit pseudorandom numbers to be generated. */ inline static void gen_rand_array_o0c1(dsfmt_t *dsfmt, w128_t *array, int size) { int i, j; w128_t lung; lung = dsfmt->status[DSFMT_N]; do_recursion(&array[0], &dsfmt->status[0], &dsfmt->status[DSFMT_POS1], &lung); for (i = 1; i < DSFMT_N - DSFMT_POS1; i++) { do_recursion(&array[i], &dsfmt->status[i], &dsfmt->status[i + DSFMT_POS1], &lung); } for (; i < DSFMT_N; i++) { do_recursion(&array[i], &dsfmt->status[i], &array[i + DSFMT_POS1 - DSFMT_N], &lung); } for (; i < size - DSFMT_N; i++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); convert_o0c1(&array[i - DSFMT_N]); } for (j = 0; j < 2 * DSFMT_N - size; j++) { dsfmt->status[j] = array[j + size - DSFMT_N]; } for (; i < size; i++, j++) { do_recursion(&array[i], &array[i - DSFMT_N], &array[i + DSFMT_POS1 - DSFMT_N], &lung); dsfmt->status[j] = array[i]; convert_o0c1(&array[i - DSFMT_N]); } for (i = size - DSFMT_N; i < size; i++) { convert_o0c1(&array[i]); } dsfmt->status[DSFMT_N] = lung; } /** * This function represents a function used in the initialization * by init_by_array * @param x 32-bit integer * @return 32-bit integer */ static uint32_t ini_func1(uint32_t x) { return (x ^ (x >> 27)) * (uint32_t)1664525UL; } /** * This function represents a function used in the initialization * by init_by_array * @param x 32-bit integer * @return 32-bit integer */ static uint32_t ini_func2(uint32_t x) { return (x ^ (x >> 27)) * (uint32_t)1566083941UL; } /** * This function initializes the internal state array to fit the IEEE * 754 format. * @param dsfmt dsfmt state vector. */ static void initial_mask(dsfmt_t *dsfmt) { int i; uint64_t *psfmt; psfmt = &dsfmt->status[0].u[0]; for (i = 0; i < DSFMT_N * 2; i++) { psfmt[i] = (psfmt[i] & DSFMT_LOW_MASK) | DSFMT_HIGH_CONST; } } /** * This function certificate the period of 2^{SFMT_MEXP}-1. * @param dsfmt dsfmt state vector. 
*/ static void period_certification(dsfmt_t *dsfmt) { uint64_t pcv[2] = {DSFMT_PCV1, DSFMT_PCV2}; uint64_t tmp[2]; uint64_t inner; int i; #if (DSFMT_PCV2 & 1) != 1 int j; uint64_t work; #endif tmp[0] = (dsfmt->status[DSFMT_N].u[0] ^ DSFMT_FIX1); tmp[1] = (dsfmt->status[DSFMT_N].u[1] ^ DSFMT_FIX2); inner = tmp[0] & pcv[0]; inner ^= tmp[1] & pcv[1]; for (i = 32; i > 0; i >>= 1) { inner ^= inner >> i; } inner &= 1; /* check OK */ if (inner == 1) { return; } /* check NG, and modification */ #if (DSFMT_PCV2 & 1) == 1 dsfmt->status[DSFMT_N].u[1] ^= 1; #else for (i = 1; i >= 0; i--) { work = 1; for (j = 0; j < 64; j++) { if ((work & pcv[i]) != 0) { dsfmt->status[DSFMT_N].u[i] ^= work; return; } work = work << 1; } } #endif return; } /*---------------- PUBLIC FUNCTIONS ----------------*/ /** * This function returns the identification string. The string shows * the Mersenne exponent, and all parameters of this generator. * @return id string. */ const char *dsfmt_get_idstring(void) { return DSFMT_IDSTR; } /** * This function returns the minimum size of array used for \b * fill_array functions. * @return minimum size of array used for fill_array functions. */ int dsfmt_get_min_array_size(void) { return DSFMT_N64; } /** * This function fills the internal state array with double precision * floating point pseudorandom numbers of the IEEE 754 format. * @param dsfmt dsfmt state vector. */ void dsfmt_gen_rand_all(dsfmt_t *dsfmt) { int i; w128_t lung; lung = dsfmt->status[DSFMT_N]; do_recursion(&dsfmt->status[0], &dsfmt->status[0], &dsfmt->status[DSFMT_POS1], &lung); for (i = 1; i < DSFMT_N - DSFMT_POS1; i++) { do_recursion(&dsfmt->status[i], &dsfmt->status[i], &dsfmt->status[i + DSFMT_POS1], &lung); } for (; i < DSFMT_N; i++) { do_recursion(&dsfmt->status[i], &dsfmt->status[i], &dsfmt->status[i + DSFMT_POS1 - DSFMT_N], &lung); } dsfmt->status[DSFMT_N] = lung; } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range [1, 2) to the * specified array[] by one call. The number of pseudorandom numbers * is specified by the argument \b size, which must be at least (SFMT_MEXP * / 128) * 2 and a multiple of two. The function * get_min_array_size() returns this minimum size. The generation by * this function is much faster than the following fill_array_xxx functions. * * For initialization, init_gen_rand() or init_by_array() must be called * before the first call of this function. This function can not be * used after calling genrand_xxx functions, without initialization. * * @param dsfmt dsfmt state vector. * @param array an array where pseudorandom numbers are filled * by this function. The pointer to the array must be "aligned" * (namely, must be a multiple of 16) in the SIMD version, since it * refers to the address of a 128-bit integer. In the standard C * version, the pointer is arbitrary. * * @param size the number of 64-bit pseudorandom integers to be * generated. size must be a multiple of 2, and greater than or equal * to (SFMT_MEXP / 128) * 2. * * @note \b memalign or \b posix_memalign is available to get aligned * memory. Mac OSX doesn't have these functions, but \b malloc of OSX * returns the pointer to the aligned memory block. 
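 *
 * A minimal usage sketch (added for illustration, not part of the original
 * dSFMT sources; it assumes the default DSFMT_MEXP of 19937, for which
 * dsfmt_get_min_array_size() returns 382, and that posix_memalign() is
 * available to obtain the 16-byte alignment needed by the SIMD build;
 * error handling is omitted for brevity):
 * @verbatim
   #define DSFMT_MEXP 19937
   #include <stdlib.h>
   #include "dSFMT.h"

   dsfmt_t state;
   double *buf;
   int n = 384;                                    // even and >= 382
   posix_memalign((void **)&buf, 16, n * sizeof(double));
   dsfmt_init_gen_rand(&state, 1234);              // seed before the first fill
   dsfmt_fill_array_close1_open2(&state, buf, n);  // fills buf with values in [1, 2)
   free(buf);
   @endverbatim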
*/ void dsfmt_fill_array_close1_open2(dsfmt_t *dsfmt, double array[], int size) { assert(size % 2 == 0); assert(size >= DSFMT_N64); gen_rand_array_c1o2(dsfmt, (w128_t *)array, size / 2); } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range (0, 1] to the * specified array[] by one call. This function is the same as * fill_array_close1_open2() except the distribution range. * * @param dsfmt dsfmt state vector. * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa fill_array_close1_open2() */ void dsfmt_fill_array_open_close(dsfmt_t *dsfmt, double array[], int size) { assert(size % 2 == 0); assert(size >= DSFMT_N64); gen_rand_array_o0c1(dsfmt, (w128_t *)array, size / 2); } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range [0, 1) to the * specified array[] by one call. This function is the same as * fill_array_close1_open2() except the distribution range. * * @param array an array where pseudorandom numbers are filled * by this function. * @param dsfmt dsfmt state vector. * @param size the number of pseudorandom numbers to be generated. * see also \sa fill_array_close1_open2() */ void dsfmt_fill_array_close_open(dsfmt_t *dsfmt, double array[], int size) { assert(size % 2 == 0); assert(size >= DSFMT_N64); gen_rand_array_c0o1(dsfmt, (w128_t *)array, size / 2); } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range (0, 1) to the * specified array[] by one call. This function is the same as * fill_array_close1_open2() except the distribution range. * * @param dsfmt dsfmt state vector. * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa fill_array_close1_open2() */ void dsfmt_fill_array_open_open(dsfmt_t *dsfmt, double array[], int size) { assert(size % 2 == 0); assert(size >= DSFMT_N64); gen_rand_array_o0o1(dsfmt, (w128_t *)array, size / 2); } #if defined(__INTEL_COMPILER) # pragma warning(disable:981) #endif /** * This function initializes the internal state array with a 32-bit * integer seed. * @param dsfmt dsfmt state vector. * @param seed a 32-bit integer used as the seed. * @param mexp caller's mersenne expornent */ void dsfmt_chk_init_gen_rand(dsfmt_t *dsfmt, uint32_t seed, int mexp) { int i; uint32_t *psfmt; /* make sure caller program is compiled with the same MEXP */ if (mexp != dsfmt_mexp) { fprintf(stderr, "DSFMT_MEXP doesn't match with dSFMT.c\n"); exit(1); } psfmt = &dsfmt->status[0].u32[0]; psfmt[idxof(0)] = seed; for (i = 1; i < (DSFMT_N + 1) * 4; i++) { psfmt[idxof(i)] = 1812433253UL * (psfmt[idxof(i - 1)] ^ (psfmt[idxof(i - 1)] >> 30)) + i; } initial_mask(dsfmt); period_certification(dsfmt); dsfmt->idx = DSFMT_N64; } /** * This function initializes the internal state array, * with an array of 32-bit integers used as the seeds * @param dsfmt dsfmt state vector. * @param init_key the array of 32-bit integers, used as a seed. * @param key_length the length of init_key. 
* @param mexp caller's mersenne expornent */ void dsfmt_chk_init_by_array(dsfmt_t *dsfmt, uint32_t init_key[], int key_length, int mexp) { int i, j, count; uint32_t r; uint32_t *psfmt32; int lag; int mid; int size = (DSFMT_N + 1) * 4; /* pulmonary */ /* make sure caller program is compiled with the same MEXP */ if (mexp != dsfmt_mexp) { fprintf(stderr, "DSFMT_MEXP doesn't match with dSFMT.c\n"); exit(1); } if (size >= 623) { lag = 11; } else if (size >= 68) { lag = 7; } else if (size >= 39) { lag = 5; } else { lag = 3; } mid = (size - lag) / 2; psfmt32 = &dsfmt->status[0].u32[0]; memset(dsfmt->status, 0x8b, sizeof(dsfmt->status)); if (key_length + 1 > size) { count = key_length + 1; } else { count = size; } r = ini_func1(psfmt32[idxof(0)] ^ psfmt32[idxof(mid % size)] ^ psfmt32[idxof((size - 1) % size)]); psfmt32[idxof(mid % size)] += r; r += key_length; psfmt32[idxof((mid + lag) % size)] += r; psfmt32[idxof(0)] = r; count--; for (i = 1, j = 0; (j < count) && (j < key_length); j++) { r = ini_func1(psfmt32[idxof(i)] ^ psfmt32[idxof((i + mid) % size)] ^ psfmt32[idxof((i + size - 1) % size)]); psfmt32[idxof((i + mid) % size)] += r; r += init_key[j] + i; psfmt32[idxof((i + mid + lag) % size)] += r; psfmt32[idxof(i)] = r; i = (i + 1) % size; } for (; j < count; j++) { r = ini_func1(psfmt32[idxof(i)] ^ psfmt32[idxof((i + mid) % size)] ^ psfmt32[idxof((i + size - 1) % size)]); psfmt32[idxof((i + mid) % size)] += r; r += i; psfmt32[idxof((i + mid + lag) % size)] += r; psfmt32[idxof(i)] = r; i = (i + 1) % size; } for (j = 0; j < size; j++) { r = ini_func2(psfmt32[idxof(i)] + psfmt32[idxof((i + mid) % size)] + psfmt32[idxof((i + size - 1) % size)]); psfmt32[idxof((i + mid) % size)] ^= r; r -= i; psfmt32[idxof((i + mid + lag) % size)] ^= r; psfmt32[idxof(i)] = r; i = (i + 1) % size; } initial_mask(dsfmt); period_certification(dsfmt); dsfmt->idx = DSFMT_N64; } #if defined(__INTEL_COMPILER) # pragma warning(default:981) #endif #if defined(__cplusplus) } #endif traildb-0.6+dfsg1/src/dsfmt/dSFMT-params.h0000600000175000017500000000460513106440271017554 0ustar czchenczchen#ifndef DSFMT_PARAMS_H #define DSFMT_PARAMS_H #include "dSFMT.h" /*---------------------- the parameters of DSFMT following definitions are in dSFMT-paramsXXXX.h file. ----------------------*/ /** the pick up position of the array. #define DSFMT_POS1 122 */ /** the parameter of shift left as four 32-bit registers. #define DSFMT_SL1 18 */ /** the parameter of shift right as four 32-bit registers. #define DSFMT_SR1 12 */ /** A bitmask, used in the recursion. These parameters are introduced * to break symmetry of SIMD. #define DSFMT_MSK1 (uint64_t)0xdfffffefULL #define DSFMT_MSK2 (uint64_t)0xddfecb7fULL */ /** These definitions are part of a 128-bit period certification vector. 
#define DSFMT_PCV1 UINT64_C(0x00000001) #define DSFMT_PCV2 UINT64_C(0x00000000) */ #define DSFMT_LOW_MASK UINT64_C(0x000FFFFFFFFFFFFF) #define DSFMT_HIGH_CONST UINT64_C(0x3FF0000000000000) #define DSFMT_SR 12 /* for sse2 */ #if defined(HAVE_SSE2) #define SSE2_SHUFF 0x1b #elif defined(HAVE_ALTIVEC) #if defined(__APPLE__) /* For OSX */ #define ALTI_SR (vector unsigned char)(4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4) #define ALTI_SR_PERM \ (vector unsigned char)(15,0,1,2,3,4,5,6,15,8,9,10,11,12,13,14) #define ALTI_SR_MSK \ (vector unsigned int)(0x000fffffU,0xffffffffU,0x000fffffU,0xffffffffU) #define ALTI_PERM \ (vector unsigned char)(12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3) #else #define ALTI_SR {4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4} #define ALTI_SR_PERM {15,0,1,2,3,4,5,6,15,8,9,10,11,12,13,14} #define ALTI_SR_MSK {0x000fffffU,0xffffffffU,0x000fffffU,0xffffffffU} #define ALTI_PERM {12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3} #endif #endif #if DSFMT_MEXP == 521 #include "dSFMT-params521.h" #elif DSFMT_MEXP == 1279 #include "dSFMT-params1279.h" #elif DSFMT_MEXP == 2203 #include "dSFMT-params2203.h" #elif DSFMT_MEXP == 4253 #include "dSFMT-params4253.h" #elif DSFMT_MEXP == 11213 #include "dSFMT-params11213.h" #elif DSFMT_MEXP == 19937 #include "dSFMT-params19937.h" #elif DSFMT_MEXP == 44497 #include "dSFMT-params44497.h" #elif DSFMT_MEXP == 86243 #include "dSFMT-params86243.h" #elif DSFMT_MEXP == 132049 #include "dSFMT-params132049.h" #elif DSFMT_MEXP == 216091 #include "dSFMT-params216091.h" #else #ifdef __GNUC__ #error "DSFMT_MEXP is not valid." #undef DSFMT_MEXP #else #undef DSFMT_MEXP #endif #endif #endif /* DSFMT_PARAMS_H */ traildb-0.6+dfsg1/src/dsfmt/dSFMT-common.h0000600000175000017500000000663213106440271017563 0ustar czchenczchen#pragma once /** * @file dSFMT-common.h * * @brief SIMD oriented Fast Mersenne Twister(SFMT) pseudorandom * number generator with jump function. This file includes common functions * used in random number generation and jump. * * @author Mutsuo Saito (Hiroshima University) * @author Makoto Matsumoto (The University of Tokyo) * * Copyright (C) 2006, 2007 Mutsuo Saito, Makoto Matsumoto and Hiroshima * University. * Copyright (C) 2012 Mutsuo Saito, Makoto Matsumoto, Hiroshima * University and The University of Tokyo. * All rights reserved. 
* * The 3-clause BSD License is applied to this software, see * LICENSE.txt */ #ifndef DSFMT_COMMON_H #define DSFMT_COMMON_H #include "dSFMT.h" #if defined(HAVE_SSE2) # include union X128I_T { uint64_t u[2]; __m128i i128; }; union X128D_T { double d[2]; __m128d d128; }; /** mask data for sse2 */ static const union X128I_T sse2_param_mask = {{DSFMT_MSK1, DSFMT_MSK2}}; #endif #if defined(HAVE_ALTIVEC) inline static void do_recursion(w128_t *r, w128_t *a, w128_t * b, w128_t *lung) { const vector unsigned char sl1 = ALTI_SL1; const vector unsigned char sl1_perm = ALTI_SL1_PERM; const vector unsigned int sl1_msk = ALTI_SL1_MSK; const vector unsigned char sr1 = ALTI_SR; const vector unsigned char sr1_perm = ALTI_SR_PERM; const vector unsigned int sr1_msk = ALTI_SR_MSK; const vector unsigned char perm = ALTI_PERM; const vector unsigned int msk1 = ALTI_MSK; vector unsigned int w, x, y, z; z = a->s; w = lung->s; x = vec_perm(w, (vector unsigned int)perm, perm); y = vec_perm(z, (vector unsigned int)sl1_perm, sl1_perm); y = vec_sll(y, sl1); y = vec_and(y, sl1_msk); w = vec_xor(x, b->s); w = vec_xor(w, y); x = vec_perm(w, (vector unsigned int)sr1_perm, sr1_perm); x = vec_srl(x, sr1); x = vec_and(x, sr1_msk); y = vec_and(w, msk1); z = vec_xor(z, y); r->s = vec_xor(z, x); lung->s = w; } #elif defined(HAVE_SSE2) /** * This function represents the recursion formula. * @param r output 128-bit * @param a a 128-bit part of the internal state array * @param b a 128-bit part of the internal state array * @param d a 128-bit part of the internal state array (I/O) */ inline static void do_recursion(w128_t *r, w128_t *a, w128_t *b, w128_t *u) { __m128i v, w, x, y, z; x = a->si; z = _mm_slli_epi64(x, DSFMT_SL1); y = _mm_shuffle_epi32(u->si, SSE2_SHUFF); z = _mm_xor_si128(z, b->si); y = _mm_xor_si128(y, z); v = _mm_srli_epi64(y, DSFMT_SR); w = _mm_and_si128(y, sse2_param_mask.i128); v = _mm_xor_si128(v, x); v = _mm_xor_si128(v, w); r->si = v; u->si = y; } #else /** * This function represents the recursion formula. 
* @param r output 128-bit * @param a a 128-bit part of the internal state array * @param b a 128-bit part of the internal state array * @param lung a 128-bit part of the internal state array (I/O) */ inline static void do_recursion(w128_t *r, w128_t *a, w128_t * b, w128_t *lung) { uint64_t t0, t1, L0, L1; t0 = a->u[0]; t1 = a->u[1]; L0 = lung->u[0]; L1 = lung->u[1]; lung->u[0] = (t0 << DSFMT_SL1) ^ (L1 >> 32) ^ (L1 << 32) ^ b->u[0]; lung->u[1] = (t1 << DSFMT_SL1) ^ (L0 >> 32) ^ (L0 << 32) ^ b->u[1]; r->u[0] = (lung->u[0] >> DSFMT_SR) ^ (lung->u[0] & DSFMT_MSK1) ^ t0; r->u[1] = (lung->u[1] >> DSFMT_SR) ^ (lung->u[1] & DSFMT_MSK2) ^ t1; } #endif #endif traildb-0.6+dfsg1/src/dsfmt/dSFMT-params521.h0000600000175000017500000000265413106440271020006 0ustar czchenczchen#ifndef DSFMT_PARAMS521_H #define DSFMT_PARAMS521_H /* #define DSFMT_N 4 */ /* #define DSFMT_MAXDEGREE 544 */ #define DSFMT_POS1 3 #define DSFMT_SL1 25 #define DSFMT_MSK1 UINT64_C(0x000fbfefff77efff) #define DSFMT_MSK2 UINT64_C(0x000ffeebfbdfbfdf) #define DSFMT_MSK32_1 0x000fbfefU #define DSFMT_MSK32_2 0xff77efffU #define DSFMT_MSK32_3 0x000ffeebU #define DSFMT_MSK32_4 0xfbdfbfdfU #define DSFMT_FIX1 UINT64_C(0xcfb393d661638469) #define DSFMT_FIX2 UINT64_C(0xc166867883ae2adb) #define DSFMT_PCV1 UINT64_C(0xccaa588000000000) #define DSFMT_PCV2 UINT64_C(0x0000000000000001) #define DSFMT_IDSTR "dSFMT2-521:3-25:fbfefff77efff-ffeebfbdfbfdf" /* PARAMETERS FOR ALTIVEC */ #if defined(__APPLE__) /* For OSX */ #define ALTI_SL1 (vector unsigned char)(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1) #define ALTI_SL1_PERM \ (vector unsigned char)(3,4,5,6,7,29,29,29,11,12,13,14,15,0,1,2) #define ALTI_SL1_MSK \ (vector unsigned int)(0xffffffffU,0xfe000000U,0xffffffffU,0xfe000000U) #define ALTI_MSK (vector unsigned int)(DSFMT_MSK32_1, \ DSFMT_MSK32_2, DSFMT_MSK32_3, DSFMT_MSK32_4) #else /* For OTHER OSs(Linux?) */ #define ALTI_SL1 {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} #define ALTI_SL1_PERM \ {3,4,5,6,7,29,29,29,11,12,13,14,15,0,1,2} #define ALTI_SL1_MSK \ {0xffffffffU,0xfe000000U,0xffffffffU,0xfe000000U} #define ALTI_MSK \ {DSFMT_MSK32_1, DSFMT_MSK32_2, DSFMT_MSK32_3, DSFMT_MSK32_4} #endif #endif /* DSFMT_PARAMS521_H */ traildb-0.6+dfsg1/src/dsfmt/dSFMT.h0000600000175000017500000005315413106440271016276 0ustar czchenczchen#pragma once /** * @file dSFMT.h * * @brief double precision SIMD oriented Fast Mersenne Twister(dSFMT) * pseudorandom number generator based on IEEE 754 format. * * @author Mutsuo Saito (Hiroshima University) * @author Makoto Matsumoto (Hiroshima University) * * Copyright (C) 2007, 2008 Mutsuo Saito, Makoto Matsumoto and * Hiroshima University. All rights reserved. * Copyright (C) 2012 Mutsuo Saito, Makoto Matsumoto, * Hiroshima University and The University of Tokyo. * All rights reserved. * * The new BSD License is applied to this software. * see LICENSE.txt * * @note We assume that your system has inttypes.h. If your system * doesn't have inttypes.h, you have to typedef uint32_t and uint64_t, * and you have to define PRIu64 and PRIx64 in this file as follows: * @verbatim typedef unsigned int uint32_t typedef unsigned long long uint64_t #define PRIu64 "llu" #define PRIx64 "llx" @endverbatim * uint32_t must be exactly 32-bit unsigned integer type (no more, no * less), and uint64_t must be exactly 64-bit unsigned integer type. * PRIu64 and PRIx64 are used for printf function to print 64-bit * unsigned int and 64-bit unsigned int in hexadecimal format. 
*/ #ifndef DSFMT_H #define DSFMT_H #if defined(__cplusplus) extern "C" { #endif #include #include #if !defined(DSFMT_MEXP) #ifdef __GNUC__ #warning "DSFMT_MEXP is not defined. I assume DSFMT_MEXP is 19937." #endif #define DSFMT_MEXP 19937 #endif /*----------------- BASIC DEFINITIONS -----------------*/ /* Mersenne Exponent. The period of the sequence * is a multiple of 2^DSFMT_MEXP-1. * #define DSFMT_MEXP 19937 */ /** DSFMT generator has an internal state array of 128-bit integers, * and N is its size. */ #define DSFMT_N ((DSFMT_MEXP - 128) / 104 + 1) /** N32 is the size of internal state array when regarded as an array * of 32-bit integers.*/ #define DSFMT_N32 (DSFMT_N * 4) /** N64 is the size of internal state array when regarded as an array * of 64-bit integers.*/ #define DSFMT_N64 (DSFMT_N * 2) #if !defined(DSFMT_BIG_ENDIAN) # if defined(__BYTE_ORDER) && defined(__BIG_ENDIAN) # if __BYTE_ORDER == __BIG_ENDIAN # define DSFMT_BIG_ENDIAN 1 # endif # elif defined(_BYTE_ORDER) && defined(_BIG_ENDIAN) # if _BYTE_ORDER == _BIG_ENDIAN # define DSFMT_BIG_ENDIAN 1 # endif # elif defined(__BYTE_ORDER__) && defined(__BIG_ENDIAN__) # if __BYTE_ORDER__ == __BIG_ENDIAN__ # define DSFMT_BIG_ENDIAN 1 # endif # elif defined(BYTE_ORDER) && defined(BIG_ENDIAN) # if BYTE_ORDER == BIG_ENDIAN # define DSFMT_BIG_ENDIAN 1 # endif # elif defined(__BIG_ENDIAN) || defined(_BIG_ENDIAN) \ || defined(__BIG_ENDIAN__) || defined(BIG_ENDIAN) # define DSFMT_BIG_ENDIAN 1 # endif #endif #if defined(DSFMT_BIG_ENDIAN) && defined(__amd64) # undef DSFMT_BIG_ENDIAN #endif #if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) # include #elif defined(_MSC_VER) || defined(__BORLANDC__) # if !defined(DSFMT_UINT32_DEFINED) && !defined(SFMT_UINT32_DEFINED) typedef unsigned int uint32_t; typedef unsigned __int64 uint64_t; # ifndef UINT64_C # define UINT64_C(v) (v ## ui64) # endif # define DSFMT_UINT32_DEFINED # if !defined(inline) && !defined(__cplusplus) # define inline __inline # endif # endif #else # include # if !defined(inline) && !defined(__cplusplus) # if defined(__GNUC__) # define inline __inline__ # else # define inline # endif # endif #endif #ifndef PRIu64 # if defined(_MSC_VER) || defined(__BORLANDC__) # define PRIu64 "I64u" # define PRIx64 "I64x" # else # define PRIu64 "llu" # define PRIx64 "llx" # endif #endif #ifndef UINT64_C # define UINT64_C(v) (v ## ULL) #endif /*------------------------------------------ 128-bit SIMD like data type for standard C ------------------------------------------*/ #if defined(HAVE_ALTIVEC) # if !defined(__APPLE__) # include # endif /** 128-bit data structure */ union W128_T { vector unsigned int s; uint64_t u[2]; uint32_t u32[4]; double d[2]; }; #elif defined(HAVE_SSE2) # include /** 128-bit data structure */ union W128_T { __m128i si; __m128d sd; uint64_t u[2]; uint32_t u32[4]; double d[2]; }; #else /* standard C */ /** 128-bit data structure */ union W128_T { uint64_t u[2]; uint32_t u32[4]; double d[2]; }; #endif /** 128-bit data type */ typedef union W128_T w128_t; /** the 128-bit internal state array */ struct DSFMT_T { w128_t status[DSFMT_N + 1]; int idx; }; typedef struct DSFMT_T dsfmt_t; /** dsfmt internal state vector */ extern dsfmt_t dsfmt_global_data; /** dsfmt mexp for check */ extern const int dsfmt_global_mexp; void dsfmt_gen_rand_all(dsfmt_t *dsfmt); void dsfmt_fill_array_open_close(dsfmt_t *dsfmt, double array[], int size); void dsfmt_fill_array_close_open(dsfmt_t *dsfmt, double array[], int size); void dsfmt_fill_array_open_open(dsfmt_t *dsfmt, double array[], int 
size); void dsfmt_fill_array_close1_open2(dsfmt_t *dsfmt, double array[], int size); void dsfmt_chk_init_gen_rand(dsfmt_t *dsfmt, uint32_t seed, int mexp); void dsfmt_chk_init_by_array(dsfmt_t *dsfmt, uint32_t init_key[], int key_length, int mexp); const char *dsfmt_get_idstring(void); int dsfmt_get_min_array_size(void); #if defined(__GNUC__) # define DSFMT_PRE_INLINE inline static # define DSFMT_PST_INLINE __attribute__((always_inline)) #elif defined(_MSC_VER) && _MSC_VER >= 1200 # define DSFMT_PRE_INLINE __forceinline static # define DSFMT_PST_INLINE #else # define DSFMT_PRE_INLINE inline static # define DSFMT_PST_INLINE #endif DSFMT_PRE_INLINE uint32_t dsfmt_genrand_uint32(dsfmt_t *dsfmt) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_genrand_close1_open2(dsfmt_t *dsfmt) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_genrand_close_open(dsfmt_t *dsfmt) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_genrand_open_close(dsfmt_t *dsfmt) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_genrand_open_open(dsfmt_t *dsfmt) DSFMT_PST_INLINE; DSFMT_PRE_INLINE uint32_t dsfmt_gv_genrand_uint32(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_gv_genrand_close1_open2(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_gv_genrand_close_open(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_gv_genrand_open_close(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double dsfmt_gv_genrand_open_open(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_gv_fill_array_open_close(double array[], int size) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_gv_fill_array_close_open(double array[], int size) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_gv_fill_array_open_open(double array[], int size) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_gv_fill_array_close1_open2(double array[], int size) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_gv_init_gen_rand(uint32_t seed) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_gv_init_by_array(uint32_t init_key[], int key_length) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_init_gen_rand(dsfmt_t *dsfmt, uint32_t seed) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void dsfmt_init_by_array(dsfmt_t *dsfmt, uint32_t init_key[], int key_length) DSFMT_PST_INLINE; /** * This function generates and returns unsigned 32-bit integer. * This is slower than SFMT, only for convenience usage. * dsfmt_init_gen_rand() or dsfmt_init_by_array() must be called * before this function. * @param dsfmt dsfmt internal state date * @return double precision floating point pseudorandom number */ inline static uint32_t dsfmt_genrand_uint32(dsfmt_t *dsfmt) { uint32_t r; uint64_t *psfmt64 = &dsfmt->status[0].u[0]; if (dsfmt->idx >= DSFMT_N64) { dsfmt_gen_rand_all(dsfmt); dsfmt->idx = 0; } r = psfmt64[dsfmt->idx++] & 0xffffffffU; return r; } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range [1, 2). This is * the primitive and faster than generating numbers in other ranges. * dsfmt_init_gen_rand() or dsfmt_init_by_array() must be called * before this function. * @param dsfmt dsfmt internal state date * @return double precision floating point pseudorandom number */ inline static double dsfmt_genrand_close1_open2(dsfmt_t *dsfmt) { double r; double *psfmt64 = &dsfmt->status[0].d[0]; if (dsfmt->idx >= DSFMT_N64) { dsfmt_gen_rand_all(dsfmt); dsfmt->idx = 0; } r = psfmt64[dsfmt->idx++]; return r; } /** * This function generates and returns unsigned 32-bit integer. * This is slower than SFMT, only for convenience usage. 
* dsfmt_gv_init_gen_rand() or dsfmt_gv_init_by_array() must be called * before this function. This function uses \b global variables. * @return double precision floating point pseudorandom number */ inline static uint32_t dsfmt_gv_genrand_uint32(void) { return dsfmt_genrand_uint32(&dsfmt_global_data); } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range [1, 2). * dsfmt_gv_init_gen_rand() or dsfmt_gv_init_by_array() must be called * before this function. This function uses \b global variables. * @return double precision floating point pseudorandom number */ inline static double dsfmt_gv_genrand_close1_open2(void) { return dsfmt_genrand_close1_open2(&dsfmt_global_data); } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range [0, 1). * dsfmt_init_gen_rand() or dsfmt_init_by_array() must be called * before this function. * @param dsfmt dsfmt internal state date * @return double precision floating point pseudorandom number */ inline static double dsfmt_genrand_close_open(dsfmt_t *dsfmt) { return dsfmt_genrand_close1_open2(dsfmt) - 1.0; } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range [0, 1). * dsfmt_gv_init_gen_rand() or dsfmt_gv_init_by_array() must be called * before this function. This function uses \b global variables. * @return double precision floating point pseudorandom number */ inline static double dsfmt_gv_genrand_close_open(void) { return dsfmt_gv_genrand_close1_open2() - 1.0; } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range (0, 1]. * dsfmt_init_gen_rand() or dsfmt_init_by_array() must be called * before this function. * @param dsfmt dsfmt internal state date * @return double precision floating point pseudorandom number */ inline static double dsfmt_genrand_open_close(dsfmt_t *dsfmt) { return 2.0 - dsfmt_genrand_close1_open2(dsfmt); } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range (0, 1]. * dsfmt_gv_init_gen_rand() or dsfmt_gv_init_by_array() must be called * before this function. This function uses \b global variables. * @return double precision floating point pseudorandom number */ inline static double dsfmt_gv_genrand_open_close(void) { return 2.0 - dsfmt_gv_genrand_close1_open2(); } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range (0, 1). * dsfmt_init_gen_rand() or dsfmt_init_by_array() must be called * before this function. * @param dsfmt dsfmt internal state date * @return double precision floating point pseudorandom number */ inline static double dsfmt_genrand_open_open(dsfmt_t *dsfmt) { double *dsfmt64 = &dsfmt->status[0].d[0]; union { double d; uint64_t u; } r; if (dsfmt->idx >= DSFMT_N64) { dsfmt_gen_rand_all(dsfmt); dsfmt->idx = 0; } r.d = dsfmt64[dsfmt->idx++]; r.u |= 1; return r.d - 1.0; } /** * This function generates and returns double precision pseudorandom * number which distributes uniformly in the range (0, 1). * dsfmt_gv_init_gen_rand() or dsfmt_gv_init_by_array() must be called * before this function. This function uses \b global variables. 
* @return double precision floating point pseudorandom number */ inline static double dsfmt_gv_genrand_open_open(void) { return dsfmt_genrand_open_open(&dsfmt_global_data); } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range [1, 2) to the * specified array[] by one call. This function is the same as * dsfmt_fill_array_close1_open2() except that this function uses * \b global variables. * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa dsfmt_fill_array_close1_open2() */ inline static void dsfmt_gv_fill_array_close1_open2(double array[], int size) { dsfmt_fill_array_close1_open2(&dsfmt_global_data, array, size); } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range (0, 1] to the * specified array[] by one call. This function is the same as * dsfmt_gv_fill_array_close1_open2() except the distribution range. * This function uses \b global variables. * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa dsfmt_fill_array_close1_open2() and \sa * dsfmt_gv_fill_array_close1_open2() */ inline static void dsfmt_gv_fill_array_open_close(double array[], int size) { dsfmt_fill_array_open_close(&dsfmt_global_data, array, size); } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range [0, 1) to the * specified array[] by one call. This function is the same as * dsfmt_gv_fill_array_close1_open2() except the distribution range. * This function uses \b global variables. * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa dsfmt_fill_array_close1_open2() \sa * dsfmt_gv_fill_array_close1_open2() */ inline static void dsfmt_gv_fill_array_close_open(double array[], int size) { dsfmt_fill_array_close_open(&dsfmt_global_data, array, size); } /** * This function generates double precision floating point * pseudorandom numbers which distribute in the range (0, 1) to the * specified array[] by one call. This function is the same as * dsfmt_gv_fill_array_close1_open2() except the distribution range. * This function uses \b global variables. * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa dsfmt_fill_array_close1_open2() \sa * dsfmt_gv_fill_array_close1_open2() */ inline static void dsfmt_gv_fill_array_open_open(double array[], int size) { dsfmt_fill_array_open_open(&dsfmt_global_data, array, size); } /** * This function initializes the internal state array with a 32-bit * integer seed. * @param dsfmt dsfmt state vector. * @param seed a 32-bit integer used as the seed. */ inline static void dsfmt_init_gen_rand(dsfmt_t *dsfmt, uint32_t seed) { dsfmt_chk_init_gen_rand(dsfmt, seed, DSFMT_MEXP); } /** * This function initializes the internal state array with a 32-bit * integer seed. This function uses \b global variables. * @param seed a 32-bit integer used as the seed. 
* see also \sa dsfmt_init_gen_rand() */ inline static void dsfmt_gv_init_gen_rand(uint32_t seed) { dsfmt_init_gen_rand(&dsfmt_global_data, seed); } /** * This function initializes the internal state array, * with an array of 32-bit integers used as the seeds. * @param dsfmt dsfmt state vector * @param init_key the array of 32-bit integers, used as a seed. * @param key_length the length of init_key. */ inline static void dsfmt_init_by_array(dsfmt_t *dsfmt, uint32_t init_key[], int key_length) { dsfmt_chk_init_by_array(dsfmt, init_key, key_length, DSFMT_MEXP); } /** * This function initializes the internal state array, * with an array of 32-bit integers used as the seeds. * This function uses \b global variables. * @param init_key the array of 32-bit integers, used as a seed. * @param key_length the length of init_key. * see also \sa dsfmt_init_by_array() */ inline static void dsfmt_gv_init_by_array(uint32_t init_key[], int key_length) { dsfmt_init_by_array(&dsfmt_global_data, init_key, key_length); } #if !defined(DSFMT_DO_NOT_USE_OLD_NAMES) DSFMT_PRE_INLINE const char *get_idstring(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE int get_min_array_size(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void init_gen_rand(uint32_t seed) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void init_by_array(uint32_t init_key[], int key_length) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double genrand_close1_open2(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double genrand_close_open(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double genrand_open_close(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE double genrand_open_open(void) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void fill_array_open_close(double array[], int size) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void fill_array_close_open(double array[], int size) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void fill_array_open_open(double array[], int size) DSFMT_PST_INLINE; DSFMT_PRE_INLINE void fill_array_close1_open2(double array[], int size) DSFMT_PST_INLINE; /** * This function is just the same as dsfmt_get_idstring(). * @return id string. * see also \sa dsfmt_get_idstring() */ inline static const char *get_idstring(void) { return dsfmt_get_idstring(); } /** * This function is just the same as dsfmt_get_min_array_size(). * @return minimum size of array used for fill_array functions. * see also \sa dsfmt_get_min_array_size() */ inline static int get_min_array_size(void) { return dsfmt_get_min_array_size(); } /** * This function is just the same as dsfmt_gv_init_gen_rand(). * @param seed a 32-bit integer used as the seed. * see also \sa dsfmt_gv_init_gen_rand(), \sa dsfmt_init_gen_rand(). */ inline static void init_gen_rand(uint32_t seed) { dsfmt_gv_init_gen_rand(seed); } /** * This function is just the same as dsfmt_gv_init_by_array(). * @param init_key the array of 32-bit integers, used as a seed. * @param key_length the length of init_key. * see also \sa dsfmt_gv_init_by_array(), \sa dsfmt_init_by_array(). */ inline static void init_by_array(uint32_t init_key[], int key_length) { dsfmt_gv_init_by_array(init_key, key_length); } /** * This function is just the same as dsfmt_gv_genrand_close1_open2(). * @return double precision floating point number. * see also \sa dsfmt_genrand_close1_open2() \sa * dsfmt_gv_genrand_close1_open2() */ inline static double genrand_close1_open2(void) { return dsfmt_gv_genrand_close1_open2(); } /** * This function is just the same as dsfmt_gv_genrand_close_open(). * @return double precision floating point number. 
* see also \sa dsfmt_genrand_close_open() \sa * dsfmt_gv_genrand_close_open() */ inline static double genrand_close_open(void) { return dsfmt_gv_genrand_close_open(); } /** * This function is just the same as dsfmt_gv_genrand_open_close(). * @return double precision floating point number. * see also \sa dsfmt_genrand_open_close() \sa * dsfmt_gv_genrand_open_close() */ inline static double genrand_open_close(void) { return dsfmt_gv_genrand_open_close(); } /** * This function is just the same as dsfmt_gv_genrand_open_open(). * @return double precision floating point number. * see also \sa dsfmt_genrand_open_open() \sa * dsfmt_gv_genrand_open_open() */ inline static double genrand_open_open(void) { return dsfmt_gv_genrand_open_open(); } /** * This function is juset the same as dsfmt_gv_fill_array_open_close(). * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa dsfmt_gv_fill_array_open_close(), \sa * dsfmt_fill_array_close1_open2(), \sa * dsfmt_gv_fill_array_close1_open2() */ inline static void fill_array_open_close(double array[], int size) { dsfmt_gv_fill_array_open_close(array, size); } /** * This function is juset the same as dsfmt_gv_fill_array_close_open(). * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa dsfmt_gv_fill_array_close_open(), \sa * dsfmt_fill_array_close1_open2(), \sa * dsfmt_gv_fill_array_close1_open2() */ inline static void fill_array_close_open(double array[], int size) { dsfmt_gv_fill_array_close_open(array, size); } /** * This function is juset the same as dsfmt_gv_fill_array_open_open(). * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. * see also \sa dsfmt_gv_fill_array_open_open(), \sa * dsfmt_fill_array_close1_open2(), \sa * dsfmt_gv_fill_array_close1_open2() */ inline static void fill_array_open_open(double array[], int size) { dsfmt_gv_fill_array_open_open(array, size); } /** * This function is juset the same as dsfmt_gv_fill_array_close1_open2(). * @param array an array where pseudorandom numbers are filled * by this function. * @param size the number of pseudorandom numbers to be generated. 
* see also \sa dsfmt_fill_array_close1_open2(), \sa * dsfmt_gv_fill_array_close1_open2() */ inline static void fill_array_close1_open2(double array[], int size) { dsfmt_gv_fill_array_close1_open2(array, size); } #endif /* DSFMT_DO_NOT_USE_OLD_NAMES */ #if defined(__cplusplus) } #endif #endif /* DSFMT_H */ traildb-0.6+dfsg1/src/tdb_profile.h0000600000175000017500000000157513106440271016535 0ustar czchenczchen#ifndef __TDB_PROFILE_H__ #define __TDB_PROFILE_H__ #include #include #include #ifdef TDB_PROFILE #define TDB_TIMER_DEF struct timeval __start; struct timeval __end; #define TDB_TIMER_START gettimeofday(&__start, NULL); #define TDB_TIMER_END(msg)\ gettimeofday(&__end, NULL);\ fprintf(stderr, "PROF: %s took %ldms\n", msg,\ ((__end.tv_sec * 1000000L + __end.tv_usec) -\ (__start.tv_sec * 1000000L + __start.tv_usec)) / 1000); #else #ifdef TDB_PROFILE_CPU #define TDB_TIMER_DEF clock_t __start; #define TDB_TIMER_START __start = clock(); #define TDB_TIMER_END(msg) fprintf(stderr, "PROF: %s took %2.4fms (CPU time)\n", msg,\ ((double) (clock() - __start)) / (CLOCKS_PER_SEC / 1000.0)); #else #define TDB_TIMER_DEF #define TDB_TIMER_START #define TDB_TIMER_END(x) #endif #endif #endif /* __TDB_PROFILE_H__ */ traildb-0.6+dfsg1/src/tdb_multi_cursor.c0000600000175000017500000002111413106440271017606 0ustar czchenczchen #include #include #include #include "traildb.h" #include "tdb_internal.h" #include "pqueue/pqueue.h" /* Multi-cursor merges events from K cursors (trails) in a single stream of timestamp ordered events on the fly. Merging is done with a simple priority queue (heap), implemented by the pqueue library. A key feature of multi-cursor is that it performs merging in a zero-copy fashion by relying on the event buffers of underlying cursors. A downside of this performance optimization is that the buffer/event lifetime needs to be managed carefully. If the buffer of an underlying cursor is exhausted temporarily, the cursor can't be refreshed right away since this would invalidate the past events. Instead, the cursor is marked as dirty in popped_node, so it can be reinstered when multi-cursor is called the next time. 
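
   A minimal usage sketch (illustrative, not part of the original sources;
   it assumes the cursors have already been created with tdb_cursor_new()
   and positioned on the desired trails with tdb_get_trail(), both declared
   in traildb.h, and process() stands for any application-defined handler):

       tdb_multi_cursor *mc = tdb_multi_cursor_new(cursors, num_cursors);
       const tdb_multi_event *mev;
       if (mc){
           while ((mev = tdb_multi_cursor_next(mc)))
               process(mev->db, mev->cursor_idx, mev->event);
           tdb_multi_cursor_free(mc);
       }

   Events are returned in ascending timestamp order across all cursors.
   If an underlying cursor is repositioned afterwards (e.g. by another
   tdb_get_trail() call), tdb_multi_cursor_reset() must be called before
   iterating further.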
*/ struct mcursor_node{ pqueue_pri_t timestamp; size_t pos; tdb_cursor *cursor; uint64_t index; }; struct tdb_multi_cursor{ /* priority queue */ pqueue_t *queue; struct mcursor_node *nodes; uint64_t num_nodes; /* node that needs to be reinserted on the next call */ struct mcursor_node *popped_node; /* returned event buffer */ tdb_multi_event current_event; }; /* pqueue callback functions */ static int cmp_pri(pqueue_pri_t next, pqueue_pri_t cur) { return next > cur; } static pqueue_pri_t get_pri(void *a) { return ((struct mcursor_node*)a)->timestamp; } static void set_pri(void *a, pqueue_pri_t timestamp) { ((struct mcursor_node*)a)->timestamp = timestamp; } static size_t get_pos(void *a) { return ((struct mcursor_node*)a)->pos; } static void set_pos(void *a, size_t pos) { ((struct mcursor_node*)a)->pos = pos; } static void print_entry(FILE *out, void *a) { struct mcursor_node *node = (struct mcursor_node*)a; fprintf(out, "node[%"PRIu64"] timestamp %"PRIu64"\n", node->index, (uint64_t)node->timestamp); } TDB_EXPORT tdb_multi_cursor *tdb_multi_cursor_new(tdb_cursor **cursors, uint64_t num_cursors) { tdb_multi_cursor *mc = NULL; uint64_t i; if (num_cursors > SIZE_MAX - 1) return NULL; if (!(mc = calloc(1, sizeof(struct tdb_multi_cursor)))) goto err; if (!(mc->nodes = calloc(num_cursors, sizeof(struct mcursor_node)))) goto err; if (!(mc->queue = pqueue_init(num_cursors, cmp_pri, get_pri, set_pri, get_pos, set_pos))) goto err; mc->num_nodes = num_cursors; for (i = 0; i < num_cursors; i++){ mc->nodes[i].cursor = cursors[i]; mc->nodes[i].index = i; } tdb_multi_cursor_reset(mc); return mc; err: tdb_multi_cursor_free(mc); return NULL; } /* Reinitialize the priority queue after the state of the underlying cursors has changed, e.g. after tdb_get_trail(). */ TDB_EXPORT void tdb_multi_cursor_reset(tdb_multi_cursor *mc) { uint64_t i; pqueue_reset(mc->queue); for (i = 0; i < mc->num_nodes; i++){ const tdb_event *event = tdb_cursor_peek(mc->nodes[i].cursor); if (event){ mc->nodes[i].timestamp = event->timestamp; /* we can ignore the return value of pqueue_insert since it won't need to call realloc() */ pqueue_insert(mc->queue, &mc->nodes[i]); } } mc->popped_node = NULL; } /* Reinsert exhausted cursor in the heap (see the top of this file for an explanation) */ static inline void reinsert_popped(tdb_multi_cursor *mc) { if (mc->popped_node){ const tdb_event *event = tdb_cursor_peek(mc->popped_node->cursor); if (event){ mc->popped_node->timestamp = event->timestamp; pqueue_insert(mc->queue, mc->popped_node); } mc->popped_node = NULL; } } /* Peek the next event to be returned */ TDB_EXPORT const tdb_multi_event *tdb_multi_cursor_peek(tdb_multi_cursor *mc) { const tdb_event *next_event; struct mcursor_node *node; reinsert_popped(mc); node = (struct mcursor_node*)pqueue_peek(mc->queue); if (!node) return NULL; mc->current_event.event = tdb_cursor_peek(node->cursor); mc->current_event.db = node->cursor->state->db; mc->current_event.cursor_idx = node->index; return &mc->current_event; } /* Return the next event */ TDB_EXPORT const tdb_multi_event *tdb_multi_cursor_next(tdb_multi_cursor *mc) { struct mcursor_node *node; reinsert_popped(mc); node = (struct mcursor_node*)pqueue_peek(mc->queue); if (!node) return NULL; mc->current_event.event = tdb_cursor_next(node->cursor); mc->current_event.db = node->cursor->state->db; mc->current_event.cursor_idx = node->index; if (node->cursor->num_events_left){ /* the event buffer of the cursor has remaining events, so we can just update the heap with the next timestamp */ 
const tdb_event *next_event = (const tdb_event*)node->cursor->next_event; pqueue_change_priority(mc->queue, next_event->timestamp, node); }else{ /* the event buffer of the cursor is empty. We don't know the next timestamp, so mark this cursor as dirty in popped_node (calling tdb_cursor_peek() would invalidate mc->current_event.event) */ pqueue_pop(mc->queue); mc->popped_node = node; } return &mc->current_event; } /* Return a batch of events. This is an optimized version of tdb_multi_cursor_next() */ TDB_EXPORT uint64_t tdb_multi_cursor_next_batch(tdb_multi_cursor *mc, tdb_multi_event *events, uint64_t max_events) { uint64_t n = 0; reinsert_popped(mc); /* next batch relies on the following heuristic: Often an individual cursor contains a long run of events whose timestamp is smaller than those of any other cursor, e.g. if there is a cursor for each daily traildb. Updating the heap for every single event is expensive and unnecessary in this case. It suffices to update the heap only when we switch cursors. This logic is implemented by popping the current cursor and peeking the next one: We can consume the current cursor as long as its timestamps are smaller than that of the next cursor. */ while (n < max_events){ struct mcursor_node *current = (struct mcursor_node*)pqueue_pop(mc->queue); struct mcursor_node *next = (struct mcursor_node*)pqueue_peek(mc->queue); tdb_cursor *cur; uint64_t next_timestamp = 0; int is_last = 0; if (current) cur = current->cursor; else /* heap is empty - all cursors exhausted */ break; if (next) /* consume the current cursor while timestamps are smaller than next_timestamp */ next_timestamp = next->timestamp; else /* no next timestamp - current is the last one. We can consume current until its end */ is_last = 1; while (1){ if (cur->num_events_left){ /* there are events left in the buffer */ const tdb_event *event = (const tdb_event*)cur->next_event; if (n < max_events && (is_last || event->timestamp <= next_timestamp)){ events[n].event = event; events[n].db = cur->state->db; events[n].cursor_idx = current->index; ++n; tdb_cursor_next(cur); }else{ /* update the timestamp of the current cursor and return it to the heap */ current->timestamp = event->timestamp; pqueue_insert(mc->queue, current); break; } }else{ /* no events left in the buffer, we must stop iterating to avoid the previous events from becoming invalid */ mc->popped_node = current; goto done; } } } done: return n; } TDB_EXPORT void tdb_multi_cursor_free(tdb_multi_cursor *mc) { if (mc){ if (mc->queue) pqueue_free(mc->queue); free(mc->nodes); free(mc); } } traildb-0.6+dfsg1/src/tdb_types.h0000600000175000017500000000630113106440271016231 0ustar czchenczchen #ifndef __TDB_TYPES_H__ #define __TDB_TYPES_H__ #include #include "tdb_limits.h" /* Internally we deal with ids: (uint64_t) trail_id -> (16 byte) uuid (uint32_t) field -> (0-terminated str) field_name (uint64_t) val -> (bytes) value The complete picture looks like: uuid -> trail_id trail_id -> [event, ...] event := [timestamp, item, ...] item := (field, val) field -> field_name val -> value There are two types of tdb_items, narrow (32-bit) and wide (64-bit): Narrow item (32 bit): [ field | wide-flag | val ] 7 1 24 Wide item (64 bit): [ field | wide-flag | ext-field | ext-flag | val | reserved ] 7 1 7 1 40 8 'ext-flag' and 'reserved' are reserved for future needs. Note that timestamp items may use 47 bits, i.e. they have only 7 bits reserved space. 
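
   A worked example (illustrative only): with field=3 and val=0x1234,
   both fit the narrow limits, so tdb_make_item(3, 0x1234) produces

       3 | (0x1234 << 8) = 0x123403

   and tdb_item_field(0x123403) == 3, tdb_item_val(0x123403) == 0x1234.
   A val that needs more than 24 bits, or a field that does not fit in
   the 7-bit narrow field slot, sets the wide-flag bit (0x80) and
   switches to the wide layout, where the value is stored starting at
   bit offset 16 (see tdb_make_item below).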
*/ typedef uint32_t tdb_field; typedef uint64_t tdb_val; typedef uint64_t tdb_item; typedef struct _tdb_cons tdb_cons; typedef struct _tdb tdb; typedef struct __attribute__((packed)){ uint64_t timestamp; uint64_t num_items; const tdb_item items[0]; } tdb_event; typedef struct{ const tdb *db; const tdb_event *event; uint64_t cursor_idx; } tdb_multi_event; typedef struct{ struct tdb_decode_state *state; const char *next_event; uint64_t num_events_left; } tdb_cursor; typedef struct tdb_multi_cursor tdb_multi_cursor; #define tdb_item_field32(item) (item & 127) #define tdb_item_val32(item) ((item >> 8) & UINT32_MAX) #define tdb_item_is32(item) (!(item & 128)) static inline tdb_field tdb_item_field(tdb_item item) { if (tdb_item_is32(item)) return (tdb_field)tdb_item_field32(item); else return (tdb_field)((item & 127) | (((item >> 8) & 127) << 7)); } static inline tdb_val tdb_item_val(tdb_item item) { if (tdb_item_is32(item)) return (tdb_val)tdb_item_val32(item); else return (tdb_val)(item >> 16); } static inline tdb_item tdb_make_item(tdb_field field, tdb_val val) { /* here we assume that val < 2^48 and field < 2^14 */ if (field > TDB_FIELD32_MAX || val > TDB_VAL32_MAX){ const uint64_t field1 = field & 127; const uint64_t field2 = (field >> 7) << 8; return field1 | 128 | field2 | (val << 16); }else return field | (val << 8); } typedef enum{ /* reading */ TDB_OPT_ONLY_DIFF_ITEMS = 100, TDB_OPT_EVENT_FILTER = 101, TDB_OPT_CURSOR_EVENT_BUFFER_SIZE = 102, /* writing */ TDB_OPT_CONS_OUTPUT_FORMAT = 1001, TDB_OPT_CONS_NO_BIGRAMS = 1002, } tdb_opt_key; typedef union{ const void *ptr; uint64_t value; } tdb_opt_value; static const tdb_opt_value TDB_TRUE __attribute__((unused)) = {.value = 1}; static const tdb_opt_value TDB_FALSE __attribute__((unused)) = {.value = 0}; #define opt_val(x) ((tdb_opt_value){.value = x}) #define TDB_OPT_CONS_OUTPUT_FORMAT_DIR 0 #define TDB_OPT_CONS_OUTPUT_FORMAT_PACKAGE 1 typedef enum { TDB_EVENT_FILTER_UNKNOWN_TERM = 0, TDB_EVENT_FILTER_MATCH_TERM = 1, TDB_EVENT_FILTER_TIME_RANGE_TERM = 2 } tdb_event_filter_term_type; #endif /* __TDB_TYPES_H__ */ traildb-0.6+dfsg1/src/tdb_uuid.c0000600000175000017500000000661713106440271016040 0ustar czchenczchen #include "traildb.h" #define TDB_EXPORT __attribute__((visibility("default"))) static const uint8_t HEXBYTES[] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // !"#$%&' 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ()*+,-./ 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, // 01234567 0x09, 0x0a, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // 89:;<=>? 0x00, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x00, // @ABCDEFG 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // HIJKLMNO 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // PQRSTUVW 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // XYZ[\]^_ 0x00, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x00, // `abcdefg 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // hijklmno 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // pqrstuvw 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // xyz{|}~. 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ........ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 // ........ }; static const uint8_t HEXCHARS[] = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f" "202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f" "404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f" "606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f" "808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f" "a0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebf" "c0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedf" "e0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"; TDB_EXPORT int tdb_uuid_raw(const uint8_t hexuuid[32], uint8_t uuid[16]) { int i; for (i = 0; i < 32; i += 2){ uint8_t c1 = HEXBYTES[hexuuid[i]]; uint8_t c2 = HEXBYTES[hexuuid[i + 1]]; if (c1 && c2) uuid[i / 2] = (uint8_t)(((c1 - 1) << 4) | (c2 - 1)); else return TDB_ERR_INVALID_UUID; } return 0; } TDB_EXPORT void tdb_uuid_hex(const uint8_t uuid[16], uint8_t hexuuid[32]) { int i; for (i = 0; i < 16; i++){ hexuuid[i * 2] = HEXCHARS[uuid[i] * 2]; hexuuid[i * 2 + 1] = HEXCHARS[uuid[i] * 2 + 1]; } } traildb-0.6+dfsg1/src/tdb_cons_package.c0000600000175000017500000003200213106440271017472 0ustar czchenczchen#ifdef HAVE_ARCHIVE_H #define _DEFAULT_SOURCE /* mkstemp() */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include "tdb_package.h" /* NOTE! DO NOT change HEADER_FILES since we guarantee that these files (and TOC_FILE) can be found at fixed offsets */ static const char *HEADER_FILES[] = {"version", /* TODO add "id" (id should have a magic prefix) */ "info"}; static const char *DATA_FILES[] = {"fields", "trails.codebook", "trails.toc", "trails.data", "uuids"}; static const char TOC_FILE[] = "tar.toc"; static inline void debug_print(char __attribute__((unused)) *fmt, ...) 
{ #ifdef TDB_PACKAGE_DEBUG va_list aptr; va_start(aptr, fmt); vfprintf(stderr, fmt, aptr); va_end(aptr); #endif } static inline tdb_error write_toc_entry(FILE *toc_file, const char *fname, uint64_t offset, uint64_t size) { int ret = 0; TDB_FPRINTF(toc_file, "%s %"PRIu64" %"PRIu64"\n", fname, offset, size); done: return ret; } static tdb_error write_header(struct archive *tar, struct archive_entry *entry, const char *fname, uint64_t size) { if (size > INT64_MAX) return TDB_ERR_IO_PACKAGE; archive_entry_clear(entry); archive_entry_set_pathname(entry, fname); archive_entry_set_size(entry, (int64_t)size); archive_entry_set_filetype(entry, AE_IFREG); archive_entry_set_perm(entry, 0644); if (archive_write_header(tar, entry) != ARCHIVE_OK) return TDB_ERR_IO_PACKAGE; return 0; } static tdb_error write_file_entry(struct archive *tar, struct archive_entry *entry, const char *src, const char *root, FILE *toc_file) { const uint64_t BUFFER_SIZE = 65536; char buffer[BUFFER_SIZE]; char path[TDB_MAX_PATH_SIZE]; struct stat stats; int fd = 0; int ret = 0; uint64_t num_left; TDB_PATH(path, "%s/%s", root, src); if ((fd = open(path, O_RDONLY)) == -1){ debug_print("opening source file %s failed\n", path); ret = TDB_ERR_IO_PACKAGE; goto done; } if (fstat(fd, &stats)){ debug_print("fstat on source file %s failed\n", path); ret = TDB_ERR_IO_PACKAGE; goto done; } if ((ret = write_header(tar, entry, src, (uint64_t)stats.st_size))){ debug_print("write_header for %s failed\n", src); goto done; } if ((ret = write_toc_entry(toc_file, src, (uint64_t)archive_filter_bytes(tar, -1), (uint64_t)stats.st_size))){ debug_print("write_toc_entry for source file %s failed\n", src); goto done; } for (num_left = (uint64_t)stats.st_size; num_left > 0;){ ssize_t r = read(fd, buffer, BUFFER_SIZE); if (r < 1){ debug_print("reading source file %s failed\n", path); ret = TDB_ERR_IO_PACKAGE; goto done; } if (archive_write_data(tar, buffer, (size_t)r) != r){ debug_print("writing file %s failed\n", src); ret = TDB_ERR_IO_PACKAGE; goto done; } num_left -= (uint64_t)r; } /* once the file has been successfully appended to the archive, we delete the source to save disk space */ if (unlink(path)){ ret = TDB_ERR_IO_PACKAGE; debug_print("unlinking %s failed\n", path); goto done; } done: if (fd) close(fd); return ret; } static tdb_error write_entries(struct archive *tar, struct archive_entry *entry, const char **files, uint64_t num_files, const tdb_cons *cons, FILE *toc_file) { uint64_t i; int ret = 0; for (i = 0; i < num_files; i++) if ((ret = write_file_entry(tar, entry, files[i], cons->root, toc_file))) goto done; done: return ret; } static tdb_error init_tar_toc(struct archive *tar, struct archive_entry *entry, const tdb_cons *cons, FILE *toc_file, uint64_t *toc_offset, uint64_t *toc_max_size) { /* We need to preallocate enough space for tar.toc file. Since we don't know the exact offsets yet, we allocate the absolute maximum the toc can take. 
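
   For illustration (the offsets and sizes below are made up), the
   finished toc is a plain-text listing, preceded by a magic line and
   terminated by an empty line, with one "<filename> <offset> <size>"
   line per archive member as written by write_toc_entry():

       version 1536 2
       info 2048 61
       tar.toc 3072 512
       lexicon.type 4608 1201
       ...

   Since the real offsets are only known after the data has been
   written, this function reserves space for the longest possible
   listing, and write_tar_toc() later seeks back and fills in the
   actual contents.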
*/ /* VALUE_SIZE = len(' %d %d\n' % (2**64, 2**64)) */ static const uint64_t VALUE_SIZE = 43; static const uint64_t LEXICON_PREFIX_LEN = 8; /* = len("lexicon.") */ uint64_t i, size = strlen(TOC_FILE) + VALUE_SIZE + strlen(TDB_TAR_MAGIC); char *buffer = NULL; int ret = 0; for (i = 0; i < sizeof(HEADER_FILES) / sizeof(HEADER_FILES[0]); i++) size += strlen(HEADER_FILES[i]) + VALUE_SIZE; for (i = 0; i < sizeof(DATA_FILES) / sizeof(DATA_FILES[0]); i++) size += strlen(DATA_FILES[i]) + VALUE_SIZE; for (i = 0; i < cons->num_ofields; i++) size += strlen(cons->ofield_names[i]) + LEXICON_PREFIX_LEN + VALUE_SIZE; *toc_max_size = ++size; /* empty line in the end */ if ((ret = write_header(tar, entry, TOC_FILE, size))){ debug_print("write_header for TOC_FILE failed\n"); goto done; } /* archives are unreadable if the TOC is not found exactly at the right offset. Assert that this requirement is not violated. */ *toc_offset = (uint64_t)archive_filter_bytes(tar, -1); if (*toc_offset != TOC_FILE_OFFSET){ debug_print("assert failed: invalid toc offset: %"PRIu64"\n", *toc_offset); ret = TDB_ERR_IO_PACKAGE; goto done; } if ((ret = write_toc_entry(toc_file, TOC_FILE, *toc_offset, size))){ debug_print("write_toc_entry for TOC_FILE failed\n"); goto done; } /* we just need an array of null bytes */ if (!(buffer = calloc(1, size))){ ret = TDB_ERR_NOMEM; goto done; } if (archive_write_data(tar, buffer, size) != (ssize_t)size){ debug_print("reserving %"PRIu64" bytes for TOC_FILE failed\n", size); ret = TDB_ERR_IO_PACKAGE; goto done; } done: free(buffer); return ret; } static tdb_error write_lexicons(struct archive *tar, struct archive_entry *entry, const tdb_cons *cons, FILE *toc_file) { char path[TDB_MAX_PATH_SIZE]; uint64_t i; int ret = 0; for (i = 0; i < cons->num_ofields; i++){ TDB_PATH(path, "lexicon.%s", cons->ofield_names[i]); if ((ret = write_file_entry(tar, entry, path, cons->root, toc_file))) goto done; } done: return ret; } static tdb_error write_tar_toc(int fd, FILE *toc_file, uint64_t toc_offset, uint64_t toc_max_size) { /* we rewind back to the position of the TOC_FILE and actually fill in the contents of the file */ const uint64_t BUFFER_SIZE = 65536; char buffer[BUFFER_SIZE]; uint64_t n, i, toc_size; int ret = 0; long offset; /* find an empty line denoting EOF in toc_file */ TDB_FPRINTF(toc_file, "\n"); if ((offset = ftell(toc_file)) == -1){ debug_print("ftell(toc_file) failed\n"); ret = TDB_ERR_IO_PACKAGE; goto done; } toc_size = (uint64_t)offset; /* assert that our max_size estimate is not broken */ if (toc_size > toc_max_size){ debug_print("assert failed: toc_size %"PRIu64" > %"PRIu64"\n", toc_size, toc_max_size); ret = TDB_ERR_IO_PACKAGE; goto done; } rewind(toc_file); if (lseek(fd, (off_t)toc_offset, SEEK_SET) != (off_t)toc_offset){ debug_print("lseek(fd) failed\n"); ret = TDB_ERR_IO_PACKAGE; goto done; } for (n = 0; n < toc_size;){ size_t r = fread(buffer, 1, BUFFER_SIZE, toc_file); if (r < 1){ debug_print("fread() toc_file failed\n"); ret = TDB_ERR_IO_READ; goto done; } for (i = 0; i < r;){ ssize_t w = write(fd, &buffer[i], r - i); if (w < 1){ debug_print("write(fd) failed\n"); ret = TDB_ERR_IO_PACKAGE; goto done; } i += (uint64_t)w; } n += r; } done: return ret; } tdb_error cons_package(const tdb_cons *cons) { char dst_path[TDB_MAX_PATH_SIZE]; char path[TDB_MAX_PATH_SIZE]; struct archive *tar = NULL; int fd = 0; int ret = 0; FILE *toc_file; struct archive_entry *entry = archive_entry_new(); uint64_t toc_offset = 0; uint64_t toc_max_size = 0; if (!entry) return TDB_ERR_NOMEM; /* 1) open archive 
*/ if (!(tar = archive_write_new())){ debug_print("archive_write_new() failed\n"); ret = TDB_ERR_NOMEM; goto done; } if (archive_write_set_format_gnutar(tar) != ARCHIVE_OK){ debug_print("archive_write_set_format_gnutar() failed\n"); ret = TDB_ERR_IO_PACKAGE; goto done; } TDB_PATH(dst_path, "%s.tdb.XXXXXX", cons->root); if ((fd = mkstemp(dst_path)) == -1){ debug_print("mkstemp(%s) failed\n", dst_path); ret = TDB_ERR_IO_PACKAGE; goto done; } if (archive_write_open_fd(tar, fd) != ARCHIVE_OK){ debug_print("archive_write_open_fd(%s) failed\n", dst_path); ret = TDB_ERR_IO_PACKAGE; goto done; } /* open TOC_FILE */ TDB_PATH(path, "%s/%s", cons->root, TOC_FILE); TDB_OPEN(toc_file, path, "w+"); TDB_FPRINTF(toc_file, TDB_TAR_MAGIC); /* 2) write header files */ if ((ret = write_entries(tar, entry, HEADER_FILES, sizeof(HEADER_FILES) / sizeof(HEADER_FILES[0]), cons, toc_file))) goto done; /* 3) write tar toc */ if ((ret = init_tar_toc(tar, entry, cons, toc_file, &toc_offset, &toc_max_size))) goto done; /* 4) write lexicons */ if ((ret = write_lexicons(tar, entry, cons, toc_file))) goto done; /* 5) write data */ if ((ret = write_entries(tar, entry, DATA_FILES, sizeof(DATA_FILES) / sizeof(DATA_FILES[0]), cons, toc_file))) goto done; /* 6) finalize archive */ if (archive_write_free(tar) != ARCHIVE_OK){ ret = TDB_ERR_IO_PACKAGE; /* don't call archive_write_free twice */ tar = NULL; goto done; } tar = NULL; /* 7) write toc */ if ((ret = write_tar_toc(fd, toc_file, toc_offset, toc_max_size))) goto done; /* delete TOC_FILE */ fclose(toc_file); unlink(path); /* fsync() is required to ensure integrity of the package */ if (fsync(fd)){ debug_print("fsync failed\n"); ret = TDB_ERR_IO_CLOSE; goto done; } if (close(fd)){ debug_print("close failed\n"); ret = TDB_ERR_IO_CLOSE; /* never call close() twice, even if it fails */ fd = 0; goto done; } fd = 0; /* 9) rename archive */ TDB_PATH(path, "%s.tdb", cons->root); if (rename(dst_path, path)){ debug_print("rename to %s -> %s failed\n", dst_path, path); ret = TDB_ERR_IO_CLOSE; goto done; } /* rmdir() failing (most often because the directory is not empty), is not considered a fatal error: This can happen e.g. 
if the directory contains remnants of a previously failed tdb_cons, which is harmless */ if (rmdir(cons->root)) debug_print("rmdir(%s) failed\n", cons->root); done: archive_entry_free(entry); if (fd) close(fd); if (tar){ if (archive_write_free(tar) != ARCHIVE_OK) ret = TDB_ERR_IO_PACKAGE; } return ret; } #endif /* HAVE_ARCHIVE_H */ traildb-0.6+dfsg1/src/tdb_queue.c0000600000175000017500000000232413106440271016205 0ustar czchenczchen#include #include #include #include "tdb_queue.h" struct tdb_queue{ void **q; uint32_t max; uint32_t head; uint32_t tail; uint32_t count; }; struct tdb_queue *tdb_queue_new(uint32_t max_length) { struct tdb_queue *q = NULL; if (!max_length) return NULL; if (!(q = malloc(sizeof(struct tdb_queue)))) return NULL; if (!(q->q = malloc(max_length * sizeof(void*)))) return NULL; q->max = max_length; q->head = q->tail = q->count = 0; return q; } void tdb_queue_free(struct tdb_queue *q) { free(q->q); free(q); } void tdb_queue_push(struct tdb_queue *q, void *e) { if (q->max == q->count++){ fprintf(stderr, "tdb_queue_push: max=%d, count=%d " "(this should never happen!)", q->max, q->count); abort(); } q->q[q->head++ % q->max] = e; } void *tdb_queue_pop(struct tdb_queue *q) { if (!q->count) return NULL; --q->count; return q->q[q->tail++ % q->max]; } uint32_t tdb_queue_length(const struct tdb_queue *q) { return q->count; } void *tdb_queue_peek(const struct tdb_queue *q) { if (!q->count) return NULL; return q->q[q->tail % q->max]; } traildb-0.6+dfsg1/src/tdb_decode.c0000600000175000017500000002533713106440271016315 0ustar czchenczchen#include "tdb_internal.h" #include "tdb_huffman.h" #define CURSOR_FILTER 1 #define TRAIL_FILTER 2 static inline uint64_t tdb_get_trail_offs(const tdb *db, uint64_t trail_id) { if (db->trails.size < UINT32_MAX) return ((const uint32_t*)db->toc.data)[trail_id]; else return ((const uint64_t*)db->toc.data)[trail_id]; } static int event_satisfies_filter(const tdb_item *event, uint64_t timestamp, const tdb_item *filter, uint64_t filter_len) { uint64_t i = 0; while (i < filter_len){ uint64_t clause_len = filter[i++]; uint64_t next_clause = i + clause_len; int match = 0; if (next_clause > filter_len) return 0; while (i < next_clause){ uint64_t op_flags = filter[i++]; uint64_t filter_item = filter[i++]; /* Time range queries */ if (op_flags & TDB_EVENT_TIME_RANGE) { uint64_t end_filter = filter[i++]; if (filter_item <= timestamp && timestamp < end_filter) { match = 1; break; } } else { /* Item-matching queries */ uint64_t is_negative = op_flags & TDB_EVENT_NEGATED; tdb_field field = tdb_item_field(filter_item); if (field){ if ((event[field] == filter_item) != is_negative){ match = 1; break; } } else { if (is_negative) { match = 1; break; } } } } if (!match){ return 0; } i = next_clause; } return 1; } TDB_EXPORT tdb_cursor *tdb_cursor_new(const tdb *db) { tdb_cursor *c = NULL; if (!(c = calloc(1, sizeof(tdb_cursor)))) goto err; if (!(c->state = calloc(1, sizeof(struct tdb_decode_state) + db->num_fields * sizeof(tdb_item)))) goto err; c->state->db = db; c->state->edge_encoded = db->opt_edge_encoded; c->state->events_buffer_len = db->opt_cursor_event_buffer_size; /* set the filter type to TRAIL_FILTER initially so it can be overriden with the right value in tdb_get_trail() */ c->state->filter_type = TRAIL_FILTER; if (!(c->state->events_buffer = calloc(c->state->events_buffer_len, (db->num_fields + 1) * sizeof(tdb_item)))) goto err; return c; err: tdb_cursor_free(c); return NULL; } TDB_EXPORT void tdb_cursor_free(tdb_cursor *c) { if (c){ 
free(c->state->events_buffer); free(c->state); free(c); } } TDB_EXPORT void tdb_cursor_unset_event_filter(tdb_cursor *cursor) { cursor->state->filter = NULL; cursor->state->filter_type = TRAIL_FILTER; } TDB_EXPORT tdb_error tdb_cursor_set_event_filter(tdb_cursor *cursor, const struct tdb_event_filter *filter) { if (cursor->state->edge_encoded) return TDB_ERR_ONLY_DIFF_FILTER; else{ cursor->state->filter = filter; cursor->state->filter_type = CURSOR_FILTER; return TDB_ERR_OK; } } TDB_EXPORT tdb_error tdb_get_trail(tdb_cursor *cursor, uint64_t trail_id) { struct tdb_decode_state *s = cursor->state; const tdb *db = s->db; tdb_error err = 0; if (trail_id < db->num_trails){ /* initialize cursor for a new trail */ uint64_t trail_size; tdb_field field; /* db->opt_event_filter may have changed since the last tdb_get_trail call, so we will always reset it. Also we need to reset any trail-level filter that may have been set previously. */ if (s->filter_type == TRAIL_FILTER){ if (db->opt_event_filter){ /* apply a db-level filter, may be overriden by a trail-level below */ if (s->edge_encoded){ /* setting a filter in the edge-encoded mode fails as in tdb_cursor_set_event_filter above */ err = TDB_ERR_ONLY_DIFF_FILTER; goto done; }else s->filter = db->opt_event_filter; }else s->filter = NULL; } /* we can apply a trail-level filter only if trail-level filters exist AND a cursor-level filter wasn't set */ if (db->opt_trail_event_filters && s->filter_type != CURSOR_FILTER){ Word_t *ptr; JLG(ptr, db->opt_trail_event_filters, trail_id); if (ptr){ if (s->edge_encoded){ /* setting a filter in the edge-encoded mode fails as in tdb_cursor_set_event_filter above */ err = TDB_ERR_ONLY_DIFF_FILTER; goto done; }else{ s->filter = (const struct tdb_event_filter*)*ptr; s->filter_type = TRAIL_FILTER; } } } if (s->filter && (s->filter->options & TDB_FILTER_MATCH_NONE)){ /* no need to evaluate anything if the filter matches nothing */ err = 0; goto done; }else{ /* edge encoding: some fields may be inherited from previous events. Keep track what we have seen in the past. Start with NULL values. */ for (field = 1; field < db->num_fields; field++) s->previous_items[field] = tdb_make_item(field, 0); s->data = &db->trails.data[tdb_get_trail_offs(db, trail_id)]; trail_size = tdb_get_trail_offs(db, trail_id + 1) - tdb_get_trail_offs(db, trail_id); s->size = 8 * trail_size - read_bits(s->data, 0, 3); s->offset = 3; s->tstamp = db->min_timestamp; s->trail_id = trail_id; cursor->num_events_left = 0; cursor->next_event = s->events_buffer; return 0; } }else err = TDB_ERR_INVALID_TRAIL_ID; done: cursor->num_events_left = 0; cursor->next_event = NULL; s->size = 0; s->offset = 0; return err; } TDB_EXPORT uint64_t tdb_get_trail_length(tdb_cursor *cursor) { uint64_t count = 0; while (_tdb_cursor_next_batch(cursor)) count += cursor->num_events_left; return count; } TDB_EXPORT int _tdb_cursor_next_batch(tdb_cursor *cursor) { struct tdb_decode_state *s = cursor->state; const struct huff_codebook *codebook = (const struct huff_codebook*)s->db->codebook.data; const struct field_stats *fstats = s->db->field_stats; uint64_t *dst = (uint64_t*)s->events_buffer; uint64_t i = 0; uint64_t num_events = 0; tdb_field field; tdb_item item; const int edge_encoded = s->edge_encoded; /* decode the trail - exit early if destination buffer runs out of space */ while (s->offset < s->size && num_events < s->events_buffer_len){ /* Every event starts with a timestamp. 
Timestamp may be the first member of a bigram */ __uint128_t gram = huff_decode_value(codebook, s->data, &s->offset, fstats); uint64_t orig_i = i; uint64_t delta = tdb_item_val(HUFF_BIGRAM_TO_ITEM(gram)); uint64_t *num_items; /* events buffer format: [ [ timestamp | num_items | items ... ] tdb_event 1 [ timestamp | num_items | items ... ] tdb_event 2 ... [ timestamp | num_items | items ... ] tdb_event N ] note that events may have a varying number of items, due to edge encoding */ s->tstamp += delta; dst[i++] = s->tstamp; num_items = &dst[i++]; item = HUFF_BIGRAM_OTHER_ITEM(gram); /* handle a possible latter part of the first bigram */ if (item){ field = tdb_item_field(item); s->previous_items[field] = item; if (edge_encoded) dst[i++] = item; } /* decode one event: timestamp is followed by at most num_ofields field values */ while (s->offset < s->size){ uint64_t prev_offs = s->offset; gram = huff_decode_value(codebook, s->data, &s->offset, fstats); item = HUFF_BIGRAM_TO_ITEM(gram); field = tdb_item_field(item); if (field){ /* value may be either a unigram or a bigram */ do{ s->previous_items[field] = item; if (edge_encoded) dst[i++] = item; gram = item = HUFF_BIGRAM_OTHER_ITEM(gram); }while ((field = tdb_item_field(item))); }else{ /* we hit the next timestamp, take a step back and break */ s->offset = prev_offs; break; } } if (!s->filter || (s->filter->options & TDB_FILTER_MATCH_ALL) || event_satisfies_filter(s->previous_items, s->tstamp, s->filter->items, s->filter->count)){ /* no filter or filter matches, finalize the event */ if (!edge_encoded){ /* dump all the fields of this event in the result, if edge encoding is not requested */ for (field = 1; field < s->db->num_fields; field++) dst[i++] = s->previous_items[field]; } ++num_events; *num_items = (i - (orig_i + 2)); }else{ /* filter doesn't match - ignore this event */ i = orig_i; } } cursor->next_event = s->events_buffer; cursor->num_events_left = num_events; return num_events > 0 ? 
1: 0; } /* the following ensures that tdb_cursor_next() is exported to libtraildb.so this is "strategy 3" from http://www.greenend.org.uk/rjk/tech/inline.html */ TDB_EXPORT extern const tdb_event *tdb_cursor_next(tdb_cursor *cursor); TDB_EXPORT extern const tdb_event *tdb_cursor_peek(tdb_cursor *cursor); traildb-0.6+dfsg1/src/tdb_package.c0000600000175000017500000001154513106440271016461 0ustar czchenczchen#define _DEFAULT_SOURCE /* getline() */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include "tdb_package.h" struct pkg_toc{ char *fname; uint64_t offset; uint64_t size; }; static uint64_t toc_count_lines(FILE *f) { char *buf = NULL; size_t n = 0; uint64_t num_lines = 0; if (getline(&buf, &n, f) == -1) goto done; if (strcmp(buf, TDB_TAR_MAGIC)) goto done; while (1){ if (getline(&buf, &n, f) == -1){ num_lines = 0; goto done; } if (buf[0] == '\n') break; ++num_lines; } done: free(buf); return num_lines; } static tdb_error toc_parse(FILE *f, struct pkg_toc *toc, uint64_t num_lines) { char *buf = NULL; size_t n = 0; uint64_t i; char *saveptr = NULL; /* ignore magic line */ if (getline(&buf, &n, f) == -1) return TDB_ERR_IO_READ; for (i = 0; i < num_lines; i++){ if (getline(&buf, &n, f) == -1) return TDB_ERR_IO_READ; char *tok = strtok_r(buf, " ", &saveptr); if (tok == NULL) { return TDB_ERR_INVALID_PACKAGE; } toc[i].fname = strdup(tok); tok = strtok_r(NULL, " ", &saveptr); if (tok == NULL) { return TDB_ERR_INVALID_PACKAGE; } toc[i].offset = strtoull(tok, NULL, 10); tok = strtok_r(NULL, " ", &saveptr); if (tok == NULL) { return TDB_ERR_INVALID_PACKAGE; } toc[i].size = strtoull(tok, NULL, 10); } free(buf); return 0; } tdb_error open_package(tdb *db, const char *root) { int ret = 0; uint64_t num_lines; TDB_OPEN(db->package_handle, root, "r"); if (fseek(db->package_handle, TOC_FILE_OFFSET, SEEK_SET) == -1){ ret = TDB_ERR_INVALID_PACKAGE; goto done; } if (!(num_lines = toc_count_lines(db->package_handle))){ ret = TDB_ERR_INVALID_PACKAGE; goto done; } if (!(db->package_toc = calloc(num_lines + 1, sizeof(struct pkg_toc)))){ ret = TDB_ERR_NOMEM; goto done; } if (fseek(db->package_handle, TOC_FILE_OFFSET, SEEK_SET) == -1){ ret = TDB_ERR_INVALID_PACKAGE; goto done; } if ((ret = toc_parse(db->package_handle, db->package_toc, num_lines))) goto done; done: return ret; } void free_package(tdb *db) { if (db->package_toc){ struct pkg_toc *toc = (struct pkg_toc*)db->package_toc; uint64_t i; for (i = 0; toc[i].fname; i++) free(toc[i].fname); free(db->package_toc); } if (db->package_handle) fclose(db->package_handle); } static int toc_get(const tdb *db, const char *fname, uint64_t *offset, uint64_t *size) { const struct pkg_toc *toc = (const struct pkg_toc*)db->package_toc; uint64_t i; /* NOTE we find the matching file using a linear scan below. This shouldn't be a problem UNLESS there are a very large number of fields and lexicon.* files, in which case the list can get long. It should be easy to replace this with a faster search like JudySL, if needed. 
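
   As a rough size estimate (derived from the file lists in
   tdb_cons_package.c, not a hard guarantee): a packaged tdb contains
   the two header files, the five data files, tar.toc itself and one
   lexicon.<field> file per field, i.e. about N + 8 entries for N
   fields, so the scan stays short for typical schemas.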
*/ for (i = 0; toc[i].fname; i++) if (!strcmp(toc[i].fname, fname)){ *offset = toc[i].offset; *size = toc[i].size; return 0; } return -1; } FILE *package_fopen(const char *fname, const char *root __attribute__((unused)), const tdb *db) { uint64_t offset, size; if (toc_get(db, fname, &offset, &size)) return NULL; if (fseek(db->package_handle, (off_t)offset, SEEK_SET) == -1) return NULL; return db->package_handle; } int package_fclose(FILE *f __attribute__((unused))) { /* we don't want to close db->package_handle */ return 0; } int package_mmap(const char *fname, const char *root __attribute__((unused)), struct tdb_file *dst, const tdb *db) { /* we need to page-align offsets for mmap() and adjust data pointers accordingly. dst->mmap_size and dst->ptr correspond to the page-aligned values, dst->size and dst->data to the values containing the actual data. */ int fd = fileno(db->package_handle); uint64_t offset, shift; if (toc_get(db, fname, &offset, &dst->size)) return -1; shift = offset & ((uint64_t)(getpagesize() - 1)); dst->mmap_size = dst->size + shift; offset -= shift; dst->ptr = mmap(NULL, dst->mmap_size, PROT_READ, MAP_SHARED, fd, (off_t)offset); if (dst->ptr == MAP_FAILED) return -1; dst->data = &dst->ptr[shift]; return 0; } traildb-0.6+dfsg1/src/pqueue/0000700000175000017500000000000013106440271015365 5ustar czchenczchentraildb-0.6+dfsg1/src/pqueue/pqueue.c0000600000175000017500000001522413106440271017043 0ustar czchenczchen/* * Copyright (c) 2014, Volkan Yazıcı * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, this * list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright notice, * this list of conditions and the following disclaimer in the documentation * and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include #include #include #include "pqueue.h" #define left(i) ((i) << 1) #define right(i) (((i) << 1) + 1) #define parent(i) ((i) >> 1) pqueue_t * pqueue_init(size_t n, pqueue_cmp_pri_f cmppri, pqueue_get_pri_f getpri, pqueue_set_pri_f setpri, pqueue_get_pos_f getpos, pqueue_set_pos_f setpos) { pqueue_t *q; if (!(q = malloc(sizeof(pqueue_t)))) return NULL; /* Need to allocate n+1 elements since element 0 isn't used. 
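      The heap is stored 1-indexed so that the parent/child arithmetic
      in the macros above stays a plain shift: left(i) = 2*i,
      right(i) = 2*i + 1, parent(i) = i/2. For example, the node at
      position 3 has its children at positions 6 and 7 and its parent
      at position 1; position 0 is simply never used.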
*/ if (!(q->d = malloc((n + 1) * sizeof(void *)))) { free(q); return NULL; } q->size = 1; q->avail = q->step = (n+1); /* see comment above about n+1 */ q->cmppri = cmppri; q->setpri = setpri; q->getpri = getpri; q->getpos = getpos; q->setpos = setpos; return q; } void pqueue_reset(pqueue_t *q) { q->size = 1; } void pqueue_free(pqueue_t *q) { free(q->d); free(q); } size_t pqueue_size(pqueue_t *q) { /* queue element 0 exists but doesn't count since it isn't used. */ return (q->size - 1); } static void bubble_up(pqueue_t *q, size_t i) { size_t parent_node; void *moving_node = q->d[i]; pqueue_pri_t moving_pri = q->getpri(moving_node); for (parent_node = parent(i); ((i > 1) && q->cmppri(q->getpri(q->d[parent_node]), moving_pri)); i = parent_node, parent_node = parent(i)) { q->d[i] = q->d[parent_node]; q->setpos(q->d[i], i); } q->d[i] = moving_node; q->setpos(moving_node, i); } static size_t maxchild(pqueue_t *q, size_t i) { size_t child_node = left(i); if (child_node >= q->size) return 0; if ((child_node+1) < q->size && q->cmppri(q->getpri(q->d[child_node]), q->getpri(q->d[child_node+1]))) child_node++; /* use right child instead of left */ return child_node; } static void percolate_down(pqueue_t *q, size_t i) { size_t child_node; void *moving_node = q->d[i]; pqueue_pri_t moving_pri = q->getpri(moving_node); while ((child_node = maxchild(q, i)) && q->cmppri(moving_pri, q->getpri(q->d[child_node]))) { q->d[i] = q->d[child_node]; q->setpos(q->d[i], i); i = child_node; } q->d[i] = moving_node; q->setpos(moving_node, i); } int pqueue_insert(pqueue_t *q, void *d) { void *tmp; size_t i; size_t newsize; if (!q) return 1; /* allocate more memory if necessary */ if (q->size >= q->avail) { newsize = q->size + q->step; if (!(tmp = realloc(q->d, sizeof(void *) * newsize))) return 1; q->d = tmp; q->avail = newsize; } /* insert item */ i = q->size++; q->d[i] = d; bubble_up(q, i); return 0; } void pqueue_change_priority(pqueue_t *q, pqueue_pri_t new_pri, void *d) { size_t posn; pqueue_pri_t old_pri = q->getpri(d); q->setpri(d, new_pri); posn = q->getpos(d); if (q->cmppri(old_pri, new_pri)) bubble_up(q, posn); else percolate_down(q, posn); } int pqueue_remove(pqueue_t *q, void *d) { size_t posn = q->getpos(d); q->d[posn] = q->d[--q->size]; if (q->cmppri(q->getpri(d), q->getpri(q->d[posn]))) bubble_up(q, posn); else percolate_down(q, posn); return 0; } void * pqueue_pop(pqueue_t *q) { void *head; if (!q || q->size == 1) return NULL; head = q->d[1]; q->d[1] = q->d[--q->size]; percolate_down(q, 1); return head; } void * pqueue_peek(pqueue_t *q) { void *d; if (!q || q->size == 1) return NULL; d = q->d[1]; return d; } #if 0 void pqueue_dump(pqueue_t *q, FILE *out, pqueue_print_entry_f print) { int i; fprintf(stdout,"posn\tleft\tright\tparent\tmaxchild\t...\n"); for (i = 1; i < q->size ;i++) { fprintf(stdout, "%d\t%d\t%d\t%d\t%ul\t", i, left(i), right(i), parent(i), (unsigned int)maxchild(q, i)); print(out, q->d[i]); } } #endif static void set_pos(void *d, size_t val) { /* do nothing */ } static void set_pri(void *d, pqueue_pri_t pri) { /* do nothing */ } void pqueue_print(pqueue_t *q, FILE *out, pqueue_print_entry_f print) { pqueue_t *dup; void *e; dup = pqueue_init(q->size, q->cmppri, q->getpri, set_pri, q->getpos, set_pos); dup->size = q->size; dup->avail = q->avail; dup->step = q->step; memcpy(dup->d, q->d, (q->size * sizeof(void *))); while ((e = pqueue_pop(dup))) print(out, e); pqueue_free(dup); } #if 0 static int subtree_is_valid(pqueue_t *q, int pos) { if (left(pos) < q->size) { /* has a left child */ if 
(q->cmppri(q->getpri(q->d[pos]), q->getpri(q->d[left(pos)]))) return 0; if (!subtree_is_valid(q, left(pos))) return 0; } if (right(pos) < q->size) { /* has a right child */ if (q->cmppri(q->getpri(q->d[pos]), q->getpri(q->d[right(pos)]))) return 0; if (!subtree_is_valid(q, right(pos))) return 0; } return 1; } int pqueue_is_valid(pqueue_t *q) { return subtree_is_valid(q, 1); } #endif traildb-0.6+dfsg1/src/pqueue/LICENSE0000600000175000017500000000245713106440271016404 0ustar czchenczchenCopyright (c) 2014, Volkan Yazıcı All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. traildb-0.6+dfsg1/src/pqueue/README.md0000600000175000017500000000125313106440271016647 0ustar czchenczchen L oo.ooooo. .ooooo oo oooo oooo .ooooo. oooo oooo .ooooo. 888' `88b d88' `888 `888 `888 d88' `88b `888 `888 d88' `88b 888 888 888 888 888 888 888ooo888 888 888 888ooo888 I 888 888 888 888 888 888 888 .o 888 888 888 .o 888bod8P' `V8bod888 `V88V"V8P' `Y8bod8P' `V88V"V8P' `Y8bod8P' 888 888. B o888o 8P' " `libpqueue` is a generic priority queue (heap) implementation used by the Apache HTTP Server project. (Particularly, 2.2.14 release.) I just tidied up the source and API a little bit, introduced some minor functionality, etc. traildb-0.6+dfsg1/src/pqueue/pqueue.h0000600000175000017500000001302213106440271017042 0ustar czchenczchen/* * Copyright (c) 2014, Volkan Yazıcı * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, this * list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright notice, * this list of conditions and the following disclaimer in the documentation * and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /** * @file pqueue.h * @brief Priority Queue function declarations * * @{ */ #ifndef PQUEUE_H #define PQUEUE_H /** priority data type */ typedef unsigned long long pqueue_pri_t; /** callback functions to get/set/compare the priority of an element */ typedef pqueue_pri_t (*pqueue_get_pri_f)(void *a); typedef void (*pqueue_set_pri_f)(void *a, pqueue_pri_t pri); typedef int (*pqueue_cmp_pri_f)(pqueue_pri_t next, pqueue_pri_t curr); /** callback functions to get/set the position of an element */ typedef size_t (*pqueue_get_pos_f)(void *a); typedef void (*pqueue_set_pos_f)(void *a, size_t pos); /** debug callback function to print a entry */ typedef void (*pqueue_print_entry_f)(FILE *out, void *a); /** the priority queue handle */ typedef struct pqueue_t { size_t size; /**< number of elements in this queue */ size_t avail; /**< slots available in this queue */ size_t step; /**< growth stepping setting */ pqueue_cmp_pri_f cmppri; /**< callback to compare nodes */ pqueue_get_pri_f getpri; /**< callback to get priority of a node */ pqueue_set_pri_f setpri; /**< callback to set priority of a node */ pqueue_get_pos_f getpos; /**< callback to get position of a node */ pqueue_set_pos_f setpos; /**< callback to set position of a node */ void **d; /**< The actualy queue in binary heap form */ } pqueue_t; /** * initialize the queue * * @param n the initial estimate of the number of queue items for which memory * should be preallocated * @param cmppri The callback function to run to compare two elements * This callback should return 0 for 'lower' and non-zero * for 'higher', or vice versa if reverse priority is desired * @param setpri the callback function to run to assign a score to an element * @param getpri the callback function to run to set a score to an element * @param getpos the callback function to get the current element's position * @param setpos the callback function to set the current element's position * * @return the handle or NULL for insufficent memory */ pqueue_t * pqueue_init(size_t n, pqueue_cmp_pri_f cmppri, pqueue_get_pri_f getpri, pqueue_set_pri_f setpri, pqueue_get_pos_f getpos, pqueue_set_pos_f setpos); /* added for TrailDB */ void pqueue_reset(pqueue_t *q); /** * free all memory used by the queue * @param q the queue */ void pqueue_free(pqueue_t *q); /** * return the size of the queue. * @param q the queue */ size_t pqueue_size(pqueue_t *q); /** * insert an item into the queue. * @param q the queue * @param d the item * @return 0 on success */ int pqueue_insert(pqueue_t *q, void *d); /** * move an existing entry to a different priority * @param q the queue * @param new_pri the new priority * @param d the entry */ void pqueue_change_priority(pqueue_t *q, pqueue_pri_t new_pri, void *d); /** * pop the highest-ranking item from the queue. * @param q the queue * @return NULL on error, otherwise the entry */ void *pqueue_pop(pqueue_t *q); /** * remove an item from the queue. 
* @param q the queue * @param d the entry * @return 0 on success */ int pqueue_remove(pqueue_t *q, void *d); /** * access highest-ranking item without removing it. * @param q the queue * @return NULL on error, otherwise the entry */ void *pqueue_peek(pqueue_t *q); /** * print the queue * @internal * DEBUG function only * @param q the queue * @param out the output handle * @param the callback function to print the entry */ void pqueue_print(pqueue_t *q, FILE *out, pqueue_print_entry_f print); /** * dump the queue and it's internal structure * @internal * debug function only * @param q the queue * @param out the output handle * @param the callback function to print the entry */ void pqueue_dump(pqueue_t *q, FILE *out, pqueue_print_entry_f print); /** * checks that the pq is in the right order, etc * @internal * debug function only * @param q the queue */ int pqueue_is_valid(pqueue_t *q); #endif /* PQUEUE_H */ /** @} */ traildb-0.6+dfsg1/src/tdb.c0000600000175000017500000007302713106440271015011 0ustar czchenczchen#define _DEFAULT_SOURCE /* for getline() */ #define _BSD_SOURCE /* for madvise() */ #define _GNU_SOURCE /* for getline() - older compilers? */ #include #include #include #include #include #include #include #include #undef JUDYERROR #define JUDYERROR(CallerFile, CallerLine, JudyFunc, JudyErrno, JudyErrID) \ { \ if ((JudyErrno) == JU_ERRNO_NOMEM) \ goto out_of_memory; \ } #include #include "tdb_internal.h" #include "tdb_error.h" #include "tdb_io.h" #include "tdb_huffman.h" #include "tdb_package.h" #define DEFAULT_OPT_CURSOR_EVENT_BUFFER_SIZE 1000 struct io_ops{ FILE* (*fopen)(const char *fname, const char *root, const tdb *db); int (*fclose)(FILE *f); int (*mmap)(const char *fname, const char *root, struct tdb_file *dst, const tdb *db); }; int file_mmap(const char *fname, const char *root, struct tdb_file *dst, const tdb *db __attribute__((unused))) { char path[TDB_MAX_PATH_SIZE]; int fd = 0; int ret = 0; struct stat stats; if (root){ TDB_PATH(path, "%s/%s", root, fname); }else{ TDB_PATH(path, "%s", fname); } if ((fd = open(path, O_RDONLY)) == -1) return -1; if (fstat(fd, &stats)){ ret = -1; goto done; } dst->size = dst->mmap_size = (uint64_t)stats.st_size; dst->data = dst->ptr = MAP_FAILED; if (dst->size > 0) dst->ptr = mmap(NULL, dst->size, PROT_READ, MAP_SHARED, fd, 0); if (dst->ptr == MAP_FAILED){ ret = -1; goto done; } dst->data = dst->ptr; done: if (fd) close(fd); return ret; } static FILE *file_fopen(const char *fname, const char *root, const tdb *db __attribute__((unused))) { char path[TDB_MAX_PATH_SIZE]; int ret = 0; FILE *f; TDB_PATH(path, "%s/%s", root, fname); TDB_OPEN(f, path, "r"); done: if (ret) return NULL; else return f; } static int file_fclose(FILE *f) { return fclose(f); } void tdb_lexicon_read(const tdb *db, tdb_field field, struct tdb_lexicon *lex) { lex->version = db->version; lex->data = db->lexicons[field - 1].data; lex->size = 0; if (db->lexicons[field - 1].size > UINT32_MAX){ lex->width = 8; lex->toc.toc64 = (const uint64_t*)&lex->data[lex->width]; memcpy(&lex->size, lex->data, 8); }else{ lex->width = 4; lex->toc.toc32 = (const uint32_t*)&lex->data[lex->width]; memcpy(&lex->size, lex->data, 4); } } static inline uint64_t tdb_lex_offset(const struct tdb_lexicon *lex, tdb_val i) { if (lex->width == 4) return lex->toc.toc32[i]; else return lex->toc.toc64[i]; } const char *tdb_lexicon_get(const struct tdb_lexicon *lex, tdb_val i, uint64_t *length) { if (lex->version == TDB_VERSION_V0){ /* backwards compatibility with 0-terminated strings in v0 */ *length = 
(uint64_t)strlen(&lex->data[tdb_lex_offset(lex, i)]); }else *length = tdb_lex_offset(lex, i + 1) - tdb_lex_offset(lex, i); return &lex->data[tdb_lex_offset(lex, i)]; } static tdb_error fields_open(tdb *db, const char *root, struct io_ops *io) { char path[TDB_MAX_PATH_SIZE]; FILE *f = NULL; char *line = NULL; size_t n = 0; tdb_field i, num_ofields = 0; int ret = 0; int ok = 0; if (!(f = io->fopen("fields", root, db))) return TDB_ERR_INVALID_FIELDS_FILE; while (getline(&line, &n, f) != -1){ if (line[0] == '\n'){ /* V0 tdbs don't have the extra newline, they should read until EOF */ ok = 1; break; } ++num_ofields; } if (!(ok || feof(f))){ /* we can get here if malloc fails inside getline() */ ret = TDB_ERR_NOMEM; goto done; } db->num_fields = num_ofields + 1U; if (!(db->field_names = calloc(db->num_fields, sizeof(char*)))){ ret = TDB_ERR_NOMEM; goto done; } if (num_ofields){ if (!(db->lexicons = calloc(num_ofields, sizeof(struct tdb_file)))){ ret = TDB_ERR_NOMEM; goto done; } }else db->lexicons = NULL; /* io_ops doesn't support rewind(), so we have to close and reopen */ io->fclose(f); if (!(f = io->fopen("fields", root, db))){ ret = TDB_ERR_IO_OPEN; goto done; } db->field_names[0] = "time"; for (i = 1; getline(&line, &n, f) != -1 && i < db->num_fields; i++){ line[strlen(line) - 1] = 0; /* let's be paranoid and sanity check the fieldname again */ if (is_fieldname_invalid(line)){ ret = TDB_ERR_INVALID_FIELDS_FILE; goto done; } if (!(db->field_names[i] = strdup(line))){ ret = TDB_ERR_NOMEM; goto done; } TDB_PATH(path, "lexicon.%s", line); if (io->mmap(path, root, &db->lexicons[i - 1], db)){ ret = TDB_ERR_INVALID_LEXICON_FILE; goto done; } } if (i != db->num_fields){ ret = TDB_ERR_INVALID_FIELDS_FILE; goto done; } done: free(line); if (f) io->fclose(f); return ret; } static tdb_error init_field_stats(tdb *db) { uint64_t *field_cardinalities = NULL; tdb_field i; int ret = 0; if (db->num_fields > 1){ if (!(field_cardinalities = calloc(db->num_fields - 1, 8))) return TDB_ERR_NOMEM; } for (i = 1; i < db->num_fields; i++){ struct tdb_lexicon lex; tdb_lexicon_read(db, i, &lex); field_cardinalities[i - 1] = lex.size; } if (!(db->field_stats = huff_field_stats(field_cardinalities, db->num_fields, db->max_timestamp_delta))) ret = TDB_ERR_NOMEM; free(field_cardinalities); return ret; } static tdb_error read_version(tdb *db, const char *root, struct io_ops *io) { FILE *f; int ret = 0; if (!(f = io->fopen("version", root, db))) db->version = 0; else{ if (fscanf(f, "%"PRIu64, &db->version) != 1) ret = TDB_ERR_INVALID_VERSION_FILE; else if (db->version > TDB_VERSION_LATEST) ret = TDB_ERR_INCOMPATIBLE_VERSION; io->fclose(f); } return ret; } static tdb_error read_info(tdb *db, const char *root, struct io_ops *io) { FILE *f; int ret = 0; if (!(f = io->fopen("info", root, db))) return TDB_ERR_INVALID_INFO_FILE; if (fscanf(f, "%"PRIu64" %"PRIu64" %"PRIu64" %"PRIu64" %"PRIu64, &db->num_trails, &db->num_events, &db->min_timestamp, &db->max_timestamp, &db->max_timestamp_delta) != 5) ret = TDB_ERR_INVALID_INFO_FILE; io->fclose(f); return ret; } TDB_EXPORT tdb *tdb_init(void) { return calloc(1, sizeof(tdb)); } TDB_EXPORT tdb_error tdb_open(tdb *db, const char *orig_root) { char root[TDB_MAX_PATH_SIZE]; struct stat stats; tdb_error ret = 0; struct io_ops io; /* by handling the "db == NULL" case here gracefully, we allow the return value of tdb_init() to be used unchecked like here: int err; tdb *db = tdb_init(); if ((err = tdb_open(db, path))) printf("Opening tbd failed: %s", tdb_error(err)); */ if (!db) return 
TDB_ERR_HANDLE_IS_NULL; if (db->num_fields) return TDB_ERR_HANDLE_ALREADY_OPENED; /* set default options */ db->opt_cursor_event_buffer_size = DEFAULT_OPT_CURSOR_EVENT_BUFFER_SIZE; TDB_PATH(root, "%s", orig_root); if (stat(root, &stats) == -1){ TDB_PATH(root, "%s.tdb", orig_root); if (stat(root, &stats) == -1){ ret = TDB_ERR_IO_OPEN; goto done; } } if (S_ISDIR(stats.st_mode)){ /* open tdb in a directory */ io.fopen = file_fopen; io.fclose = file_fclose; io.mmap = file_mmap; }else{ /* open tdb in a tarball */ io.fopen = package_fopen; io.fclose = package_fclose; io.mmap = package_mmap; if ((ret = open_package(db, root))) goto done; } if ((ret = read_info(db, root, &io))) goto done; if ((ret = read_version(db, root, &io))) goto done; if ((ret = fields_open(db, root, &io))) goto done; if ((ret = init_field_stats(db))) goto done; if (db->num_trails) { /* backwards compatibility: UUIDs used to be called cookies */ if (db->version == TDB_VERSION_V0){ if (io.mmap("cookies", root, &db->uuids, db)){ ret = TDB_ERR_INVALID_UUIDS_FILE; goto done; } }else{ if (io.mmap("uuids", root, &db->uuids, db)){ ret = TDB_ERR_INVALID_UUIDS_FILE; goto done; } } if (io.mmap("trails.codebook", root, &db->codebook, db)){ ret = TDB_ERR_INVALID_CODEBOOK_FILE; goto done; } if (db->version == TDB_VERSION_V0) if ((ret = huff_convert_v0_codebook(&db->codebook))) goto done; if (io.mmap("trails.toc", root, &db->toc, db)){ ret = TDB_ERR_INVALID_TRAILS_FILE; goto done; } if (io.mmap("trails.data", root, &db->trails, db)){ ret = TDB_ERR_INVALID_TRAILS_FILE; goto done; } } done: free_package(db); return ret; } static void tdb_madvise(const tdb *db, int advice) { if (db && db->num_fields > 0){ tdb_field i; for (i = 0; i < db->num_fields - 1; i++) madvise(db->lexicons[i].ptr, db->lexicons[i].mmap_size, advice); madvise(db->uuids.ptr, db->uuids.mmap_size, advice); madvise(db->codebook.ptr, db->codebook.mmap_size, advice); madvise(db->toc.ptr, db->toc.mmap_size, advice); madvise(db->trails.ptr, db->trails.mmap_size, advice); } } TDB_EXPORT void tdb_willneed(const tdb *db) { tdb_madvise(db, MADV_WILLNEED); } TDB_EXPORT void tdb_dontneed(const tdb *db) { tdb_madvise(db, MADV_DONTNEED); } TDB_EXPORT void tdb_close(tdb *db) { if (db){ tdb_field i; Word_t tmp; if (db->num_fields > 0){ for (i = 0; i < db->num_fields - 1; i++){ free(db->field_names[i + 1]); if (db->lexicons[i].ptr) munmap(db->lexicons[i].ptr, db->lexicons[i].mmap_size); } } if (db->uuids.ptr) munmap(db->uuids.ptr, db->uuids.mmap_size); if (db->codebook.ptr) munmap(db->codebook.ptr, db->codebook.mmap_size); if (db->toc.ptr) munmap(db->toc.ptr, db->toc.mmap_size); if (db->trails.ptr) munmap(db->trails.ptr, db->trails.mmap_size); JLFA(tmp, db->opt_trail_event_filters); free(db->lexicons); free(db->field_names); free(db->field_stats); free(db); } out_of_memory: return; } TDB_EXPORT uint64_t tdb_lexicon_size(const tdb *db, tdb_field field) { if (field == 0 || field >= db->num_fields) return 0; else{ struct tdb_lexicon lex; tdb_lexicon_read(db, field, &lex); /* +1 refers to the implicit NULL value (empty string) */ return lex.size + 1; } } TDB_EXPORT tdb_error tdb_get_field(const tdb *db, const char *field_name, tdb_field *field) { tdb_field i; for (i = 0; i < db->num_fields; i++) if (!strcmp(field_name, db->field_names[i])){ *field = i; return 0; } return TDB_ERR_UNKNOWN_FIELD; } TDB_EXPORT const char *tdb_get_field_name(const tdb *db, tdb_field field) { if (field < db->num_fields) return db->field_names[field]; return NULL; } TDB_EXPORT tdb_item tdb_get_item(const tdb *db, 
tdb_field field, const char *value, uint64_t value_length) { if (!value_length) /* NULL value for this field */ return tdb_make_item(field, 0); else if (field == 0 || field >= db->num_fields) return 0; else{ struct tdb_lexicon lex; tdb_val i; tdb_lexicon_read(db, field, &lex); for (i = 0; i < lex.size; i++){ uint64_t length; const char *token = tdb_lexicon_get(&lex, i, &length); if (length == value_length && !memcmp(token, value, length)) return tdb_make_item(field, i + 1); } return 0; } } TDB_EXPORT const char *tdb_get_value(const tdb *db, tdb_field field, tdb_val val, uint64_t *value_length) { if (field == 0 || field >= db->num_fields) return NULL; else if (!val){ /* a valid NULL value for a valid field */ *value_length = 0; return ""; }else{ struct tdb_lexicon lex; tdb_lexicon_read(db, field, &lex); if ((val - 1) < lex.size) return tdb_lexicon_get(&lex, val - 1, value_length); else return NULL; } } TDB_EXPORT const char *tdb_get_item_value(const tdb *db, tdb_item item, uint64_t *value_length) { return tdb_get_value(db, tdb_item_field(item), tdb_item_val(item), value_length); } TDB_EXPORT const uint8_t *tdb_get_uuid(const tdb *db, uint64_t trail_id) { if (trail_id < db->num_trails) return (const uint8_t *)&db->uuids.data[trail_id * 16]; return NULL; } TDB_EXPORT tdb_error tdb_get_trail_id(const tdb *db, const uint8_t *uuid, uint64_t *trail_id) { __uint128_t cmp, key; memcpy(&key, uuid, 16); if (db->version == TDB_VERSION_V0){ /* V0 doesn't guarantee that UUIDs would be ordered */ uint64_t idx; for (idx = 0; idx < db->num_trails; idx++){ memcpy(&cmp, &db->uuids.data[idx * 16], 16); if (key == cmp){ *trail_id = idx; return 0; } } }else{ /* note: TDB_MAX_NUM_TRAILS < 2^63, so we can safely use int64_t */ int64_t idx; int64_t left = 0; int64_t right = ((int64_t)db->num_trails) - 1LL; while (left <= right){ /* compute midpoint in an overflow-safe manner (see Wikipedia) */ idx = left + ((right - left) / 2); memcpy(&cmp, &db->uuids.data[idx * 16], 16); if (cmp == key){ *trail_id = (uint64_t)idx; return 0; }else if (cmp > key) right = idx - 1; else left = idx + 1; } } return TDB_ERR_UNKNOWN_UUID; } TDB_EXPORT const char *tdb_error_str(tdb_error errcode) { switch (errcode){ case TDB_ERR_OK: return "TDB_ERR_OK"; case TDB_ERR_NOMEM: return "TDB_ERR_NOMEM"; case TDB_ERR_PATH_TOO_LONG: return "TDB_ERR_PATH_TOO_LONG"; case TDB_ERR_UNKNOWN_FIELD: return "TDB_ERR_UNKNOWN_FIELD"; case TDB_ERR_UNKNOWN_UUID: return "TDB_ERR_UNKNOWN_UUID"; case TDB_ERR_INVALID_TRAIL_ID: return "TDB_ERR_INVALID_TRAIL_ID"; case TDB_ERR_HANDLE_IS_NULL: return "TDB_ERR_HANDLE_IS_NULL"; case TDB_ERR_HANDLE_ALREADY_OPENED: return "TDB_ERR_HANDLE_ALREADY_OPENED"; case TDB_ERR_UNKNOWN_OPTION: return "TDB_ERR_UNKNOWN_OPTION"; case TDB_ERR_INVALID_OPTION_VALUE: return "TDB_ERR_INVALID_OPTION_VALUE"; case TDB_ERR_INVALID_UUID: return "TDB_ERR_INVALID_UUID"; case TDB_ERR_IO_OPEN: return "TDB_ERR_IO_OPEN"; case TDB_ERR_IO_CLOSE: return "TDB_ERR_IO_CLOSE"; case TDB_ERR_IO_WRITE: return "TDB_ERR_IO_WRITE"; case TDB_ERR_IO_READ: return "TDB_ERR_IO_READ"; case TDB_ERR_IO_TRUNCATE: return "TDB_ERR_IO_TRUNCATE"; case TDB_ERR_IO_PACKAGE: return "TDB_ERR_IO_PACKAGE"; case TDB_ERR_INVALID_INFO_FILE: return "TDB_ERR_INVALID_INFO_FILE"; case TDB_ERR_INVALID_VERSION_FILE: return "TDB_ERR_INVALID_VERSION_FILE"; case TDB_ERR_INCOMPATIBLE_VERSION: return "TDB_ERR_INCOMPATIBLE_VERSION"; case TDB_ERR_INVALID_FIELDS_FILE: return "TDB_ERR_INVALID_FIELDS_FILE"; case TDB_ERR_INVALID_UUIDS_FILE: return "TDB_ERR_INVALID_UUIDS_FILE"; case 
TDB_ERR_INVALID_CODEBOOK_FILE: return "TDB_ERR_INVALID_CODEBOOK_FILE"; case TDB_ERR_INVALID_TRAILS_FILE: return "TDB_ERR_INVALID_TRAILS_FILE"; case TDB_ERR_INVALID_LEXICON_FILE: return "TDB_ERR_INVALID_LEXICON_FILE"; case TDB_ERR_INVALID_PACKAGE: return "TDB_ERR_INVALID_PACKAGE"; case TDB_ERR_TOO_MANY_FIELDS: return "TDB_ERR_TOO_MANY_FIELDS"; case TDB_ERR_DUPLICATE_FIELDS: return "TDB_ERR_DUPLICATE_FIELDS"; case TDB_ERR_INVALID_FIELDNAME: return "TDB_ERR_INVALID_FIELDNAME"; case TDB_ERR_TOO_MANY_TRAILS: return "TDB_ERR_TOO_MANY_TRAILS"; case TDB_ERR_VALUE_TOO_LONG: return "TDB_ERR_VALUE_TOO_LONG"; case TDB_ERR_APPEND_FIELDS_MISMATCH: return "TDB_ERR_APPEND_FIELDS_MISMATCH"; case TDB_ERR_LEXICON_TOO_LARGE: return "TDB_ERR_LEXICON_TOO_LARGE"; case TDB_ERR_TIMESTAMP_TOO_LARGE: return "TDB_ERR_TIMESTAMP_TOO_LARGE"; case TDB_ERR_TRAIL_TOO_LONG: return "TDB_ERR_TRAIL_TOO_LONG"; case TDB_ERR_ONLY_DIFF_FILTER: return "TDB_ERR_ONLY_DIFF_FILTER"; case TDB_ERR_NO_SUCH_ITEM: return "TDB_ERR_NO_SUCH_ITEM"; case TDB_ERR_INVALID_RANGE: return "TDB_ERR_INVALID_RANGE"; case TDB_ERR_INCORRECT_TERM_TYPE: return "TDB_ERR_INCORRECT_TERM_TYPE"; default: return "Unknown error"; } } TDB_EXPORT uint64_t tdb_num_trails(const tdb *db) { return db->num_trails; } TDB_EXPORT uint64_t tdb_num_events(const tdb *db) { return db->num_events; } TDB_EXPORT uint64_t tdb_num_fields(const tdb *db) { return db->num_fields; } TDB_EXPORT uint64_t tdb_min_timestamp(const tdb *db) { return db->min_timestamp; } TDB_EXPORT uint64_t tdb_max_timestamp(const tdb *db) { return db->max_timestamp; } TDB_EXPORT uint64_t tdb_version(const tdb *db) { return db->version; } TDB_EXPORT tdb_error tdb_set_opt(tdb *db, tdb_opt_key key, tdb_opt_value value) { /* NOTE: If a new option can cause the db return a subset of events, like TDB_OPT_ONLY_DIFF_ITEMS or TDB_OPT_EVENT_FILTER, you need to add them to the list in tdb_cons_append(). */ switch (key){ case TDB_OPT_ONLY_DIFF_ITEMS: db->opt_edge_encoded = value.value ? 1: 0; return 0; case TDB_OPT_EVENT_FILTER: db->opt_event_filter = (const struct tdb_event_filter*)value.ptr; return 0; case TDB_OPT_CURSOR_EVENT_BUFFER_SIZE: if (value.value > 0){ db->opt_cursor_event_buffer_size = value.value; return 0; }else return TDB_ERR_INVALID_OPTION_VALUE; default: return TDB_ERR_UNKNOWN_OPTION; } } TDB_EXPORT tdb_error tdb_get_opt(tdb *db, tdb_opt_key key, tdb_opt_value *value) { switch (key){ case TDB_OPT_ONLY_DIFF_ITEMS: *value = db->opt_edge_encoded ? 
TDB_TRUE: TDB_FALSE; return 0; case TDB_OPT_EVENT_FILTER: value->ptr = db->opt_event_filter; return 0; case TDB_OPT_CURSOR_EVENT_BUFFER_SIZE: value->value = db->opt_cursor_event_buffer_size; return 0; default: return TDB_ERR_UNKNOWN_OPTION; } } TDB_EXPORT tdb_error tdb_set_trail_opt(tdb *db, uint64_t trail_id, tdb_opt_key key, tdb_opt_value value) { Word_t *ptr; int tmp; if (trail_id >= db->num_trails) return TDB_ERR_INVALID_TRAIL_ID; switch (key){ case TDB_OPT_EVENT_FILTER: if (value.ptr){ JLI(ptr, db->opt_trail_event_filters, trail_id); *ptr = (Word_t)value.ptr; }else{ JLD(tmp, db->opt_trail_event_filters, trail_id); } return 0; default: return TDB_ERR_UNKNOWN_OPTION; } out_of_memory: return TDB_ERR_NOMEM; } tdb_error tdb_get_trail_opt(tdb *db, uint64_t trail_id, tdb_opt_key key, tdb_opt_value *value) { Word_t *ptr; if (trail_id >= db->num_trails) return TDB_ERR_INVALID_TRAIL_ID; switch (key){ case TDB_OPT_EVENT_FILTER: JLG(ptr, db->opt_trail_event_filters, trail_id); if (ptr) value->ptr = (const void*)*ptr; else value->ptr = NULL; return 0; default: return TDB_ERR_UNKNOWN_OPTION; } } TDB_EXPORT struct tdb_event_filter *tdb_event_filter_new(void) { struct tdb_event_filter *f = calloc(1, sizeof(struct tdb_event_filter)); if (f){ f->size = 5; if (!(f->items = calloc(1, f->size * sizeof(tdb_item)))){ free(f); return NULL; } f->count = 1; f->clause_len_idx = 0; } return f; } TDB_EXPORT struct tdb_event_filter *tdb_event_filter_new_match_all(void) { struct tdb_event_filter *f = tdb_event_filter_new(); if (f) f->options = TDB_FILTER_MATCH_ALL; return f; } TDB_EXPORT struct tdb_event_filter *tdb_event_filter_new_match_none(void) { struct tdb_event_filter *f = tdb_event_filter_new(); if (f) f->options = TDB_FILTER_MATCH_NONE; return f; } static tdb_error ensure_filter_size(struct tdb_event_filter *filter) { /* ensure we can fit the largest term (time range) in the array */ if (filter->count + 3 >= filter->size){ filter->size *= 2; filter->items = realloc(filter->items, filter->size * sizeof(tdb_item)); if (!filter->items) return TDB_ERR_NOMEM; } return TDB_ERR_OK; } TDB_EXPORT tdb_error tdb_event_filter_add_term(struct tdb_event_filter *filter, tdb_item term, int is_negative) { tdb_error ret; if ((ret = ensure_filter_size(filter))) return ret; else{ filter->items[filter->count++] = (is_negative ? 
TDB_EVENT_NEGATED: 0); filter->items[filter->count++] = term; filter->items[filter->clause_len_idx] += 2; return TDB_ERR_OK; } } TDB_EXPORT tdb_error tdb_event_filter_add_time_range(struct tdb_event_filter *filter, uint64_t start_time, uint64_t end_time) { if (end_time <= start_time) return TDB_ERR_INVALID_RANGE; tdb_error ret; if ((ret = ensure_filter_size(filter))) return ret; uint64_t query_flags = TDB_EVENT_TIME_RANGE; filter->items[filter->count++] = query_flags; filter->items[filter->count++] = start_time; filter->items[filter->count++] = end_time; filter->items[filter->clause_len_idx] += 3; return TDB_ERR_OK; } TDB_EXPORT tdb_error tdb_event_filter_new_clause(struct tdb_event_filter *filter) { tdb_error ret; if ((ret = ensure_filter_size(filter))) return ret; else{ filter->clause_len_idx = filter->count++; filter->items[filter->clause_len_idx] = 0; return TDB_ERR_OK; } } TDB_EXPORT void tdb_event_filter_free(struct tdb_event_filter *filter) { if(filter){ free(filter->items); free(filter); } } TDB_EXPORT tdb_error tdb_event_filter_get_term_type(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t term_index, tdb_event_filter_term_type *term_type) { *term_type = TDB_EVENT_FILTER_UNKNOWN_TERM; uint64_t clause, i; for (clause = 0, i = 0; clause < clause_index; clause++){ i += filter->items[i] + 1; if (i == filter->count) { return TDB_ERR_NO_SUCH_ITEM; } } uint64_t clause_len = filter->items[i++]; uint64_t next_clause_idx = i + clause_len; uint64_t current_item_idx = 0; while (i < next_clause_idx) { if (current_item_idx == term_index) { *term_type = (filter->items[i] & TDB_EVENT_TIME_RANGE) ? TDB_EVENT_FILTER_TIME_RANGE_TERM : TDB_EVENT_FILTER_MATCH_TERM; return TDB_ERR_OK; } i += filter->items[i] & TDB_EVENT_TIME_RANGE ? 3 : 2; current_item_idx++; } return TDB_ERR_NO_SUCH_ITEM; } TDB_EXPORT tdb_error tdb_event_filter_get_item(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t item_index, tdb_item *item, int *is_negative) { tdb_error ret = TDB_ERR_NO_SUCH_ITEM; uint64_t clause, i; for (clause = 0, i = 0; clause < clause_index; clause++){ i += filter->items[i] + 1; if (i == filter->count) { goto invalid; } } uint64_t clause_len = filter->items[i++]; uint64_t next_clause_idx = i + clause_len; uint64_t current_item_idx = 0; while (i < next_clause_idx) { if (current_item_idx == item_index) { if (filter->items[i] & TDB_EVENT_TIME_RANGE) { ret = TDB_ERR_INCORRECT_TERM_TYPE; goto invalid; } *is_negative = filter->items[i] & TDB_EVENT_NEGATED ? 1 : 0; *item = filter->items[i + 1]; return TDB_ERR_OK; } i += filter->items[i] & TDB_EVENT_TIME_RANGE ? 3 : 2; current_item_idx++; } invalid: *item = 0; *is_negative = 0; return ret; } TDB_EXPORT tdb_error tdb_event_filter_get_time_range(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t item_index, uint64_t *start_time, uint64_t *end_time) { *start_time = 0; *end_time = 0; uint64_t clause, i; for (clause = 0, i = 0; clause < clause_index; clause++){ i += filter->items[i] + 1; if (i == filter->count) { return TDB_ERR_NO_SUCH_ITEM; } } uint64_t clause_len = filter->items[i++]; uint64_t next_clause_idx = i + clause_len; uint64_t current_item_idx = 0; while (i < next_clause_idx) { if (current_item_idx == item_index) { if (!(filter->items[i] & TDB_EVENT_TIME_RANGE)) { return TDB_ERR_INCORRECT_TERM_TYPE; } *start_time = filter->items[i + 1]; *end_time = filter->items[i + 2]; return TDB_ERR_OK; } i += filter->items[i] & TDB_EVENT_TIME_RANGE ? 
3 : 2; current_item_idx++; } return TDB_ERR_NO_SUCH_ITEM; } TDB_EXPORT uint64_t tdb_event_filter_num_clauses( const struct tdb_event_filter *filter) { uint64_t num_clauses, i; for (num_clauses = 0, i = 0; i < filter->count; num_clauses++) i += filter->items[i] + 1; return num_clauses; } TDB_EXPORT tdb_error tdb_event_filter_num_terms(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t *num_terms) { uint64_t clause, i; for (clause = 0, i = 0; clause < clause_index; clause++){ i += filter->items[i] + 1; if (i == filter->count) { return TDB_ERR_NO_SUCH_ITEM; } } *num_terms = 0; uint64_t clause_len = filter->items[i++]; uint64_t next_clause_idx = i + clause_len; while (i < next_clause_idx) { i += filter->items[i] & TDB_EVENT_TIME_RANGE ? 3 : 2; (*num_terms)++; } return TDB_ERR_OK; } traildb-0.6+dfsg1/src/arena.c0000600000175000017500000000234413106440271015320 0ustar czchenczchen #include #include #include #include "arena.h" #include "tdb_io.h" int arena_flush(const struct arena *a) { int ret = 0; if (a->fd && a->next){ uint64_t size = (((a->next - 1) & (ARENA_DISK_BUFFER - 1)) + 1) * (uint64_t)a->item_size; TDB_WRITE(a->fd, a->data, size); } done: return ret; } void *arena_add_item(struct arena *a) { if (a->failed) return NULL; if (a->fd){ if (a->size == 0){ a->size = ARENA_DISK_BUFFER; if (!(a->data = malloc(a->item_size * (uint64_t)a->size))){ a->failed = 1; return NULL; } }else if ((a->next & (ARENA_DISK_BUFFER - 1)) == 0){ if (arena_flush(a)) return NULL; } return a->data + a->item_size * (a->next++ & (ARENA_DISK_BUFFER - 1)); }else{ if (a->next >= a->size){ a->size += a->arena_increment ? a->arena_increment: ARENA_INCREMENT; if (!(a->data = realloc(a->data, a->item_size * (uint64_t)a->size))){ a->failed = 1; return NULL; } } return a->data + a->item_size * a->next++; } } traildb-0.6+dfsg1/src/tdb_limits.h0000600000175000017500000000312513106440271016367 0ustar czchenczchen #ifndef __TDB_LIMITS_H__ #define __TDB_LIMITS_H__ #include /* these are kept in stack, so they shouldn't be overly large */ #define TDB_MAX_PATH_SIZE 2048 #define TDB_MAX_FIELDNAME_LENGTH 512 /* MAX_NUM_TRAILS * 16 must fit in off_t (long) type */ #define TDB_MAX_NUM_TRAILS ((1LLU << 59) - 1) /* we need bit-level offsets to trails: At worst each item takes 64 bits, so the theoretical max is 2^64 / 2^6 = 2^58. To make things a bit safer, we set the max to 2^50. */ #define TDB_MAX_TRAIL_LENGTH ((1LLU << 50) - 1) /* re: fields and values below, see tdb_types.h */ /* re: -2, one field is always the special 'time' field */ #define TDB_MAX_NUM_FIELDS ((1LLU << 14) - 2) /* re: -2, one value is always the special NULL value */ #define TDB_MAX_NUM_VALUES ((1LLU << 40) - 2) /* timestamps have less future proofing than values, so TBD_MAX_TIMEDELTA can be higher than TDB_MAX_NUM_VALUES, see tdb_types.h for details */ #define TDB_MAX_TIMEDELTA ((1LLU << 47) - 1) /* 32-bit narrow items */ #define TDB_FIELD32_MAX 127 #define TDB_VAL32_MAX ((1LLU << 24) - 1) /* MAX_LEXICON_SIZE must fit in off_t type */ #define TDB_MAX_LEXICON_SIZE (1LLU << 59) /* TDB_MAX_VALUE_SIZE < MAX_LEXICON_SIZE - 16 */ #define TDB_MAX_VALUE_SIZE (1LLU << 58) /* Support a character set that allows easy urlencoding. These characters are used in filenames, so better to be extra paranoid. 
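As a concrete illustration (example names only, not part of the original header): field names are embedded verbatim in on-disk filenames such as "<root>/lexicon.<fieldname>" by store_lexicons() in tdb_cons.c, so a name like "user_agent" is accepted, while "user agent" (contains a space) or the reserved name "time" is rejected by is_fieldname_invalid(). The whitelist below is the complete set of accepted characters.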
*/ #define TDB_FIELDNAME_CHARS "_-%"\ "abcdefghijklmnopqrstuvwxyz"\ "ABCDEFGHIJKLMNOPQRSTUVWXYZ"\ "0123456789" #endif /* TDB_LIMITS */ traildb-0.6+dfsg1/src/traildb.h0000600000175000017500000002373613106440271015670 0ustar czchenczchen #ifndef __TRAILDB_H__ #define __TRAILDB_H__ #include #include #include "tdb_limits.h" #include "tdb_types.h" #include "tdb_error.h" #define TDB_VERSION_V0 0LLU #define TDB_VERSION_V0_1 1LLU #define TDB_VERSION_LATEST TDB_VERSION_V0_1 /* ----------------------- Construct a new TrailDB ----------------------- */ /* Init a new constructor handle */ tdb_cons *tdb_cons_init(void); /* Open a new constructor with a schema */ tdb_error tdb_cons_open(tdb_cons *cons, const char *root, const char **ofield_names, uint64_t num_ofields); /* Close a constructor handle */ void tdb_cons_close(tdb_cons *cons); /* Set constructor options */ tdb_error tdb_cons_set_opt(tdb_cons *cons, tdb_opt_key key, tdb_opt_value value); /* Get constructor options */ tdb_error tdb_cons_get_opt(tdb_cons *cons, tdb_opt_key key, tdb_opt_value *value); /* Add an event in the constructor */ tdb_error tdb_cons_add(tdb_cons *cons, const uint8_t uuid[16], const uint64_t timestamp, const char **values, const uint64_t *value_lengths); /* Merge an existing TrailDB to this constructor */ tdb_error tdb_cons_append(tdb_cons *cons, const tdb *db); /* Finalize a constructor */ tdb_error tdb_cons_finalize(tdb_cons *cons); /* --------------------------------- Open TrailDBs and access metadata --------------------------------- */ /* Init a new TrailDB handle */ tdb *tdb_init(void); /* Open a TrailDB */ tdb_error tdb_open(tdb *db, const char *root); /* Close a TrailDB */ void tdb_close(tdb *db); /* Inform the operating system that memory can be paged for this TrailDB */ void tdb_dontneed(const tdb *db); /* Inform the operating system that this TrailDB will be needed soon */ void tdb_willneed(const tdb *db); /* Get the number of trails */ uint64_t tdb_num_trails(const tdb *db); /* Get the number of events */ uint64_t tdb_num_events(const tdb *db); /* Get the number of fields */ uint64_t tdb_num_fields(const tdb *db); /* Get the oldest timestamp */ uint64_t tdb_min_timestamp(const tdb *db); /* Get the newest timestamp */ uint64_t tdb_max_timestamp(const tdb *db); /* Get the version of this TrailDB */ uint64_t tdb_version(const tdb *db); /* Translate an error code to a string */ const char *tdb_error_str(tdb_error errcode); /* Set a top-level option */ tdb_error tdb_set_opt(tdb *db, tdb_opt_key key, tdb_opt_value value); /* Get a top-level option */ tdb_error tdb_get_opt(tdb *db, tdb_opt_key key, tdb_opt_value *value); /* Set a trail-level option */ tdb_error tdb_set_trail_opt(tdb *db, uint64_t trail_id, tdb_opt_key key, tdb_opt_value value); /* Get a trail-level option */ tdb_error tdb_get_trail_opt(tdb *db, uint64_t trail_id, tdb_opt_key key, tdb_opt_value *value); /* ---------------------------------- Translate items to values and back ---------------------------------- */ /* Get the number of distinct values in the given field */ uint64_t tdb_lexicon_size(const tdb *db, tdb_field field); /* Get the field ID given a field name */ tdb_error tdb_get_field(const tdb *db, const char *field_name, tdb_field *field); /* Get the field name given a field ID */ const char *tdb_get_field_name(const tdb *db, tdb_field field); /* Get item corresponding to a value */ tdb_item tdb_get_item(const tdb *db, tdb_field field, const char *value, uint64_t value_length); /* Get value corresponding to a field, value ID pair */ const 
char *tdb_get_value(const tdb *db, tdb_field field, tdb_val val, uint64_t *value_length); /* Get value given an item */ const char *tdb_get_item_value(const tdb *db, tdb_item item, uint64_t *value_length); /* ------------ Handle UUIDs ------------ */ /* Get UUID given a Trail ID */ const uint8_t *tdb_get_uuid(const tdb *db, uint64_t trail_id); /* Get Trail ID given a UUID */ tdb_error tdb_get_trail_id(const tdb *db, const uint8_t uuid[16], uint64_t *trail_id); /* Translate a hex-encoded UUID to a raw 16-byte UUID */ tdb_error tdb_uuid_raw(const uint8_t hexuuid[32], uint8_t uuid[16]); /* Translate a raw 16-byte UUID to a hex-encoded UUID */ void tdb_uuid_hex(const uint8_t uuid[16], uint8_t hexuuid[32]); /* ------------ Event filter ------------ */ /* Create a new event filter */ struct tdb_event_filter *tdb_event_filter_new(void); /* Create a new event filter that matches all events */ struct tdb_event_filter *tdb_event_filter_new_match_all(void); /* Create a new event filter that matches nothing */ struct tdb_event_filter *tdb_event_filter_new_match_none(void); /* Add a new term (item) in an OR-clause */ tdb_error tdb_event_filter_add_term(struct tdb_event_filter *filter, tdb_item term, int is_negative); /* Add a timestamp range query (start_time <= timestamp < end_time) in an OR-clause */ tdb_error tdb_event_filter_add_time_range(struct tdb_event_filter *filter, uint64_t start_time, uint64_t end_time); /* Add a new clause, connected by AND to the previous clauses */ tdb_error tdb_event_filter_new_clause(struct tdb_event_filter *filter); /* Free an event filter */ void tdb_event_filter_free(struct tdb_event_filter *filter); /* Get term type for a term in a clause */ tdb_error tdb_event_filter_get_term_type(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t term_index, tdb_event_filter_term_type *term_type); /* Get an item in a clause */ tdb_error tdb_event_filter_get_item(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t item_index, tdb_item *item, int *is_negative); /* Get time-range term in a clause */ tdb_error tdb_event_filter_get_time_range(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t term_index, uint64_t *start_time, uint64_t *end_time); /* Get the number of clauses in this filter */ uint64_t tdb_event_filter_num_clauses(const struct tdb_event_filter *filter); /* Get the number of terms in a clause */ tdb_error tdb_event_filter_num_terms(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t *num_terms); /* ------------ Trail cursor ------------ */ /* Create a new cursor */ tdb_cursor *tdb_cursor_new(const tdb *db); /* Free a cursor */ void tdb_cursor_free(tdb_cursor *cursor); /* Reset the cursor to the given Trail ID */ tdb_error tdb_get_trail(tdb_cursor *cursor, uint64_t trail_id); /* Get the number of events remaining in this cursor */ uint64_t tdb_get_trail_length(tdb_cursor *cursor); /* Set an event filter for this cursor */ tdb_error tdb_cursor_set_event_filter(tdb_cursor *cursor, const struct tdb_event_filter *filter); /* Unset an event filter */ void tdb_cursor_unset_event_filter(tdb_cursor *cursor); /* Internal function used by tdb_cursor_next() */ int _tdb_cursor_next_batch(tdb_cursor *cursor); /* ------------ Multi cursor ------------ */ /* Create a new multicursor */ tdb_multi_cursor *tdb_multi_cursor_new(tdb_cursor **cursors, uint64_t num_cursors); /* Reset the multicursor to reflect the underlying status of individual cursors. 
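For illustration only (a sketch, not part of the original header): joining one UUID's events across two time-sharded TrailDBs could look roughly like the following, where the shard paths and the all-zero UUID are placeholders and error handling is omitted:

    tdb *db1 = tdb_init(), *db2 = tdb_init();
    tdb_open(db1, "shard1");
    tdb_open(db2, "shard2");
    tdb_cursor *cursors[2] = {tdb_cursor_new(db1), tdb_cursor_new(db2)};
    tdb_multi_cursor *mc = tdb_multi_cursor_new(cursors, 2);
    uint8_t uuid[16] = {0};
    uint64_t tid1, tid2, n = 0;
    tdb_get_trail_id(db1, uuid, &tid1);
    tdb_get_trail_id(db2, uuid, &tid2);
    tdb_get_trail(cursors[0], tid1);
    tdb_get_trail(cursors[1], tid2);
    tdb_multi_cursor_reset(mc);
    const tdb_multi_event *me;
    while ((me = tdb_multi_cursor_next(mc)))
        n++;
    tdb_multi_cursor_free(mc);
    tdb_cursor_free(cursors[0]);
    tdb_cursor_free(cursors[1]);
    tdb_close(db1);
    tdb_close(db2);

Here n simply counts the merged, timestamp-ordered events; the reset call above follows the rule stated next: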
Call after tdb_get_trail() or tdb_cursor_next() */ void tdb_multi_cursor_reset(tdb_multi_cursor *mc); /* Return next event in the timestamp order from the underlying cursors */ const tdb_multi_event *tdb_multi_cursor_next(tdb_multi_cursor *mcursor); /* Return a batch of maximum max_events in the timestamp order from the underlying cursors */ uint64_t tdb_multi_cursor_next_batch(tdb_multi_cursor *mcursor, tdb_multi_event *events, uint64_t max_events); /* Peek the next event in the cursor */ const tdb_multi_event *tdb_multi_cursor_peek(tdb_multi_cursor *mcursor); /* Free multicursors */ void tdb_multi_cursor_free(tdb_multi_cursor *mcursor); /* Return the next event from the cursor tdb_cursor_next() is defined here so it can be inlined the pragma is a workaround for older GCCs that have this issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54113 */ #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wmissing-prototypes" __attribute__((visibility("default"))) inline const tdb_event *tdb_cursor_next(tdb_cursor *cursor) { if (cursor->num_events_left > 0 || _tdb_cursor_next_batch(cursor)){ const tdb_event *e = (const tdb_event*)cursor->next_event; cursor->next_event += sizeof(tdb_event) + e->num_items * sizeof(tdb_item); --cursor->num_events_left; return e; }else return NULL; } /* Peek the next event in the cursor */ __attribute__((visibility("default"))) inline const tdb_event *tdb_cursor_peek(tdb_cursor *cursor) { if (cursor->num_events_left > 0 || _tdb_cursor_next_batch(cursor)){ return (const tdb_event*)cursor->next_event; }else return NULL; } #pragma GCC diagnostic pop #endif /* __TRAILDB_H__ */ traildb-0.6+dfsg1/src/tdb_error.h0000600000175000017500000000304113106440271016214 0ustar czchenczchen #ifndef __TDB_ERROR_H__ #define __TDB_ERROR_H__ typedef enum{ TDB_ERR_OK = 0, /* generic */ TDB_ERR_NOMEM = -2, TDB_ERR_PATH_TOO_LONG = -3, TDB_ERR_UNKNOWN_FIELD = -4, TDB_ERR_UNKNOWN_UUID = -5, TDB_ERR_INVALID_TRAIL_ID = -6, TDB_ERR_HANDLE_IS_NULL = -7, TDB_ERR_HANDLE_ALREADY_OPENED = -8, TDB_ERR_UNKNOWN_OPTION = -9, TDB_ERR_INVALID_OPTION_VALUE = -10, TDB_ERR_INVALID_UUID = -11, /* io */ TDB_ERR_IO_OPEN = -65, TDB_ERR_IO_CLOSE = -66, TDB_ERR_IO_WRITE = -67, TDB_ERR_IO_READ = -68, TDB_ERR_IO_TRUNCATE = -69, TDB_ERR_IO_PACKAGE = -70, /* tdb_open */ TDB_ERR_INVALID_INFO_FILE = -129, TDB_ERR_INVALID_VERSION_FILE = -130, TDB_ERR_INCOMPATIBLE_VERSION = -131, TDB_ERR_INVALID_FIELDS_FILE = -132, TDB_ERR_INVALID_UUIDS_FILE = -133, TDB_ERR_INVALID_CODEBOOK_FILE = -134, TDB_ERR_INVALID_TRAILS_FILE = -135, TDB_ERR_INVALID_LEXICON_FILE = -136, TDB_ERR_INVALID_PACKAGE = -137, /* tdb_cons */ TDB_ERR_TOO_MANY_FIELDS = -257, TDB_ERR_DUPLICATE_FIELDS = -258, TDB_ERR_INVALID_FIELDNAME = -259, TDB_ERR_TOO_MANY_TRAILS = -260, TDB_ERR_VALUE_TOO_LONG = -261, TDB_ERR_APPEND_FIELDS_MISMATCH = -262, TDB_ERR_LEXICON_TOO_LARGE = -263, TDB_ERR_TIMESTAMP_TOO_LARGE = -264, TDB_ERR_TRAIL_TOO_LONG = -265, /* querying */ TDB_ERR_ONLY_DIFF_FILTER = -513, TDB_ERR_NO_SUCH_ITEM = -514, TDB_ERR_INVALID_RANGE = -515, TDB_ERR_INCORRECT_TERM_TYPE = -516 } tdb_error; #endif /* __TDB_ERROR_H__ */ traildb-0.6+dfsg1/src/tdb_huffman.h0000600000175000017500000000567313106440271016524 0ustar czchenczchen #ifndef __HUFFMAN_H__ #define __HUFFMAN_H__ #include #include "judy_128_map.h" #include "tdb_types.h" #include "tdb_bits.h" #include "tdb_internal.h" /* ensure TDB_CODEBOOK_SIZE < UINT32_MAX */ #define HUFF_CODEBOOK_SIZE 65536 #define HUFF_CODE(x) ((uint16_t)((x) & 65535LU)) #define HUFF_BITS(x) ((uint32_t)(((x) & (65535LU << 16LU)) 
>> 16LU)) #define HUFF_IS_BIGRAM(x) ((x >> 64) & UINT64_MAX) #define HUFF_BIGRAM_TO_ITEM(x) ((tdb_item)(x & UINT64_MAX)) #define HUFF_BIGRAM_OTHER_ITEM(x) ((tdb_item)(x >> 64)) struct huff_codebook{ __uint128_t symbol; uint32_t bits; } __attribute__((packed)); struct field_stats{ uint32_t field_id_bits; uint32_t field_bits[0]; }; /* ENCODE */ int huff_create_codemap(const struct judy_128_map *gram_freqs, struct judy_128_map *codemap); void huff_encode_grams(const struct judy_128_map *codemap, const __uint128_t *grams, uint64_t num_grams, char *buf, uint64_t *offs, const struct field_stats *fstats); struct huff_codebook *huff_create_codebook(const struct judy_128_map *codemap, uint32_t *size); struct field_stats *huff_field_stats(const uint64_t *field_cardinalities, uint64_t num_fields, uint64_t max_timestamp); static inline uint64_t huff_encoded_max_bits(uint64_t num_grams) { /* how many bits we need in the worst case to encode num_grams? - each gram may be a bigram encoded as two literals (* 2) - each literal takes 1 flag bit, 14 field bits, and 48 value bits in the worst case */ return num_grams * 2 * (1 + 14 + 48); } /* DECODE */ int huff_convert_v0_codebook(struct tdb_file *codebook); /* this may return either an unigram or a bigram */ static inline __uint128_t huff_decode_value(const struct huff_codebook *codebook, const char *data, uint64_t *offset, const struct field_stats *fstats) { /* TODO - we could have a special read_bits for this case */ uint64_t enc = read_bits64(data, *offset, 64); if (enc & 1){ uint16_t idx = HUFF_CODE(enc >> 1); *offset += codebook[idx].bits + 1; return codebook[idx].symbol; }else{ /* read literal: [0 (1 bit) | field-id (field_id_bits) | value (field_bits[field_id])] */ tdb_field field = (tdb_field)((enc >> 1) & ((1LLU << fstats->field_id_bits) - 1)); tdb_val val = (enc >> (fstats->field_id_bits + 1)) & ((1LLU << fstats->field_bits[field]) - 1); *offset += 1 + fstats->field_id_bits + fstats->field_bits[field]; return tdb_make_item(field, val); } } #endif /* __HUFFMAN_H__ */ traildb-0.6+dfsg1/src/arena.h0000600000175000017500000000073213106440271015324 0ustar czchenczchen #ifndef __ARENA_H__ #define __ARENA_H__ #include #include #ifndef ARENA_INCREMENT #define ARENA_INCREMENT 1000000 #endif #define ARENA_DISK_BUFFER (1 << 23) /* must be a power of two */ struct arena{ char *data; uint64_t size; uint64_t next; uint64_t item_size; uint64_t arena_increment; int failed; FILE *fd; }; int arena_flush(const struct arena *a); void *arena_add_item(struct arena *a); #endif /* __ARENA_H__ */ traildb-0.6+dfsg1/src/tdb_internal.h0000600000175000017500000001134413106440271016704 0ustar czchenczchen #ifndef __TDB_INTERNAL_H__ #define __TDB_INTERNAL_H__ #include #include #include "traildb.h" #include "arena.h" #include "judy_str_map.h" #include "judy_128_map.h" #include "tdb_profile.h" #include "tdb_io.h" #define TDB_EXPORT __attribute__((visibility("default"))) /* These are defined by autoconf Nothing has been tested on 32-bit systems so it is better to fail loudly for now. 
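Separately, a note for readers of struct tdb_cons_event below (an observation derived from tdb_cons_add() and groupby_uuid_handle_one_trail(), not a new invariant): the events added for one UUID form a singly linked list through prev_event_idx. Stored indices are 1-based, 0 marks the end of the chain, and the per-UUID word kept in cons->trails points at the most recently added event, so a trail can be walked backwards roughly like this sketch (names shortened for illustration):

    uint64_t idx = latest_event_index_plus_one;
    while (idx){
        const struct tdb_cons_event *e = &events[idx - 1];
        idx = e->prev_event_idx;
    }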
*/ struct tdb_cons_event{ uint64_t item_zero; uint64_t num_items; uint64_t timestamp; uint64_t prev_event_idx; }; #define TDB_FILTER_MATCH_ALL 1 #define TDB_FILTER_MATCH_NONE 2 /* Used to build up a CNF expression for filtering events */ struct tdb_event_filter{ uint64_t count; /* number of terms in current clause */ uint64_t size; /* amount of allocated space for items = sizeof(tdb_item) * size */ uint64_t clause_len_idx; /* idx for storing number of terms in current clause */ tdb_item *items; /* array of filters for CNF expression. Each clause in the list starts with a number indicating the number of terms in the clause, following by the terms. Each term starts with a term type flag. For a matching term, a tdb_item containing the field and value to match follows. For time-range filters, the term type flag is followed by two entries, the start and end timestamps. */ uint64_t options; /* MATCH_ALL or MATCH_NONE */ }; /* Flags for types of term comparisons. We currently support two types of terms: matching terms and time-range filters. Terms fall into two types: items and timestamps. If the TIME_RANGE flag is not set, then the item is interpreted as a tdb_item and matched for equality (based on the NEGATED) flag. If the TIME_RANGE flag is set, then the item is interpreted as an uint64_t timestamp and NEGATED is ignored. */ typedef enum { TDB_EVENT_NEGATED = 1, TDB_EVENT_TIME_RANGE = 2 } tdb_event_op_flags; struct tdb_decode_state{ const tdb *db; /* internal buffer */ void *events_buffer; uint64_t events_buffer_len; /* trail state */ uint64_t trail_id; const char *data; uint64_t size; uint64_t offset; uint64_t tstamp; /* options */ const struct tdb_event_filter *filter; int filter_type; int edge_encoded; tdb_item previous_items[0]; }; struct tdb_grouped_event{ uint64_t item_zero; uint64_t num_items; uint64_t timestamp; uint64_t trail_id; }; struct _tdb_cons { char *root; struct arena events; struct arena items; char **ofield_names; uint64_t min_timestamp; uint64_t num_ofields; struct judy_128_map trails; struct judy_str_map *lexicons; char tempfile[TDB_MAX_PATH_SIZE]; /* options */ uint64_t output_format; uint64_t no_bigrams; }; struct tdb_file { char *ptr; const char *data; uint64_t size; uint64_t mmap_size; }; struct tdb_lexicon { uint64_t version; uint64_t size; uint64_t width; union { const uint32_t *toc32; const uint64_t *toc64; } toc; const char *data; }; struct _tdb { uint64_t min_timestamp; uint64_t max_timestamp; uint64_t max_timestamp_delta; uint64_t num_trails; uint64_t num_events; uint64_t num_fields; struct tdb_file uuids; struct tdb_file codebook; struct tdb_file trails; struct tdb_file toc; struct tdb_file *lexicons; char **field_names; struct field_stats *field_stats; uint64_t version; /* tdb_package */ FILE *package_handle; void *package_toc; /* options */ /* TDB_OPT_CURSOR_EVENT_BUFFER_SIZE */ uint64_t opt_cursor_event_buffer_size; /* TDB_OPT_ONLY_DIFF_ITEMS */ int opt_edge_encoded; /* TDB_OPT_EVENT_FILTER */ const struct tdb_event_filter *opt_event_filter; /* trail-level event filters */ Pvoid_t opt_trail_event_filters; }; void tdb_lexicon_read(const tdb *db, tdb_field field, struct tdb_lexicon *lex); const char *tdb_lexicon_get(const struct tdb_lexicon *lex, tdb_val i, uint64_t *length); tdb_error tdb_encode(tdb_cons *cons, const tdb_item *items); tdb_error edge_encode_items(const tdb_item *items, tdb_item **encoded, uint64_t *num_encoded, uint64_t *encoded_size, tdb_item *prev_items, const struct tdb_grouped_event *ev); int file_mmap(const char *path, const char *root, 
struct tdb_file *dst, const tdb *db); int is_fieldname_invalid(const char* field); #endif /* __TDB_INTERNAL_H__ */ traildb-0.6+dfsg1/src/tdb_bits.h0000600000175000017500000000252713106440271016034 0ustar czchenczchen #ifndef __TDB_BITS_H__ #define __TDB_BITS_H__ /* NOTE: these functions may access extra 7 bytes beyond the end of the destionation. To avoid undefined behavior, make sure there's always a padding of 7 zero bytes after the last byte offset accessed. */ static inline uint64_t read_bits(const char *src, uint64_t offs, uint32_t bits) { /* this assumes that bits <= 48 */ const uint64_t *src_w = (const uint64_t*)&src[offs >> 3]; return (*src_w >> (offs & 7)) & (((1LLU << bits) - 1)); } static inline void write_bits(char *dst, uint64_t offs, uint64_t val) { /* this assumes that (val >> 48) == 0 */ uint64_t *dst_w = (uint64_t*)&dst[offs >> 3]; *dst_w |= ((uint64_t)val) << (offs & 7); } /* TODO benchmark 64-bit versions against a version using __uint128_t */ static inline void write_bits64(char *dst, uint64_t offs, uint64_t val) { write_bits(dst, offs, val); val >>= 48; if (val) write_bits(dst, offs + 48, val); } static inline uint64_t read_bits64(const char *src, uint64_t offs, uint32_t bits) { if (bits > 48){ uint64_t val = read_bits(src, offs + 48, bits - 48); val <<= 48; val |= read_bits(src, offs, 48); return val; }else return read_bits(src, offs, bits); } #endif /* __TDB_BITS_H__ */ traildb-0.6+dfsg1/src/judy_128_map.c0000600000175000017500000000635713106440271016444 0ustar czchenczchen #include #include #include #include #undef JUDYERROR #define JUDYERROR(CallerFile, CallerLine, JudyFunc, JudyErrno, JudyErrID) \ { \ if ((JudyErrno) == JU_ERRNO_NOMEM) \ goto out_of_memory; \ } #include #include "judy_128_map.h" void j128m_init(struct judy_128_map *j128m) { memset(j128m, 0, sizeof(struct judy_128_map)); } Word_t *j128m_insert(struct judy_128_map *j128m, __uint128_t key) { uint64_t hi_key = (key >> 64) & UINT64_MAX; uint64_t lo_key = key & UINT64_MAX; Word_t *lo_ptr; Word_t *hi_ptr; Pvoid_t lo_map; /* TODO handle out of memory with Judy - see man 3 judy */ JLI(hi_ptr, j128m->hi_map, hi_key); lo_map = (Pvoid_t)*hi_ptr; JLI(lo_ptr, lo_map, lo_key); *hi_ptr = (Word_t)lo_map; return lo_ptr; out_of_memory: return NULL; } Word_t *j128m_get(const struct judy_128_map *j128m, __uint128_t key) { uint64_t hi_key = (key >> 64) & UINT64_MAX; uint64_t lo_key = key & UINT64_MAX; Word_t *lo_ptr; Word_t *hi_ptr; Pvoid_t lo_map; JLG(hi_ptr, j128m->hi_map, hi_key); if (hi_ptr){ lo_map = (Pvoid_t)*hi_ptr; JLG(lo_ptr, lo_map, lo_key); if (lo_ptr) return lo_ptr; } return NULL; } #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wincompatible-pointer-types" void *j128m_fold(const struct judy_128_map *j128m, judy_128_fold_fn fun, void *state) { uint64_t hi_key = 0; Word_t *hi_ptr; JLF(hi_ptr, j128m->hi_map, hi_key); while (hi_ptr){ Pvoid_t lo_map = (Pvoid_t)*hi_ptr; uint64_t lo_key = 0; Word_t *lo_ptr; JLF(lo_ptr, lo_map, lo_key); while (lo_ptr){ __uint128_t key = hi_key; key <<= 64; key |= lo_key; state = fun(key, lo_ptr, state); JLN(lo_ptr, lo_map, lo_key); } JLN(hi_ptr, j128m->hi_map, hi_key); } return state; out_of_memory: /* this really should be impossible: iterating shouldn't consume extra memory */ fprintf(stderr, "j128m_fold out of memory! 
this shouldn't happen\n"); exit(1); } #pragma GCC diagnostic pop static void *num_keys_fun(__uint128_t key __attribute__((unused)), Word_t *value __attribute__((unused)), void *state) { ++*(uint64_t*)state; return state; } uint64_t j128m_num_keys(const struct judy_128_map *j128m) { uint64_t count = 0; j128m_fold(j128m, num_keys_fun, &count); return count; } void j128m_free(struct judy_128_map *j128m) { uint64_t hi_key = 0; Word_t *hi_ptr; Word_t tmp; #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wsign-compare" #pragma GCC diagnostic ignored "-Wincompatible-pointer-types" JLF(hi_ptr, j128m->hi_map, hi_key); while (hi_ptr){ Pvoid_t lo_map = (Pvoid_t)*hi_ptr; JLFA(tmp, lo_map); JLN(hi_ptr, j128m->hi_map, hi_key); } JLFA(tmp, j128m->hi_map); #pragma GCC diagnostic pop j128m->hi_map = NULL; out_of_memory: return; } traildb-0.6+dfsg1/src/judy_str_map.h0000600000175000017500000000200713106440271016733 0ustar czchenczchen #ifndef __JUDY_STR_MAP_H__ #define __JUDY_STR_MAP_H__ #include #include #include "xxhash/xxhash.h" #define BUFFER_INITIAL_SIZE 65536 typedef void *(*judy_str_fold_fn)(uint64_t id, const char *value, uint64_t length, void *); struct judy_str_map{ char *buffer; uint64_t buffer_offset; uint64_t buffer_size; Pvoid_t large_map; uint64_t num_keys; XXH64_state_t hash_state; }; int jsm_init(struct judy_str_map *jsm); uint64_t jsm_insert(struct judy_str_map *jsm, const char *buf, uint64_t length); uint64_t jsm_get(struct judy_str_map *jsm, const char *buf, uint64_t length); void jsm_free(struct judy_str_map *jsm); void *jsm_fold(const struct judy_str_map *jsm, judy_str_fold_fn fun, void *state); uint64_t jsm_num_keys(const struct judy_str_map *jsm); uint64_t jsm_values_size(const struct judy_str_map *jsm); #endif /* __JUDY_STR_MAP_H__ */ traildb-0.6+dfsg1/src/tdb_io.h0000600000175000017500000000353613106440271015503 0ustar czchenczchen #ifndef __TDB_IO_H__ #define __TDB_IO_H__ #include #include #include #include "tdb_limits.h" #include "tdb_error.h" #define TDB_OPEN(file, path, mode)\ if (!(file = fopen(path, mode))){\ ret = TDB_ERR_IO_OPEN;\ goto done;\ } #define TDB_CLOSE_FINAL(file)\ {\ if (file && fclose(file))\ return TDB_ERR_IO_CLOSE;\ file = NULL;\ } #define TDB_CLOSE(file)\ {\ if (file && fclose(file)){\ ret = TDB_ERR_IO_CLOSE;\ goto done;\ }\ file = NULL;\ } #define TDB_FPRINTF(file, fmt, ...)\ if (fprintf(file, fmt, ##__VA_ARGS__) < 1){\ ret = TDB_ERR_IO_WRITE;\ goto done;\ } #define TDB_READ(file, buf, size)\ if (fread(buf, size, 1, file) != 1){\ ret = TDB_ERR_IO_READ;\ goto done;\ } #define TDB_WRITE(file, buf, size)\ if (fwrite(buf, size, 1, file) != 1){\ ret = TDB_ERR_IO_WRITE;\ goto done;\ } #define TDB_TRUNCATE(file, size)\ if (ftruncate(fileno(file), size)){\ ret = TDB_ERR_IO_TRUNCATE;\ goto done;\ } #define TDB_PATH(path, fmt, ...)\ if (tdb_path(path, fmt, ##__VA_ARGS__)){\ ret = TDB_ERR_PATH_TOO_LONG;\ goto done;\ } #define TDB_SEEK(file, offset)\ if (offset > LONG_MAX || fseek(file, (long)(offset), SEEK_SET) == -1){\ ret = TDB_ERR_IO_WRITE;\ goto done;\ } static int tdb_path(char path[TDB_MAX_PATH_SIZE], char *fmt, ...) __attribute__((unused)); static int tdb_path(char path[TDB_MAX_PATH_SIZE], char *fmt, ...) 
{ va_list aptr; va_start(aptr, fmt); if (vsnprintf(path, TDB_MAX_PATH_SIZE, fmt, aptr) >= TDB_MAX_PATH_SIZE) return TDB_ERR_PATH_TOO_LONG; va_end(aptr); return 0; } #endif /* __TDB_IO_H__ */ traildb-0.6+dfsg1/src/judy_128_map.h0000600000175000017500000000122413106440271016435 0ustar czchenczchen #ifndef __JUDY_128_MAP_H__ #define __JUDY_128_MAP_H__ #include #include typedef void *(*judy_128_fold_fn)(__uint128_t key, Word_t *value, void*); struct judy_128_map{ Pvoid_t hi_map; }; void j128m_init(struct judy_128_map *j128m); Word_t *j128m_insert(struct judy_128_map *j128m, __uint128_t key); Word_t *j128m_get(const struct judy_128_map *j128m, __uint128_t key); void *j128m_fold(const struct judy_128_map *j128m, judy_128_fold_fn fun, void *state); uint64_t j128m_num_keys(const struct judy_128_map *j128m); void j128m_free(struct judy_128_map *j128m); #endif /* __JUDY_128_MAP_H__ */ traildb-0.6+dfsg1/src/tdb_encode.c0000600000175000017500000004167213106440271016327 0ustar czchenczchen#define _DEFAULT_SOURCE /* mkstemp */ #define _GNU_SOURCE #include #include #include #include #include #undef JUDYERROR #define JUDYERROR(CallerFile, CallerLine, JudyFunc, JudyErrno, JudyErrID) \ { \ if ((JudyErrno) == JU_ERRNO_NOMEM) \ goto out_of_memory; \ } #include #include "tdb_internal.h" #include "tdb_encode_model.h" #include "tdb_huffman.h" #include "tdb_error.h" #include "tdb_io.h" #define EDGE_INCREMENT 1000000 #define GROUPBUF_INCREMENT 1000000 #define READ_BUFFER_SIZE (1000000 * sizeof(struct tdb_grouped_event)) #define WRITE_BUFFER_SIZE (8 * 1024 * 1024) #define INITIAL_ENCODING_BUF_BITS 8 * 1024 * 1024 struct jm_fold_state{ FILE *grouped_w; struct tdb_grouped_event *buf; uint64_t buf_size; uint64_t trail_id; const struct tdb_cons_event *events; const uint64_t min_timestamp; uint64_t max_timestamp; uint64_t max_timedelta; tdb_error ret; }; static int compare(const void *p1, const void *p2) { const struct tdb_grouped_event *x = (const struct tdb_grouped_event*)p1; const struct tdb_grouped_event *y = (const struct tdb_grouped_event*)p2; if (x->timestamp > y->timestamp) return 1; else if (x->timestamp < y->timestamp) return -1; return 0; } static void *groupby_uuid_handle_one_trail( __uint128_t uuid __attribute__((unused)), Word_t *value, void *state) { struct jm_fold_state *s = (struct jm_fold_state*)state; /* find the last event belonging to this trail */ const struct tdb_cons_event *ev = &s->events[*value - 1]; uint64_t j = 0; uint64_t num_events = 0; int ret = 0; if (s->ret) return s; /* loop through all events belonging to this trail, following back-links */ while (1){ if (j >= s->buf_size){ s->buf_size += GROUPBUF_INCREMENT; if (!(s->buf = realloc(s->buf, s->buf_size * sizeof(struct tdb_grouped_event)))){ ret = TDB_ERR_NOMEM; goto done; } } s->buf[j].trail_id = s->trail_id; s->buf[j].item_zero = ev->item_zero; s->buf[j].num_items = ev->num_items; s->buf[j].timestamp = ev->timestamp; /* TODO write a test for an extra long (>2^32) trail */ if (++j == TDB_MAX_TRAIL_LENGTH){ ret = TDB_ERR_TRAIL_TOO_LONG; goto done; } if (ev->prev_event_idx) ev = &s->events[ev->prev_event_idx - 1]; else break; } num_events = j; /* sort events of this trail by time */ /* TODO make this stable sort */ /* TODO this could really benefit from Timsort since raw data is often partially sorted */ qsort(s->buf, num_events, sizeof(struct tdb_grouped_event), compare); /* delta-encode timestamps */ uint64_t prev_timestamp = s->min_timestamp; for (j = 0; j < num_events; j++){ uint64_t timestamp = s->buf[j].timestamp; uint64_t delta = 
timestamp - prev_timestamp; if (delta < TDB_MAX_TIMEDELTA){ if (timestamp > s->max_timestamp) s->max_timestamp = timestamp; if (delta > s->max_timedelta) s->max_timedelta = delta; prev_timestamp = timestamp; /* convert the delta value to a proper item */ s->buf[j].timestamp = tdb_make_item(0, delta); }else{ ret = TDB_ERR_TIMESTAMP_TOO_LARGE; goto done; } } TDB_WRITE(s->grouped_w, s->buf, num_events * sizeof(struct tdb_grouped_event)); ++s->trail_id; done: s->ret = ret; return s; } static tdb_error groupby_uuid(FILE *grouped_w, const struct tdb_cons_event *events, tdb_cons *cons, uint64_t *num_trails, uint64_t *max_timestamp, uint64_t *max_timedelta) { struct jm_fold_state state = { .grouped_w = grouped_w, .events = events, .min_timestamp = cons->min_timestamp }; /* we require (min_timestamp - 0 < TDB_MAX_TIMEDELTA) */ if (cons->min_timestamp >= TDB_MAX_TIMEDELTA) return TDB_ERR_TIMESTAMP_TOO_LARGE; j128m_fold(&cons->trails, groupby_uuid_handle_one_trail, &state); *num_trails = state.trail_id; *max_timestamp = state.max_timestamp; *max_timedelta = state.max_timedelta; free(state.buf); return state.ret; } tdb_error edge_encode_items(const tdb_item *items, tdb_item **encoded, uint64_t *num_encoded, uint64_t *encoded_size, tdb_item *prev_items, const struct tdb_grouped_event *ev) { uint64_t n = 0; uint64_t j = ev->item_zero; /* edge encode items: keep only fields that are different from the previous event */ for (; j < ev->item_zero + ev->num_items; j++){ tdb_field field = tdb_item_field(items[j]); if (prev_items[field] != items[j]){ if (n == *encoded_size){ *encoded_size += EDGE_INCREMENT; if (!(*encoded = realloc(*encoded, *encoded_size * sizeof(tdb_item)))) return TDB_ERR_NOMEM; } (*encoded)[n++] = prev_items[field] = items[j]; } } *num_encoded = n; return 0; } static tdb_error store_info(const char *path, uint64_t num_trails, uint64_t num_events, uint64_t min_timestamp, uint64_t max_timestamp, uint64_t max_timedelta) { FILE *out = NULL; int ret = 0; /* NOTE - this file shouldn't grow to be more than 512 bytes, so it occupies a constant amount of space in a tar package. 
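For illustration, the single line written below has the form

    num_trails num_events min_timestamp max_timestamp max_timedelta

as space-separated decimal integers, e.g. (made-up values) "1000 250000 1462105600 1462192000 86400". Even with five maximal 20-digit uint64 values plus separators this is only about 105 bytes, well under the 512-byte note above.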
*/ TDB_OPEN(out, path, "w"); TDB_FPRINTF(out, "%"PRIu64" %"PRIu64" %"PRIu64" %"PRIu64" %"PRIu64"\n", num_trails, num_events, min_timestamp, max_timestamp, max_timedelta); done: TDB_CLOSE_FINAL(out); return ret; } static tdb_error encode_trails(const tdb_item *items, FILE *grouped, uint64_t num_events, uint64_t num_trails, uint64_t num_fields, const struct judy_128_map *codemap, const struct judy_128_map *gram_freqs, const struct field_stats *fstats, const char *path, const char *toc_path) { __uint128_t *grams = NULL; tdb_item *prev_items = NULL; uint64_t *encoded = NULL; uint64_t encoded_size = 0; uint64_t buf_size = INITIAL_ENCODING_BUF_BITS; uint64_t i = 1; char *buf = NULL; FILE *out = NULL; uint64_t file_offs = 0; uint64_t *toc = NULL; struct gram_bufs gbufs; struct tdb_grouped_event ev; int ret = 0; char *write_buf = NULL; if ((ret = init_gram_bufs(&gbufs, num_fields))) goto done; if (!(write_buf = malloc(WRITE_BUFFER_SIZE))){ ret = TDB_ERR_NOMEM; goto done; } TDB_OPEN(out, path, "w"); setvbuf(out, write_buf, _IOFBF, WRITE_BUFFER_SIZE); if (!(buf = calloc(1, buf_size / 8 + 8))){ ret = TDB_ERR_NOMEM; goto done; } if (!(prev_items = malloc(num_fields * sizeof(tdb_item)))){ ret = TDB_ERR_NOMEM; goto done; } if (!(grams = malloc(num_fields * 16))){ ret = TDB_ERR_NOMEM; goto done; } if (!(toc = malloc((num_trails + 1) * 8))){ ret = TDB_ERR_NOMEM; goto done; } rewind(grouped); if (num_events) TDB_READ(grouped, &ev, sizeof(struct tdb_grouped_event)); while (i <= num_events){ /* encode trail for one UUID (multiple events) */ /* reserve 3 bits in the head of the trail for a length residual: Length of a trail is measured in bytes but the last byte may be short. The residual indicates how many bits in the end we should ignore. */ uint64_t offs = 3; uint64_t trail_id = ev.trail_id; uint64_t n, m, trail_size; toc[trail_id] = file_offs; memset(prev_items, 0, num_fields * sizeof(tdb_item)); while (ev.trail_id == trail_id){ /* 1) produce an edge-encoded set of items for this event */ if ((ret = edge_encode_items(items, &encoded, &n, &encoded_size, prev_items, &ev))) goto done; /* 2) cover the encoded set with a set of unigrams and bigrams */ if ((ret = choose_grams_one_event(encoded, n, gram_freqs, &gbufs, grams, &m, &ev))) goto done; uint64_t bits_needed = offs + huff_encoded_max_bits(m) + 64; if (bits_needed > buf_size){ char *new_buf; buf_size = bits_needed * 2; if (!(new_buf = calloc(1, buf_size / 8 + 8))){ ret = TDB_ERR_NOMEM; goto done; } memcpy(new_buf, buf, offs / 8 + 1); free(buf); buf = new_buf; } /* 3) huffman-encode grams */ huff_encode_grams(codemap, grams, m, buf, &offs, fstats); if (i++ < num_events){ TDB_READ(grouped, &ev, sizeof(struct tdb_grouped_event)); }else break; } /* write the length residual */ if (offs & 7){ trail_size = offs / 8 + 1; write_bits(buf, 0, 8 - (uint32_t)(offs & 7LLU)); }else{ trail_size = offs / 8; } /* append trail to the end of file */ TDB_WRITE(out, buf, trail_size); file_offs += trail_size; memset(buf, 0, trail_size); } /* keep the redundant last offset in the TOC, so we can determine trail length with toc[i + 1] - toc[i]. */ toc[num_trails] = file_offs; /* write an extra 8 null bytes: huffman may require up to 7 when reading */ uint64_t zero = 0; TDB_WRITE(out, &zero, 8); file_offs += 8; TDB_CLOSE(out); TDB_OPEN(out, toc_path, "w"); size_t offs_size = file_offs < UINT32_MAX ? 
4 : 8; for (i = 0; i < num_trails + 1; i++) TDB_WRITE(out, &toc[i], offs_size); done: TDB_CLOSE_FINAL(out); free(write_buf); free_gram_bufs(&gbufs); free(grams); free(encoded); free(prev_items); free(buf); free(toc); return ret; } static tdb_error store_codebook(const struct judy_128_map *codemap, const char *path) { FILE *out = NULL; uint32_t size; struct huff_codebook *book = huff_create_codebook(codemap, &size); int ret = 0; TDB_OPEN(out, path, "w"); TDB_WRITE(out, book, size); done: TDB_CLOSE_FINAL(out); free(book); return ret; } tdb_error tdb_encode(tdb_cons *cons, const tdb_item *items) { char path[TDB_MAX_PATH_SIZE]; char grouped_path[TDB_MAX_PATH_SIZE]; char toc_path[TDB_MAX_PATH_SIZE]; char *root = cons->root; char *read_buf = NULL; struct field_stats *fstats = NULL; uint64_t num_trails = 0; uint64_t num_events = cons->events.next; uint64_t num_fields = cons->num_ofields + 1; uint64_t max_timestamp = 0; uint64_t max_timedelta = 0; uint64_t *field_cardinalities = NULL; uint64_t i; Pvoid_t unigram_freqs = NULL; struct judy_128_map gram_freqs; struct judy_128_map codemap; Word_t tmp; FILE *grouped_w = NULL; FILE *grouped_r = NULL; int fd, ret = 0; TDB_TIMER_DEF j128m_init(&gram_freqs); j128m_init(&codemap); if (!(field_cardinalities = calloc(cons->num_ofields, 8))){ ret = TDB_ERR_NOMEM; goto done; } for (i = 0; i < cons->num_ofields; i++) field_cardinalities[i] = jsm_num_keys(&cons->lexicons[i]); /* 1. group events by trail, sort events of each trail by time, and delta-encode timestamps */ TDB_TIMER_START TDB_PATH(grouped_path, "%s/tmp.grouped.XXXXXX", root); if ((fd = mkstemp(grouped_path)) == -1){ ret = TDB_ERR_IO_OPEN; goto done; } if (!(grouped_w = fdopen(fd, "w"))){ ret = TDB_ERR_IO_OPEN; goto done; } if (cons->events.data) if ((ret = groupby_uuid(grouped_w, (struct tdb_cons_event*)cons->events.data, cons, &num_trails, &max_timestamp, &max_timedelta))) goto done; /* not the most clean separation of ownership here, but these objects can be huge so keeping them around unecessarily is expensive */ free(cons->events.data); cons->events.data = NULL; j128m_free(&cons->trails); TDB_CLOSE(grouped_w); grouped_w = NULL; TDB_OPEN(grouped_r, grouped_path, "r"); if (!(read_buf = malloc(READ_BUFFER_SIZE))){ ret = TDB_ERR_NOMEM; goto done; } setvbuf(grouped_r, read_buf, _IOFBF, READ_BUFFER_SIZE); TDB_TIMER_END("trail/groupby_uuid"); /* 2. store metatadata */ TDB_TIMER_START TDB_PATH(path, "%s/info", root); if ((ret = store_info(path, num_trails, num_events, cons->min_timestamp, max_timestamp, max_timedelta))) goto done; TDB_TIMER_END("trail/info"); /* 3. collect value (unigram) freqs, including delta-encoded timestamps */ TDB_TIMER_START unigram_freqs = collect_unigrams(grouped_r, num_events, items, num_fields); if (num_events > 0 && !unigram_freqs){ ret = TDB_ERR_NOMEM; goto done; } TDB_TIMER_END("trail/collect_unigrams"); /* 4. construct uni/bi-grams */ tdb_opt_value dont_build_bigrams; tdb_cons_get_opt(cons, TDB_OPT_CONS_NO_BIGRAMS, &dont_build_bigrams); TDB_TIMER_START if ((ret = make_grams(grouped_r, num_events, items, num_fields, unigram_freqs, &gram_freqs, dont_build_bigrams.value))) goto done; TDB_TIMER_END("trail/gram_freqs"); /* 5. build a huffman codebook and stats struct for encoding grams */ TDB_TIMER_START if ((ret = huff_create_codemap(&gram_freqs, &codemap))) goto done; if (!(fstats = huff_field_stats(field_cardinalities, num_fields, max_timedelta))){ ret = TDB_ERR_NOMEM; goto done; } TDB_TIMER_END("trail/huff_create_codemap"); /* 6. 
encode and write trails to disk */ TDB_TIMER_START TDB_PATH(path, "%s/trails.data", root); TDB_PATH(toc_path, "%s/trails.toc", root); if ((ret = encode_trails(items, grouped_r, num_events, num_trails, num_fields, &codemap, &gram_freqs, fstats, path, toc_path))) goto done; TDB_TIMER_END("trail/encode_trails"); /* 7. write huffman codebook to disk */ TDB_TIMER_START tdb_path(path, "%s/trails.codebook", root); if ((ret = store_codebook(&codemap, path))) goto done; TDB_TIMER_END("trail/store_codebook"); done: TDB_CLOSE_FINAL(grouped_w); TDB_CLOSE_FINAL(grouped_r); j128m_free(&gram_freqs); j128m_free(&codemap); #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wsign-compare" JLFA(tmp, unigram_freqs); #pragma GCC diagnostic pop unlink(grouped_path); free(field_cardinalities); free(read_buf); free(fstats); return ret; out_of_memory: return TDB_ERR_NOMEM; } traildb-0.6+dfsg1/src/tdb_encode_model.h0000600000175000017500000000244113106440271017503 0ustar czchenczchen #ifndef __TDB_ENCODE_MODEL_H__ #define __TDB_ENCODE_MODEL_H__ #include #include #include #include "tdb_types.h" #include "judy_128_map.h" struct gram_bufs{ __uint128_t *chosen; uint64_t *scores; /* size of the above two */ uint64_t buf_len; uint8_t *covered; uint64_t num_fields; }; int init_gram_bufs(struct gram_bufs *b, uint64_t num_fields); void free_gram_bufs(struct gram_bufs *b); int choose_grams_one_event(const tdb_item *encoded, uint64_t num_encoded, const struct judy_128_map *gram_freqs, struct gram_bufs *g, __uint128_t *grams, uint64_t *num_grams, const struct tdb_grouped_event *ev); int make_grams(FILE *grouped, uint64_t num_events, const tdb_item *items, uint64_t num_fields, const Pvoid_t unigram_freqs, struct judy_128_map *final_freqs, uint64_t no_bigrams); Pvoid_t collect_unigrams(FILE *grouped, uint64_t num_events, const tdb_item *items, uint64_t num_fields); #endif /* __TDB_ENCODE_MODEL_H__ */ traildb-0.6+dfsg1/src/judy_str_map.c0000600000175000017500000001220113106440271016723 0ustar czchenczchen #include #include #include #undef JUDYERROR #define JUDYERROR(CallerFile, CallerLine, JudyFunc, JudyErrno, JudyErrID) \ { \ if ((JudyErrno) == JU_ERRNO_NOMEM) \ goto out_of_memory; \ } #include #include "judy_str_map.h" #define MAX_NUM_RETRIES 16 struct jsm_item{ uint64_t id; uint64_t length; char value[0]; } __attribute__((packed)); static uint64_t jsm_get_large(struct judy_str_map *jsm, const char *buf, uint64_t length, uint32_t num_retries) { Word_t *ptr; Word_t key; XXH64_reset(&jsm->hash_state, num_retries + 1); XXH64_update(&jsm->hash_state, buf, length); key = XXH64_digest(&jsm->hash_state); JLG(ptr, jsm->large_map, key); if (ptr){ const struct jsm_item *item_ro = (const struct jsm_item*)&jsm->buffer[*ptr - 1]; if (item_ro->length == length && !memcmp(item_ro->value, buf, length)) return item_ro->id; else if (++num_retries < MAX_NUM_RETRIES) return jsm_get_large(jsm, buf, length, num_retries); } return 0; } static uint64_t jsm_insert_large(struct judy_str_map *jsm, const char *buf, uint64_t length, uint32_t num_retries) { Word_t *ptr; Word_t key; XXH64_reset(&jsm->hash_state, num_retries + 1); XXH64_update(&jsm->hash_state, buf, length); key = XXH64_digest(&jsm->hash_state); JLI(ptr, jsm->large_map, key); if (*ptr){ const struct jsm_item *item_ro = (const struct jsm_item*)&jsm->buffer[*ptr - 1]; if (item_ro->length == length && !memcmp(item_ro->value, buf, length)) return item_ro->id; else{ if (++num_retries < MAX_NUM_RETRIES) return jsm_insert_large(jsm, buf, length, num_retries); else{ fprintf(stderr, "All 
hash lookups failed for a key of size %" PRIu64". Very strange!\n", length); return 0; } } }else{ struct jsm_item item; if (jsm->buffer_offset + length + sizeof(item) > jsm->buffer_size){ while (jsm->buffer_offset + length + sizeof(item) > jsm->buffer_size) jsm->buffer_size *= 2; if (!(jsm->buffer = realloc(jsm->buffer, jsm->buffer_size))) return 0; } *ptr = jsm->buffer_offset + 1; item.id = ++jsm->num_keys; item.length = length; memcpy(&jsm->buffer[jsm->buffer_offset], &item, sizeof(item)); jsm->buffer_offset += sizeof(item); memcpy(&jsm->buffer[jsm->buffer_offset], buf, length); jsm->buffer_offset += length; return item.id; } out_of_memory: return 0; } /* fold must return IDs in the ascending order, e.g store_lexicon() relies on this */ void *jsm_fold(const struct judy_str_map *jsm, judy_str_fold_fn fun, void *state) { uint64_t offset = 0; while (offset < jsm->buffer_offset){ const struct jsm_item *item = (const struct jsm_item*)&jsm->buffer[offset]; state = fun(item->id, item->value, item->length, state); offset += item->length + sizeof(struct jsm_item); } return state; } uint64_t jsm_insert(struct judy_str_map *jsm, const char *buf, uint64_t length) { if (length == 0) return 0; return jsm_insert_large(jsm, buf, length, 0); } uint64_t jsm_get(struct judy_str_map *jsm, const char *buf, uint64_t length) { if (length == 0) return 0; return jsm_get_large(jsm, buf, length, 0); } int jsm_init(struct judy_str_map *jsm) { memset(jsm, 0, sizeof(struct judy_str_map)); jsm->buffer_size = BUFFER_INITIAL_SIZE; if (!(jsm->buffer = malloc(jsm->buffer_size))) return 1; return 0; } void jsm_free(struct judy_str_map *jsm) { Word_t tmp; #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wsign-compare" JLFA(tmp, jsm->large_map); #pragma GCC diagnostic pop free(jsm->buffer); out_of_memory: return; } uint64_t jsm_num_keys(const struct judy_str_map *jsm) { return jsm->num_keys; } uint64_t jsm_values_size(const struct judy_str_map *jsm) { return jsm->buffer_offset - jsm->num_keys * sizeof(struct jsm_item); } #ifdef JSM_MAIN int main(int argc, char **argv) { void *print_key(uint64_t id, const char *value, uint64_t len, void *state) { printf("%s", value); return NULL; } FILE *in = fopen(argv[1], "r"); char *line = NULL; size_t len = 0; struct judy_str_map jsm; ssize_t read; jsm_init(&jsm); while ((read = getline(&line, &len, in)) != -1) jsm_insert(&jsm, line, read + 1); fprintf(stderr, "Found %"PRIu64" unique lines\n", jsm_num_keys(&jsm)); jsm_fold(&jsm, print_key, NULL); return 0; } #endif traildb-0.6+dfsg1/src/tdb_queue.h0000600000175000017500000000062613106440271016215 0ustar czchenczchen#ifndef __TDB_QUEUE__ #define __TDB_QUEUE__ #include struct tdb_queue; struct tdb_queue *tdb_queue_new(uint32_t max_length); void tdb_queue_free(struct tdb_queue *q); void tdb_queue_push(struct tdb_queue *q, void *e); void *tdb_queue_pop(struct tdb_queue *q); void *tdb_queue_peek(const struct tdb_queue *q); uint32_t tdb_queue_length(const struct tdb_queue *q); #endif /* __TDB_QUEUE__ */ traildb-0.6+dfsg1/src/tdb_cons.c0000600000175000017500000005205713106440271016033 0ustar czchenczchen#define _DEFAULT_SOURCE /* ftruncate() */ #define _GNU_SOURCE #include #include #include #include #include #include #include #undef JUDYERROR #define JUDYERROR(CallerFile, CallerLine, JudyFunc, JudyErrno, JudyErrID) \ { \ if ((JudyErrno) == JU_ERRNO_NOMEM) \ goto out_of_memory; \ } #include #include "judy_str_map.h" #include "tdb_internal.h" #include "tdb_error.h" #include "tdb_io.h" #include "tdb_package.h" #include "arena.h" #ifndef 
EVENTS_ARENA_INCREMENT #define EVENTS_ARENA_INCREMENT 1000000 #endif struct jm_fold_state{ FILE *out; uint64_t offset; tdb_error ret; uint64_t width; }; static void *lexicon_store_fun(uint64_t id, const char *value, uint64_t len, void *state) { struct jm_fold_state *s = (struct jm_fold_state*)state; int ret = 0; if (s->ret) return state; /* NOTE: vals start at 1, otherwise we would need to +1 */ TDB_SEEK(s->out, id * s->width); TDB_WRITE(s->out, &s->offset, s->width); TDB_SEEK(s->out, s->offset); TDB_WRITE(s->out, value, len); done: s->ret = ret; s->offset += len; return state; } static tdb_error lexicon_store(const struct judy_str_map *lexicon, const char *path) { /* Lexicon format: [ number of values N ] 4 or 8 bytes [ value offsets ... ] N * (4 or 8 bytes) [ last value offset ] 4 or 8 bytes [ values ... ] X bytes */ struct jm_fold_state state; uint64_t count = jsm_num_keys(lexicon); uint64_t size = (count + 2) * 4 + jsm_values_size(lexicon); int ret = 0; state.offset = (count + 2) * 4; state.width = 4; if (size > UINT32_MAX){ size = (count + 2) * 8 + jsm_values_size(lexicon); state.offset = (count + 2) * 8; state.width = 8; } if (size > TDB_MAX_LEXICON_SIZE) return TDB_ERR_LEXICON_TOO_LARGE; state.out = NULL; state.ret = 0; TDB_OPEN(state.out, path, "w"); TDB_TRUNCATE(state.out, (off_t)size); TDB_WRITE(state.out, &count, state.width); jsm_fold(lexicon, lexicon_store_fun, &state); if ((ret = state.ret)) goto done; TDB_SEEK(state.out, (count + 1) * state.width); TDB_WRITE(state.out, &state.offset, state.width); done: TDB_CLOSE_FINAL(state.out); return ret; } static tdb_error store_lexicons(tdb_cons *cons) { tdb_field i; FILE *out = NULL; char path[TDB_MAX_PATH_SIZE]; int ret = 0; TDB_PATH(path, "%s/fields", cons->root); TDB_OPEN(out, path, "w"); for (i = 0; i < cons->num_ofields; i++){ TDB_PATH(path, "%s/lexicon.%s", cons->root, cons->ofield_names[i]); if ((ret = lexicon_store(&cons->lexicons[i], path))) goto done; TDB_FPRINTF(out, "%s\n", cons->ofield_names[i]); } TDB_FPRINTF(out, "\n"); done: TDB_CLOSE_FINAL(out); return ret; } static tdb_error store_version(tdb_cons *cons) { FILE *out = NULL; char path[TDB_MAX_PATH_SIZE]; int ret = 0; TDB_PATH(path, "%s/version", cons->root); TDB_OPEN(out, path, "w"); TDB_FPRINTF(out, "%llu", TDB_VERSION_LATEST); done: TDB_CLOSE_FINAL(out); return ret; } static void *store_uuids_fun(__uint128_t key, Word_t *value __attribute__((unused)), void *state) { struct jm_fold_state *s = (struct jm_fold_state*)state; int ret = 0; TDB_WRITE(s->out, &key, 16); done: s->ret = ret; return s; } static tdb_error store_uuids(tdb_cons *cons) { char path[TDB_MAX_PATH_SIZE]; struct jm_fold_state state = {.ret = 0}; uint64_t num_trails = j128m_num_keys(&cons->trails); int ret = 0; /* this is why num_trails < TDB_MAX)NUM_TRAILS < 2^59: (2^59 - 1) * 16 < LONG_MAX (off_t) */ if (num_trails > TDB_MAX_NUM_TRAILS) return TDB_ERR_TOO_MANY_TRAILS; TDB_PATH(path, "%s/uuids", cons->root); TDB_OPEN(state.out, path, "w"); TDB_TRUNCATE(state.out, ((off_t)(num_trails * 16))); j128m_fold(&cons->trails, store_uuids_fun, &state); ret = state.ret; done: TDB_CLOSE_FINAL(state.out); return ret; } int is_fieldname_invalid(const char* field) { uint64_t i; if (!strcmp(field, "time")) return 1; for (i = 0; i < TDB_MAX_FIELDNAME_LENGTH && field[i]; i++) if (!index(TDB_FIELDNAME_CHARS, field[i])) return 1; if (i == 0 || i == TDB_MAX_FIELDNAME_LENGTH) return 1; return 0; } static tdb_error find_duplicate_fieldnames(const char **ofield_names, uint64_t num_ofields) { Pvoid_t check = NULL; tdb_field i; 
Word_t tmp; #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wsign-compare" for (i = 0; i < num_ofields; i++){ Word_t *ptr; JSLI(ptr, check, (const uint8_t*)ofield_names[i]); if (*ptr){ JSLFA(tmp, check); return TDB_ERR_DUPLICATE_FIELDS; } *ptr = 1; } JSLFA(tmp, check); #pragma GCC diagnostic pop return 0; out_of_memory: return TDB_ERR_NOMEM; } TDB_EXPORT tdb_cons *tdb_cons_init(void) { tdb_cons *c = calloc(1, sizeof(tdb_cons)); if (c){ /* this will fail if libarchive is not found but it is ok, we just fall back to the directory mode */ tdb_cons_set_opt(c, TDB_OPT_CONS_OUTPUT_FORMAT, opt_val(TDB_OPT_CONS_OUTPUT_FORMAT_PACKAGE)); } return c; } TDB_EXPORT tdb_error tdb_cons_open(tdb_cons *cons, const char *root, const char **ofield_names, uint64_t num_ofields) { tdb_field i; int fd; int ret = 0; /* by handling the "cons == NULL" case here gracefully, we allow the return value of tdb_init() to be used unchecked like here: int err; tdb_cons *cons = tdb_cons_init(); if ((err = tdb_cons_open(cons, path, fields, num_fields))) printf("Opening cons failed: %s", tdb_error(err)); */ if (!cons) return TDB_ERR_HANDLE_IS_NULL; if (cons->events.item_size) return TDB_ERR_HANDLE_ALREADY_OPENED; if (num_ofields > TDB_MAX_NUM_FIELDS) return TDB_ERR_TOO_MANY_FIELDS; if ((ret = find_duplicate_fieldnames(ofield_names, num_ofields))) goto done; if (!(cons->ofield_names = calloc(num_ofields, sizeof(char*)))) return TDB_ERR_NOMEM; for (i = 0; i < num_ofields; i++){ if (is_fieldname_invalid(ofield_names[i])){ ret = TDB_ERR_INVALID_FIELDNAME; goto done; } if (!(cons->ofield_names[i] = strdup(ofield_names[i]))){ ret = TDB_ERR_NOMEM; goto done; } } j128m_init(&cons->trails); if (!(cons->root = strdup(root))){ ret = TDB_ERR_NOMEM; goto done; } cons->min_timestamp = UINT64_MAX; cons->num_ofields = num_ofields; cons->events.arena_increment = EVENTS_ARENA_INCREMENT; cons->events.item_size = sizeof(struct tdb_cons_event); cons->items.item_size = sizeof(tdb_item); /* Opportunistically try to create the output directory. We don't care if it fails, e.g. because it already exists */ mkdir(root, 0755); TDB_PATH(cons->tempfile, "%s/tmp.items.XXXXXX", root); if ((fd = mkstemp(cons->tempfile)) == -1){ ret = TDB_ERR_IO_OPEN; goto done; } if (!(cons->items.fd = fdopen(fd, "w"))){ ret = TDB_ERR_IO_OPEN; goto done; } if (cons->num_ofields > 0) if (!(cons->lexicons = calloc(cons->num_ofields, sizeof(struct judy_str_map)))){ ret = TDB_ERR_NOMEM; goto done; } for (i = 0; i < cons->num_ofields; i++) if (jsm_init(&cons->lexicons[i])){ ret = TDB_ERR_NOMEM; goto done; } done: return ret; } TDB_EXPORT void tdb_cons_close(tdb_cons *cons) { if(cons){ uint64_t i; for (i = 0; i < cons->num_ofields; i++){ if (cons->ofield_names) free(cons->ofield_names[i]); if (cons->lexicons) jsm_free(&cons->lexicons[i]); } free(cons->lexicons); if (cons->items.fd) fclose(cons->items.fd); if (cons->events.data) free(cons->events.data); if (cons->items.data) free(cons->items.data); j128m_free(&cons->trails); free(cons->ofield_names); free(cons->root); free(cons); } } /* Append an event in this cons. 
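   An event consists of a 16-byte UUID, a timestamp and one value per
   configured field; the new event is linked to the previous event of the
   same UUID via prev_event_idx so trails can be reconstructed later.

   A minimal usage sketch (error handling omitted; the field name, value,
   UUID and timestamp below are made-up examples):

       const char *fields[] = {"action"};
       const char *values[] = {"click"};
       uint64_t lengths[] = {5};
       uint8_t uuid[16] = {0};

       tdb_cons *c = tdb_cons_init();
       tdb_cons_open(c, "example", fields, 1);
       tdb_cons_add(c, uuid, 1493000000, values, lengths);
       tdb_cons_finalize(c);
       tdb_cons_close(c);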
*/ TDB_EXPORT tdb_error tdb_cons_add(tdb_cons *cons, const uint8_t uuid[16], const uint64_t timestamp, const char **values, const uint64_t *value_lengths) { tdb_field i; struct tdb_cons_event *event; Word_t *uuid_ptr; __uint128_t uuid_key; for (i = 0; i < cons->num_ofields; i++) if (value_lengths[i] > TDB_MAX_VALUE_SIZE) return TDB_ERR_VALUE_TOO_LONG; memcpy(&uuid_key, uuid, 16); uuid_ptr = j128m_insert(&cons->trails, uuid_key); if (!(event = (struct tdb_cons_event*)arena_add_item(&cons->events))) return TDB_ERR_NOMEM; event->item_zero = cons->items.next; event->num_items = 0; event->timestamp = timestamp; event->prev_event_idx = *uuid_ptr; *uuid_ptr = cons->events.next; if (timestamp < cons->min_timestamp) cons->min_timestamp = timestamp; for (i = 0; i < cons->num_ofields; i++){ tdb_field field = (tdb_field)(i + 1); tdb_val val = 0; tdb_item item; void *dst; if (value_lengths[i]){ if (!(val = (tdb_val)jsm_insert(&cons->lexicons[i], values[i], value_lengths[i]))) return TDB_ERR_NOMEM; } item = tdb_make_item(field, val); if (!(dst = arena_add_item(&cons->items))) /* cons->items is a file-backed arena, so this is most likely caused by disk being full, hence an IO error. */ return TDB_ERR_IO_WRITE; memcpy(dst, &item, sizeof(tdb_item)); ++event->num_items; } return 0; } /* this function adds events from db to cons one by one, using the public API. We need to use this with filtered dbs or otherwise when we need to re-create lexicons. */ static tdb_error tdb_cons_append_subset_lexicon(tdb_cons *cons, const tdb *db) { const char **values = NULL; uint64_t *lengths = NULL; uint64_t i, trail_id; int ret = 0; const uint64_t num_fields = tdb_num_fields(db); tdb_cursor *cursor = tdb_cursor_new(db); if (!cursor) return TDB_ERR_NOMEM; if (!(values = malloc(num_fields * sizeof(char*)))){ ret = TDB_ERR_NOMEM; goto done; } if (!(lengths = malloc(num_fields * sizeof(uint64_t)))){ ret = TDB_ERR_NOMEM; goto done; } for (trail_id = 0; trail_id < tdb_num_trails(db); trail_id++){ const tdb_event *event; if ((ret = tdb_get_trail(cursor, trail_id))) goto done; /* lookup UUID only if there are events: expensive to perform many unnecessary lookups with selective filters */ if (tdb_cursor_peek(cursor)){ const uint8_t *uuid = tdb_get_uuid(db, trail_id); while ((event = tdb_cursor_next(cursor))){ /* with TDB_OPT_ONLY_DIFF_ITEMS event->items may be sparse, hence we need to reset lengths to zero */ memset(lengths, 0, num_fields * sizeof(uint64_t)); for (i = 0; i < event->num_items; i++){ tdb_field field = tdb_item_field(event->items[i]); tdb_val val = tdb_item_val(event->items[i]); values[field - 1] = tdb_get_value(db, field, val, &lengths[field - 1]); } if ((ret = tdb_cons_add(cons, uuid, event->timestamp, values, lengths))) goto done; } } } done: free(values); free(lengths); tdb_cursor_free(cursor); return ret; } /* Append the lexicons of an existing TrailDB, db, to this cons. Used by tdb_cons_append(). 
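   Returns one translation table per configurable field: entry
   [old_val - 1] of a field's table holds the val of the same string in
   the lexicons of this cons (see append_event() below), so items can be
   remapped without a string lookup. Returns NULL on out-of-memory.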
*/ static uint64_t **append_lexicons(tdb_cons *cons, const tdb *db) { tdb_val **lexicon_maps; tdb_val i; tdb_field field; if (!(lexicon_maps = calloc(cons->num_ofields, sizeof(tdb_val*)))) return NULL; for (field = 0; field < cons->num_ofields; field++){ struct tdb_lexicon lex; uint64_t *map; tdb_lexicon_read(db, field + 1, &lex); if (!(map = lexicon_maps[field] = malloc(lex.size * sizeof(tdb_val)))) goto error; for (i = 0; i < lex.size; i++){ uint64_t value_length; const char *value = tdb_lexicon_get(&lex, i, &value_length); tdb_val val; if ((val = (tdb_val)jsm_insert(&cons->lexicons[field], value, value_length))) map[i] = val; else goto error; } } return lexicon_maps; error: for (i = 0; i < cons->num_ofields; i++) free(lexicon_maps[i]); free(lexicon_maps); return NULL; } /* Take an event from the old db, translate its items to new vals and append to the new cons */ static tdb_error append_event(tdb_cons *cons, const tdb_event *event, Word_t *uuid_ptr, tdb_val **lexicon_maps) { uint64_t i; struct tdb_cons_event *new_event = (struct tdb_cons_event*)arena_add_item(&cons->events); if (!new_event) return TDB_ERR_NOMEM; new_event->item_zero = cons->items.next; new_event->num_items = 0; new_event->timestamp = event->timestamp; new_event->prev_event_idx = *uuid_ptr; *uuid_ptr = cons->events.next; for (i = 0; i < event->num_items; i++){ tdb_val val = tdb_item_val(event->items[i]); tdb_field field = tdb_item_field(event->items[i]); tdb_val new_val = 0; /* translate val */ if (val) new_val = lexicon_maps[field - 1][val - 1]; tdb_item item = tdb_make_item(field, new_val); void *dst = arena_add_item(&cons->items); if (!dst) /* cons->items is a file-backed arena, so this is most likely caused by disk being full, hence an IO error. */ return TDB_ERR_IO_WRITE; memcpy(dst, &item, sizeof(tdb_item)); ++new_event->num_items; } return TDB_ERR_OK; } /* this function is an optimized version of tdb_cons_append_subset_lexicon(): instead of mapping items to strings and back, we know that all entries from the lexicon will be needed, so we can merge the lexicons and add remap items in db to items in cons, without going through strings. */ static tdb_error tdb_cons_append_full_lexicon(tdb_cons *cons, const tdb *db) { tdb_val **lexicon_maps = NULL; uint64_t i, trail_id; int ret = 0; tdb_cursor *cursor = tdb_cursor_new(db); if (!cursor) return TDB_ERR_NOMEM; if (db->min_timestamp < cons->min_timestamp) cons->min_timestamp = db->min_timestamp; if (!(lexicon_maps = append_lexicons(cons, db))){ ret = TDB_ERR_NOMEM; goto done; } for (trail_id = 0; trail_id < tdb_num_trails(db); trail_id++){ __uint128_t uuid_key; Word_t *uuid_ptr; const tdb_event *event; if ((ret = tdb_get_trail(cursor, trail_id))) goto done; /* lookup UUID only if there are events: expensive to perform many unnecessary lookups with selective filters */ if (tdb_cursor_peek(cursor)){ memcpy(&uuid_key, tdb_get_uuid(db, trail_id), 16); uuid_ptr = j128m_insert(&cons->trails, uuid_key); while ((event = tdb_cursor_next(cursor))) if ((ret = append_event(cons, event, uuid_ptr, lexicon_maps))) goto done; } } done: if (lexicon_maps){ for (i = 0; i < cons->num_ofields; i++) free(lexicon_maps[i]); free(lexicon_maps); } tdb_cursor_free(cursor); return ret; } /* Merge an existing tdb to the new cons. */ TDB_EXPORT tdb_error tdb_cons_append(tdb_cons *cons, const tdb *db) { tdb_field field; /* NOTE we could be much more permissive with what can be joined: we could support "full outer join" and replace all missing fields with NULLs automatically. 
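   As implemented below, the number of fields and the field names (in the
   same order) must match exactly, otherwise TDB_ERR_APPEND_FIELDS_MISMATCH
   is returned. If db was opened with an event filter or edge encoding,
   the slower string-based tdb_cons_append_subset_lexicon() is used;
   otherwise the lexicons are merged directly with
   tdb_cons_append_full_lexicon().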
*/ if (cons->num_ofields != db->num_fields - 1) return TDB_ERR_APPEND_FIELDS_MISMATCH; for (field = 0; field < cons->num_ofields; field++) if (strcmp(cons->ofield_names[field], tdb_get_field_name(db, field + 1))) return TDB_ERR_APPEND_FIELDS_MISMATCH; /* NOTE: When you add new options in tdb, remember to add them to the list below if they cause only a subset of events to be returned. */ if (db->opt_event_filter || db->opt_edge_encoded || db->opt_trail_event_filters) /* Standard append: recreate lexicons through strings. We need to do this when only a subset of events is appended. */ return tdb_cons_append_subset_lexicon(cons, db); else /* Optimized append: merge lexicons, remap items. We can do this when all events are appended. */ return tdb_cons_append_full_lexicon(cons, db); } TDB_EXPORT tdb_error tdb_cons_finalize(tdb_cons *cons) { struct tdb_file items_mmapped; uint64_t num_events = cons->events.next; int ret = 0; memset(&items_mmapped, 0, sizeof(struct tdb_file)); /* finalize event items */ if ((ret = arena_flush(&cons->items))) goto done; if (cons->items.fd && fclose(cons->items.fd)) { cons->items.fd = NULL; ret = TDB_ERR_IO_CLOSE; goto done; } cons->items.fd = NULL; if (cons->tempfile[0]){ if (num_events && cons->num_ofields) { if (file_mmap(cons->tempfile, NULL, &items_mmapped, NULL)){ ret = TDB_ERR_IO_READ; goto done; } } TDB_TIMER_DEF TDB_TIMER_START if ((ret = store_lexicons(cons))) goto done; TDB_TIMER_END("encoder/store_lexicons") TDB_TIMER_START if ((ret = store_uuids(cons))) goto done; TDB_TIMER_END("encoder/store_uuids") TDB_TIMER_START if ((ret = store_version(cons))) goto done; TDB_TIMER_END("encoder/store_version") TDB_TIMER_START if ((ret = tdb_encode(cons, (const tdb_item*)items_mmapped.data))) goto done; TDB_TIMER_END("encoder/encode") } done: if (items_mmapped.ptr) munmap(items_mmapped.ptr, items_mmapped.mmap_size); if (cons->tempfile[0]) unlink(cons->tempfile); if (!ret){ #ifdef HAVE_ARCHIVE_H if (cons->output_format == TDB_OPT_CONS_OUTPUT_FORMAT_PACKAGE) ret = cons_package(cons); #endif } return ret; } TDB_EXPORT tdb_error tdb_cons_set_opt(tdb_cons *cons, tdb_opt_key key, tdb_opt_value value) { switch (key){ case TDB_OPT_CONS_OUTPUT_FORMAT: switch (value.value){ #ifdef HAVE_ARCHIVE_H case TDB_OPT_CONS_OUTPUT_FORMAT_PACKAGE: #endif case TDB_OPT_CONS_OUTPUT_FORMAT_DIR: cons->output_format = value.value; return 0; default: return TDB_ERR_INVALID_OPTION_VALUE; } case TDB_OPT_CONS_NO_BIGRAMS: cons->no_bigrams = !(!(value.value)); return 0; default: return TDB_ERR_UNKNOWN_OPTION; } } TDB_EXPORT tdb_error tdb_cons_get_opt(tdb_cons *cons, tdb_opt_key key, tdb_opt_value *value) { switch (key){ case TDB_OPT_CONS_OUTPUT_FORMAT: value->value = cons->output_format; return 0; case TDB_OPT_CONS_NO_BIGRAMS: value->value = cons->no_bigrams; return 0; default: return TDB_ERR_UNKNOWN_OPTION; } } traildb-0.6+dfsg1/src/xxhash/0000700000175000017500000000000013106440271015364 5ustar czchenczchentraildb-0.6+dfsg1/src/xxhash/xxhash.c0000600000175000017500000007037213106440271017046 0ustar czchenczchen/* xxHash - Fast Hash algorithm Copyright (C) 2012-2015, Yann Collet BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact the author at : - xxHash source repository : https://github.com/Cyan4973/xxHash */ /************************************** * Tuning parameters **************************************/ /* XXH_FORCE_MEMORY_ACCESS * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable. * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal. * The below switch allow to select different access method for improved performance. * Method 0 (default) : use `memcpy()`. Safe and portable. * Method 1 : `__packed` statement. It depends on compiler extension (ie, not portable). * This method is safe if your compiler supports it, and *generally* as fast or faster than `memcpy`. * Method 2 : direct access. This method is portable but violate C standard. * It can generate buggy code on targets which generate assembly depending on alignment. * But in some circumstances, it's the only known way to get the most performance (ie GCC + ARMv6) * See http://stackoverflow.com/a/32095106/646947 for details. * Prefer these methods in priority order (0 > 1 > 2) */ #ifndef XXH_FORCE_MEMORY_ACCESS /* can be defined externally, on command line for example */ # if defined(__GNUC__) && ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) ) # define XXH_FORCE_MEMORY_ACCESS 2 # elif defined(__INTEL_COMPILER) || \ (defined(__GNUC__) && ( defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7S__) )) # define XXH_FORCE_MEMORY_ACCESS 1 # endif #endif /* XXH_ACCEPT_NULL_INPUT_POINTER : * If the input pointer is a null pointer, xxHash default behavior is to trigger a memory access error, since it is a bad pointer. * When this option is enabled, xxHash output for null input pointers will be the same as a null-length input. * By default, this option is disabled. To enable it, uncomment below define : */ /* #define XXH_ACCEPT_NULL_INPUT_POINTER 1 */ /* XXH_FORCE_NATIVE_FORMAT : * By default, xxHash library provides endian-independant Hash values, based on little-endian convention. * Results are therefore identical for little-endian and big-endian CPU. * This comes at a performance cost for big-endian CPU, since some swapping is required to emulate little-endian format. * Should endian-independance be of no importance for your application, you may set the #define below to 1, * to improve speed for Big-endian CPU. 
* This option has no impact on Little_Endian CPU. */ #define XXH_FORCE_NATIVE_FORMAT 0 /* XXH_USELESS_ALIGN_BRANCH : * This is a minor performance trick, only useful with lots of very small keys. * It means : don't make a test between aligned/unaligned, because performance will be the same. * It saves one initial branch per hash. */ #if defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64) # define XXH_USELESS_ALIGN_BRANCH 1 #endif /************************************** * Compiler Specific Options ***************************************/ #ifdef _MSC_VER /* Visual Studio */ # pragma warning(disable : 4127) /* disable: C4127: conditional expression is constant */ # define FORCE_INLINE static __forceinline #else # if defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L /* C99 */ # ifdef __GNUC__ # define FORCE_INLINE static inline __attribute__((always_inline)) # else # define FORCE_INLINE static inline # endif # else # define FORCE_INLINE static # endif /* __STDC_VERSION__ */ #endif /************************************** * Includes & Memory related functions ***************************************/ #include "xxhash.h" /* Modify the local functions below should you wish to use some other memory routines */ /* for malloc(), free() */ #include static void* XXH_malloc(size_t s) { return malloc(s); } static void XXH_free (void* p) { free(p); } /* for memcpy() */ #include static void* XXH_memcpy(void* dest, const void* src, size_t size) { return memcpy(dest,src,size); } /************************************** * Basic Types ***************************************/ #if defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L /* C99 */ # include typedef uint8_t BYTE; typedef uint16_t U16; typedef uint32_t U32; typedef int32_t S32; typedef uint64_t U64; #else typedef unsigned char BYTE; typedef unsigned short U16; typedef unsigned int U32; typedef signed int S32; typedef unsigned long long U64; #endif #if (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==2)) /* Force direct memory access. Only works on CPU which support unaligned memory access in hardware */ static U32 XXH_read32(const void* memPtr) { return *(const U32*) memPtr; } static U64 XXH_read64(const void* memPtr) { return *(const U64*) memPtr; } #elif (defined(XXH_FORCE_MEMORY_ACCESS) && (XXH_FORCE_MEMORY_ACCESS==1)) /* __pack instructions are safer, but compiler specific, hence potentially problematic for some compilers */ /* currently only defined for gcc and icc */ typedef union { U32 u32; U64 u64; } __attribute__((packed)) unalign; static U32 XXH_read32(const void* ptr) { return ((const unalign*)ptr)->u32; } static U64 XXH_read64(const void* ptr) { return ((const unalign*)ptr)->u64; } #else /* portable and safe solution. Generally efficient. 
* see : http://stackoverflow.com/a/32095106/646947 */ static U32 XXH_read32(const void* memPtr) { U32 val; memcpy(&val, memPtr, sizeof(val)); return val; } static U64 XXH_read64(const void* memPtr) { U64 val; memcpy(&val, memPtr, sizeof(val)); return val; } #endif // XXH_FORCE_DIRECT_MEMORY_ACCESS /****************************************** * Compiler-specific Functions and Macros ******************************************/ #define GCC_VERSION (__GNUC__ * 100 + __GNUC_MINOR__) /* Note : although _rotl exists for minGW (GCC under windows), performance seems poor */ #if defined(_MSC_VER) # define XXH_rotl32(x,r) _rotl(x,r) # define XXH_rotl64(x,r) _rotl64(x,r) #else # define XXH_rotl32(x,r) ((x << r) | (x >> (32 - r))) # define XXH_rotl64(x,r) ((x << r) | (x >> (64 - r))) #endif #if defined(_MSC_VER) /* Visual Studio */ # define XXH_swap32 _byteswap_ulong # define XXH_swap64 _byteswap_uint64 #elif GCC_VERSION >= 403 # define XXH_swap32 __builtin_bswap32 # define XXH_swap64 __builtin_bswap64 #else static U32 XXH_swap32 (U32 x) { return ((x << 24) & 0xff000000 ) | ((x << 8) & 0x00ff0000 ) | ((x >> 8) & 0x0000ff00 ) | ((x >> 24) & 0x000000ff ); } static U64 XXH_swap64 (U64 x) { return ((x << 56) & 0xff00000000000000ULL) | ((x << 40) & 0x00ff000000000000ULL) | ((x << 24) & 0x0000ff0000000000ULL) | ((x << 8) & 0x000000ff00000000ULL) | ((x >> 8) & 0x00000000ff000000ULL) | ((x >> 24) & 0x0000000000ff0000ULL) | ((x >> 40) & 0x000000000000ff00ULL) | ((x >> 56) & 0x00000000000000ffULL); } #endif /*************************************** * Architecture Macros ***************************************/ typedef enum { XXH_bigEndian=0, XXH_littleEndian=1 } XXH_endianess; /* XXH_CPU_LITTLE_ENDIAN can be defined externally, for example one the compiler command line */ #ifndef XXH_CPU_LITTLE_ENDIAN static const int one = 1; # define XXH_CPU_LITTLE_ENDIAN (*(const char*)(&one)) #endif /***************************** * Memory reads *****************************/ typedef enum { XXH_aligned, XXH_unaligned } XXH_alignment; FORCE_INLINE U32 XXH_readLE32_align(const void* ptr, XXH_endianess endian, XXH_alignment align) { if (align==XXH_unaligned) return endian==XXH_littleEndian ? XXH_read32(ptr) : XXH_swap32(XXH_read32(ptr)); else return endian==XXH_littleEndian ? *(const U32*)ptr : XXH_swap32(*(const U32*)ptr); } FORCE_INLINE U32 XXH_readLE32(const void* ptr, XXH_endianess endian) { return XXH_readLE32_align(ptr, endian, XXH_unaligned); } FORCE_INLINE U64 XXH_readLE64_align(const void* ptr, XXH_endianess endian, XXH_alignment align) { if (align==XXH_unaligned) return endian==XXH_littleEndian ? XXH_read64(ptr) : XXH_swap64(XXH_read64(ptr)); else return endian==XXH_littleEndian ? 
*(const U64*)ptr : XXH_swap64(*(const U64*)ptr); } FORCE_INLINE U64 XXH_readLE64(const void* ptr, XXH_endianess endian) { return XXH_readLE64_align(ptr, endian, XXH_unaligned); } /*************************************** * Macros ***************************************/ #define XXH_STATIC_ASSERT(c) { enum { XXH_static_assert = 1/(!!(c)) }; } /* use only *after* variable declarations */ /*************************************** * Constants ***************************************/ #define PRIME32_1 2654435761U #define PRIME32_2 2246822519U #define PRIME32_3 3266489917U #define PRIME32_4 668265263U #define PRIME32_5 374761393U #define PRIME64_1 11400714785074694791ULL #define PRIME64_2 14029467366897019727ULL #define PRIME64_3 1609587929392839161ULL #define PRIME64_4 9650029242287828579ULL #define PRIME64_5 2870177450012600261ULL /***************************** * Simple Hash Functions *****************************/ FORCE_INLINE U32 XXH32_endian_align(const void* input, size_t len, U32 seed, XXH_endianess endian, XXH_alignment align) { const BYTE* p = (const BYTE*)input; const BYTE* bEnd = p + len; U32 h32; #define XXH_get32bits(p) XXH_readLE32_align(p, endian, align) #ifdef XXH_ACCEPT_NULL_INPUT_POINTER if (p==NULL) { len=0; bEnd=p=(const BYTE*)(size_t)16; } #endif if (len>=16) { const BYTE* const limit = bEnd - 16; U32 v1 = seed + PRIME32_1 + PRIME32_2; U32 v2 = seed + PRIME32_2; U32 v3 = seed + 0; U32 v4 = seed - PRIME32_1; do { v1 += XXH_get32bits(p) * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1; p+=4; v2 += XXH_get32bits(p) * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1; p+=4; v3 += XXH_get32bits(p) * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1; p+=4; v4 += XXH_get32bits(p) * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1; p+=4; } while (p<=limit); h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18); } else { h32 = seed + PRIME32_5; } h32 += (U32) len; while (p+4<=bEnd) { h32 += XXH_get32bits(p) * PRIME32_3; h32 = XXH_rotl32(h32, 17) * PRIME32_4 ; p+=4; } while (p> 15; h32 *= PRIME32_2; h32 ^= h32 >> 13; h32 *= PRIME32_3; h32 ^= h32 >> 16; return h32; } unsigned int XXH32 (const void* input, size_t len, unsigned int seed) { #if 0 /* Simple version, good for code maintenance, but unfortunately slow for small inputs */ XXH32_state_t state; XXH32_reset(&state, seed); XXH32_update(&state, input, len); return XXH32_digest(&state); #else XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; # if !defined(XXH_USELESS_ALIGN_BRANCH) if ((((size_t)input) & 3) == 0) /* Input is 4-bytes aligned, leverage the speed benefit */ { if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH32_endian_align(input, len, seed, XXH_littleEndian, XXH_aligned); else return XXH32_endian_align(input, len, seed, XXH_bigEndian, XXH_aligned); } # endif if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH32_endian_align(input, len, seed, XXH_littleEndian, XXH_unaligned); else return XXH32_endian_align(input, len, seed, XXH_bigEndian, XXH_unaligned); #endif } FORCE_INLINE U64 XXH64_endian_align(const void* input, size_t len, U64 seed, XXH_endianess endian, XXH_alignment align) { const BYTE* p = (const BYTE*)input; const BYTE* bEnd = p + len; U64 h64; #define XXH_get64bits(p) XXH_readLE64_align(p, endian, align) #ifdef XXH_ACCEPT_NULL_INPUT_POINTER if (p==NULL) { len=0; bEnd=p=(const BYTE*)(size_t)32; } #endif if (len>=32) { const BYTE* const limit = bEnd - 32; U64 v1 = seed + PRIME64_1 + PRIME64_2; 
U64 v2 = seed + PRIME64_2; U64 v3 = seed + 0; U64 v4 = seed - PRIME64_1; do { v1 += XXH_get64bits(p) * PRIME64_2; p+=8; v1 = XXH_rotl64(v1, 31); v1 *= PRIME64_1; v2 += XXH_get64bits(p) * PRIME64_2; p+=8; v2 = XXH_rotl64(v2, 31); v2 *= PRIME64_1; v3 += XXH_get64bits(p) * PRIME64_2; p+=8; v3 = XXH_rotl64(v3, 31); v3 *= PRIME64_1; v4 += XXH_get64bits(p) * PRIME64_2; p+=8; v4 = XXH_rotl64(v4, 31); v4 *= PRIME64_1; } while (p<=limit); h64 = XXH_rotl64(v1, 1) + XXH_rotl64(v2, 7) + XXH_rotl64(v3, 12) + XXH_rotl64(v4, 18); v1 *= PRIME64_2; v1 = XXH_rotl64(v1, 31); v1 *= PRIME64_1; h64 ^= v1; h64 = h64 * PRIME64_1 + PRIME64_4; v2 *= PRIME64_2; v2 = XXH_rotl64(v2, 31); v2 *= PRIME64_1; h64 ^= v2; h64 = h64 * PRIME64_1 + PRIME64_4; v3 *= PRIME64_2; v3 = XXH_rotl64(v3, 31); v3 *= PRIME64_1; h64 ^= v3; h64 = h64 * PRIME64_1 + PRIME64_4; v4 *= PRIME64_2; v4 = XXH_rotl64(v4, 31); v4 *= PRIME64_1; h64 ^= v4; h64 = h64 * PRIME64_1 + PRIME64_4; } else { h64 = seed + PRIME64_5; } h64 += (U64) len; while (p+8<=bEnd) { U64 k1 = XXH_get64bits(p); k1 *= PRIME64_2; k1 = XXH_rotl64(k1,31); k1 *= PRIME64_1; h64 ^= k1; h64 = XXH_rotl64(h64,27) * PRIME64_1 + PRIME64_4; p+=8; } if (p+4<=bEnd) { h64 ^= (U64)(XXH_get32bits(p)) * PRIME64_1; h64 = XXH_rotl64(h64, 23) * PRIME64_2 + PRIME64_3; p+=4; } while (p> 33; h64 *= PRIME64_2; h64 ^= h64 >> 29; h64 *= PRIME64_3; h64 ^= h64 >> 32; return h64; } unsigned long long XXH64 (const void* input, size_t len, unsigned long long seed) { #if 0 /* Simple version, good for code maintenance, but unfortunately slow for small inputs */ XXH64_state_t state; XXH64_reset(&state, seed); XXH64_update(&state, input, len); return XXH64_digest(&state); #else XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; # if !defined(XXH_USELESS_ALIGN_BRANCH) if ((((size_t)input) & 7)==0) /* Input is aligned, let's leverage the speed advantage */ { if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH64_endian_align(input, len, seed, XXH_littleEndian, XXH_aligned); else return XXH64_endian_align(input, len, seed, XXH_bigEndian, XXH_aligned); } # endif if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH64_endian_align(input, len, seed, XXH_littleEndian, XXH_unaligned); else return XXH64_endian_align(input, len, seed, XXH_bigEndian, XXH_unaligned); #endif } /**************************************************** * Advanced Hash Functions ****************************************************/ /*** Allocation ***/ typedef struct { U64 total_len; U32 seed; U32 v1; U32 v2; U32 v3; U32 v4; U32 mem32[4]; /* defined as U32 for alignment */ U32 memsize; } XXH_istate32_t; typedef struct { U64 total_len; U64 seed; U64 v1; U64 v2; U64 v3; U64 v4; U64 mem64[4]; /* defined as U64 for alignment */ U32 memsize; } XXH_istate64_t; XXH32_state_t* XXH32_createState(void) { XXH_STATIC_ASSERT(sizeof(XXH32_state_t) >= sizeof(XXH_istate32_t)); /* A compilation error here means XXH32_state_t is not large enough */ return (XXH32_state_t*)XXH_malloc(sizeof(XXH32_state_t)); } XXH_errorcode XXH32_freeState(XXH32_state_t* statePtr) { XXH_free(statePtr); return XXH_OK; } XXH64_state_t* XXH64_createState(void) { XXH_STATIC_ASSERT(sizeof(XXH64_state_t) >= sizeof(XXH_istate64_t)); /* A compilation error here means XXH64_state_t is not large enough */ return (XXH64_state_t*)XXH_malloc(sizeof(XXH64_state_t)); } XXH_errorcode XXH64_freeState(XXH64_state_t* statePtr) { XXH_free(statePtr); return XXH_OK; } /*** Hash feed ***/ XXH_errorcode XXH32_reset(XXH32_state_t* state_in, 
unsigned int seed) { XXH_istate32_t* state = (XXH_istate32_t*) state_in; state->seed = seed; state->v1 = seed + PRIME32_1 + PRIME32_2; state->v2 = seed + PRIME32_2; state->v3 = seed + 0; state->v4 = seed - PRIME32_1; state->total_len = 0; state->memsize = 0; return XXH_OK; } XXH_errorcode XXH64_reset(XXH64_state_t* state_in, unsigned long long seed) { XXH_istate64_t* state = (XXH_istate64_t*) state_in; state->seed = seed; state->v1 = seed + PRIME64_1 + PRIME64_2; state->v2 = seed + PRIME64_2; state->v3 = seed + 0; state->v4 = seed - PRIME64_1; state->total_len = 0; state->memsize = 0; return XXH_OK; } FORCE_INLINE XXH_errorcode XXH32_update_endian (XXH32_state_t* state_in, const void* input, size_t len, XXH_endianess endian) { XXH_istate32_t* state = (XXH_istate32_t *) state_in; const BYTE* p = (const BYTE*)input; const BYTE* const bEnd = p + len; #ifdef XXH_ACCEPT_NULL_INPUT_POINTER if (input==NULL) return XXH_ERROR; #endif state->total_len += len; if (state->memsize + len < 16) /* fill in tmp buffer */ { XXH_memcpy((BYTE*)(state->mem32) + state->memsize, input, len); state->memsize += (U32)len; return XXH_OK; } if (state->memsize) /* some data left from previous update */ { XXH_memcpy((BYTE*)(state->mem32) + state->memsize, input, 16-state->memsize); { const U32* p32 = state->mem32; state->v1 += XXH_readLE32(p32, endian) * PRIME32_2; state->v1 = XXH_rotl32(state->v1, 13); state->v1 *= PRIME32_1; p32++; state->v2 += XXH_readLE32(p32, endian) * PRIME32_2; state->v2 = XXH_rotl32(state->v2, 13); state->v2 *= PRIME32_1; p32++; state->v3 += XXH_readLE32(p32, endian) * PRIME32_2; state->v3 = XXH_rotl32(state->v3, 13); state->v3 *= PRIME32_1; p32++; state->v4 += XXH_readLE32(p32, endian) * PRIME32_2; state->v4 = XXH_rotl32(state->v4, 13); state->v4 *= PRIME32_1; p32++; } p += 16-state->memsize; state->memsize = 0; } if (p <= bEnd-16) { const BYTE* const limit = bEnd - 16; U32 v1 = state->v1; U32 v2 = state->v2; U32 v3 = state->v3; U32 v4 = state->v4; do { v1 += XXH_readLE32(p, endian) * PRIME32_2; v1 = XXH_rotl32(v1, 13); v1 *= PRIME32_1; p+=4; v2 += XXH_readLE32(p, endian) * PRIME32_2; v2 = XXH_rotl32(v2, 13); v2 *= PRIME32_1; p+=4; v3 += XXH_readLE32(p, endian) * PRIME32_2; v3 = XXH_rotl32(v3, 13); v3 *= PRIME32_1; p+=4; v4 += XXH_readLE32(p, endian) * PRIME32_2; v4 = XXH_rotl32(v4, 13); v4 *= PRIME32_1; p+=4; } while (p<=limit); state->v1 = v1; state->v2 = v2; state->v3 = v3; state->v4 = v4; } if (p < bEnd) { XXH_memcpy(state->mem32, p, bEnd-p); state->memsize = (int)(bEnd-p); } return XXH_OK; } XXH_errorcode XXH32_update (XXH32_state_t* state_in, const void* input, size_t len) { XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH32_update_endian(state_in, input, len, XXH_littleEndian); else return XXH32_update_endian(state_in, input, len, XXH_bigEndian); } FORCE_INLINE U32 XXH32_digest_endian (const XXH32_state_t* state_in, XXH_endianess endian) { const XXH_istate32_t* state = (const XXH_istate32_t*) state_in; const BYTE * p = (const BYTE*)state->mem32; const BYTE* bEnd = (const BYTE*)(state->mem32) + state->memsize; U32 h32; if (state->total_len >= 16) { h32 = XXH_rotl32(state->v1, 1) + XXH_rotl32(state->v2, 7) + XXH_rotl32(state->v3, 12) + XXH_rotl32(state->v4, 18); } else { h32 = state->seed + PRIME32_5; } h32 += (U32) state->total_len; while (p+4<=bEnd) { h32 += XXH_readLE32(p, endian) * PRIME32_3; h32 = XXH_rotl32(h32, 17) * PRIME32_4; p+=4; } while (p> 15; h32 *= PRIME32_2; h32 ^= h32 >> 
13; h32 *= PRIME32_3; h32 ^= h32 >> 16; return h32; } unsigned int XXH32_digest (const XXH32_state_t* state_in) { XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH32_digest_endian(state_in, XXH_littleEndian); else return XXH32_digest_endian(state_in, XXH_bigEndian); } FORCE_INLINE XXH_errorcode XXH64_update_endian (XXH64_state_t* state_in, const void* input, size_t len, XXH_endianess endian) { XXH_istate64_t * state = (XXH_istate64_t *) state_in; const BYTE* p = (const BYTE*)input; const BYTE* const bEnd = p + len; #ifdef XXH_ACCEPT_NULL_INPUT_POINTER if (input==NULL) return XXH_ERROR; #endif state->total_len += len; if (state->memsize + len < 32) /* fill in tmp buffer */ { XXH_memcpy(((BYTE*)state->mem64) + state->memsize, input, len); state->memsize += (U32)len; return XXH_OK; } if (state->memsize) /* some data left from previous update */ { XXH_memcpy(((BYTE*)state->mem64) + state->memsize, input, 32-state->memsize); { const U64* p64 = state->mem64; state->v1 += XXH_readLE64(p64, endian) * PRIME64_2; state->v1 = XXH_rotl64(state->v1, 31); state->v1 *= PRIME64_1; p64++; state->v2 += XXH_readLE64(p64, endian) * PRIME64_2; state->v2 = XXH_rotl64(state->v2, 31); state->v2 *= PRIME64_1; p64++; state->v3 += XXH_readLE64(p64, endian) * PRIME64_2; state->v3 = XXH_rotl64(state->v3, 31); state->v3 *= PRIME64_1; p64++; state->v4 += XXH_readLE64(p64, endian) * PRIME64_2; state->v4 = XXH_rotl64(state->v4, 31); state->v4 *= PRIME64_1; p64++; } p += 32-state->memsize; state->memsize = 0; } if (p+32 <= bEnd) { const BYTE* const limit = bEnd - 32; U64 v1 = state->v1; U64 v2 = state->v2; U64 v3 = state->v3; U64 v4 = state->v4; do { v1 += XXH_readLE64(p, endian) * PRIME64_2; v1 = XXH_rotl64(v1, 31); v1 *= PRIME64_1; p+=8; v2 += XXH_readLE64(p, endian) * PRIME64_2; v2 = XXH_rotl64(v2, 31); v2 *= PRIME64_1; p+=8; v3 += XXH_readLE64(p, endian) * PRIME64_2; v3 = XXH_rotl64(v3, 31); v3 *= PRIME64_1; p+=8; v4 += XXH_readLE64(p, endian) * PRIME64_2; v4 = XXH_rotl64(v4, 31); v4 *= PRIME64_1; p+=8; } while (p<=limit); state->v1 = v1; state->v2 = v2; state->v3 = v3; state->v4 = v4; } if (p < bEnd) { XXH_memcpy(state->mem64, p, bEnd-p); state->memsize = (int)(bEnd-p); } return XXH_OK; } XXH_errorcode XXH64_update (XXH64_state_t* state_in, const void* input, size_t len) { XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH64_update_endian(state_in, input, len, XXH_littleEndian); else return XXH64_update_endian(state_in, input, len, XXH_bigEndian); } FORCE_INLINE U64 XXH64_digest_endian (const XXH64_state_t* state_in, XXH_endianess endian) { const XXH_istate64_t * state = (const XXH_istate64_t *) state_in; const BYTE * p = (const BYTE*)state->mem64; const BYTE* bEnd = (const BYTE*)state->mem64 + state->memsize; U64 h64; if (state->total_len >= 32) { U64 v1 = state->v1; U64 v2 = state->v2; U64 v3 = state->v3; U64 v4 = state->v4; h64 = XXH_rotl64(v1, 1) + XXH_rotl64(v2, 7) + XXH_rotl64(v3, 12) + XXH_rotl64(v4, 18); v1 *= PRIME64_2; v1 = XXH_rotl64(v1, 31); v1 *= PRIME64_1; h64 ^= v1; h64 = h64*PRIME64_1 + PRIME64_4; v2 *= PRIME64_2; v2 = XXH_rotl64(v2, 31); v2 *= PRIME64_1; h64 ^= v2; h64 = h64*PRIME64_1 + PRIME64_4; v3 *= PRIME64_2; v3 = XXH_rotl64(v3, 31); v3 *= PRIME64_1; h64 ^= v3; h64 = h64*PRIME64_1 + PRIME64_4; v4 *= PRIME64_2; v4 = XXH_rotl64(v4, 31); v4 *= PRIME64_1; h64 ^= v4; h64 = h64*PRIME64_1 + PRIME64_4; } else { 
h64 = state->seed + PRIME64_5; } h64 += (U64) state->total_len; while (p+8<=bEnd) { U64 k1 = XXH_readLE64(p, endian); k1 *= PRIME64_2; k1 = XXH_rotl64(k1,31); k1 *= PRIME64_1; h64 ^= k1; h64 = XXH_rotl64(h64,27) * PRIME64_1 + PRIME64_4; p+=8; } if (p+4<=bEnd) { h64 ^= (U64)(XXH_readLE32(p, endian)) * PRIME64_1; h64 = XXH_rotl64(h64, 23) * PRIME64_2 + PRIME64_3; p+=4; } while (p> 33; h64 *= PRIME64_2; h64 ^= h64 >> 29; h64 *= PRIME64_3; h64 ^= h64 >> 32; return h64; } unsigned long long XXH64_digest (const XXH64_state_t* state_in) { XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN; if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT) return XXH64_digest_endian(state_in, XXH_littleEndian); else return XXH64_digest_endian(state_in, XXH_bigEndian); } traildb-0.6+dfsg1/src/xxhash/LICENSE.txt0000600000175000017500000000244213106440271017213 0ustar czchenczchenxxHash Library Copyright (c) 2012-2014, Yann Collet All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. traildb-0.6+dfsg1/src/xxhash/xxhash.h0000600000175000017500000001654013106440271017050 0ustar czchenczchen/* xxHash - Extremely Fast Hash algorithm Header File Copyright (C) 2012-2015, Yann Collet. BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. You can contact the author at : - xxHash source repository : https://github.com/Cyan4973/xxHash */ /* Notice extracted from xxHash homepage : xxHash is an extremely fast Hash algorithm, running at RAM speed limits. It also successfully passes all tests from the SMHasher suite. Comparison (single thread, Windows Seven 32 bits, using SMHasher on a Core 2 Duo @3GHz) Name Speed Q.Score Author xxHash 5.4 GB/s 10 CrapWow 3.2 GB/s 2 Andrew MumurHash 3a 2.7 GB/s 10 Austin Appleby SpookyHash 2.0 GB/s 10 Bob Jenkins SBox 1.4 GB/s 9 Bret Mulvey Lookup3 1.2 GB/s 9 Bob Jenkins SuperFastHash 1.2 GB/s 1 Paul Hsieh CityHash64 1.05 GB/s 10 Pike & Alakuijala FNV 0.55 GB/s 5 Fowler, Noll, Vo CRC32 0.43 GB/s 9 MD5-32 0.33 GB/s 10 Ronald L. Rivest SHA1-32 0.28 GB/s 10 Q.Score is a measure of quality of the hash function. It depends on successfully passing SMHasher test set. 10 is a perfect score. A 64-bits version, named XXH64, is available since r35. It offers much better speed, but for 64-bits applications only. Name Speed on 64 bits Speed on 32 bits XXH64 13.8 GB/s 1.9 GB/s XXH32 6.8 GB/s 6.0 GB/s */ #pragma once #if defined (__cplusplus) extern "C" { #endif /***************************** * Definitions *****************************/ #include /* size_t */ typedef enum { XXH_OK=0, XXH_ERROR } XXH_errorcode; /***************************** * Namespace Emulation *****************************/ /* Motivations : If you need to include xxHash into your library, but wish to avoid xxHash symbols to be present on your library interface in an effort to avoid potential name collision if another library also includes xxHash, you can use XXH_NAMESPACE, which will automatically prefix any symbol from xxHash with the value of XXH_NAMESPACE (so avoid to keep it NULL, and avoid numeric values). Note that no change is required within the calling program : it can still call xxHash functions using their regular name. They will be automatically translated by this header. 
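   For example (TDB_ being a hypothetical prefix), building with
   -DXXH_NAMESPACE=TDB_ makes the macros below emit the symbols TDB_XXH32,
   TDB_XXH64_update and so on, while callers keep using the unprefixed
   names.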
*/ #ifdef XXH_NAMESPACE # define XXH_CAT(A,B) A##B # define XXH_NAME2(A,B) XXH_CAT(A,B) # define XXH32 XXH_NAME2(XXH_NAMESPACE, XXH32) # define XXH64 XXH_NAME2(XXH_NAMESPACE, XXH64) # define XXH32_createState XXH_NAME2(XXH_NAMESPACE, XXH32_createState) # define XXH64_createState XXH_NAME2(XXH_NAMESPACE, XXH64_createState) # define XXH32_freeState XXH_NAME2(XXH_NAMESPACE, XXH32_freeState) # define XXH64_freeState XXH_NAME2(XXH_NAMESPACE, XXH64_freeState) # define XXH32_reset XXH_NAME2(XXH_NAMESPACE, XXH32_reset) # define XXH64_reset XXH_NAME2(XXH_NAMESPACE, XXH64_reset) # define XXH32_update XXH_NAME2(XXH_NAMESPACE, XXH32_update) # define XXH64_update XXH_NAME2(XXH_NAMESPACE, XXH64_update) # define XXH32_digest XXH_NAME2(XXH_NAMESPACE, XXH32_digest) # define XXH64_digest XXH_NAME2(XXH_NAMESPACE, XXH64_digest) #endif /***************************** * Simple Hash Functions *****************************/ unsigned int XXH32 (const void* input, size_t length, unsigned seed); unsigned long long XXH64 (const void* input, size_t length, unsigned long long seed); /* XXH32() : Calculate the 32-bits hash of sequence "length" bytes stored at memory address "input". The memory between input & input+length must be valid (allocated and read-accessible). "seed" can be used to alter the result predictably. This function successfully passes all SMHasher tests. Speed on Core 2 Duo @ 3 GHz (single thread, SMHasher benchmark) : 5.4 GB/s XXH64() : Calculate the 64-bits hash of sequence of length "len" stored at memory address "input". Faster on 64-bits systems. Slower on 32-bits systems. */ /***************************** * Advanced Hash Functions *****************************/ typedef struct { long long ll[ 6]; } XXH32_state_t; typedef struct { long long ll[11]; } XXH64_state_t; /* These structures allow static allocation of XXH states. States must then be initialized using XXHnn_reset() before first use. If you prefer dynamic allocation, please refer to functions below. */ XXH32_state_t* XXH32_createState(void); XXH_errorcode XXH32_freeState(XXH32_state_t* statePtr); XXH64_state_t* XXH64_createState(void); XXH_errorcode XXH64_freeState(XXH64_state_t* statePtr); /* These functions create and release memory for XXH state. States must then be initialized using XXHnn_reset() before first use. */ XXH_errorcode XXH32_reset (XXH32_state_t* statePtr, unsigned seed); XXH_errorcode XXH32_update (XXH32_state_t* statePtr, const void* input, size_t length); unsigned int XXH32_digest (const XXH32_state_t* statePtr); XXH_errorcode XXH64_reset (XXH64_state_t* statePtr, unsigned long long seed); XXH_errorcode XXH64_update (XXH64_state_t* statePtr, const void* input, size_t length); unsigned long long XXH64_digest (const XXH64_state_t* statePtr); /* These functions calculate the xxHash of an input provided in multiple smaller packets, as opposed to an input provided as a single block. XXH state space must first be allocated, using either static or dynamic method provided above. Start a new hash by initializing state with a seed, using XXHnn_reset(). Then, feed the hash state by calling XXHnn_update() as many times as necessary. Obviously, input must be valid, meaning allocated and read accessible. The function returns an error code, with 0 meaning OK, and any other value meaning there is an error. Finally, you can produce a hash anytime, by using XXHnn_digest(). This function returns the final nn-bits hash. 
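   A sketch of the streaming usage (error handling omitted; buf and len
   stand for any readable input):

       XXH64_state_t* state = XXH64_createState();
       XXH64_reset(state, 0);
       XXH64_update(state, buf, len);
       unsigned long long hash = XXH64_digest(state);
       XXH64_freeState(state);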
You can nonetheless continue feeding the hash state with more input, and therefore get some new hashes, by calling again XXHnn_digest(). When you are done, don't forget to free XXH state space, using typically XXHnn_freeState(). */ #if defined (__cplusplus) } #endif traildb-0.6+dfsg1/src/tdb_package.h0000600000175000017500000000121313106440271016455 0ustar czchenczchen #ifndef __TDB_PACKAGE_H__ #define __TDB_PACKAGE_H__ #include #include "tdb_internal.h" #include "tdb_error.h" #define TDB_TAR_MAGIC "TAR TOC FOR TDB VER 1\n" #define TOC_FILE_OFFSET 2560 /* = (len(HEADER_FILES) * 2 + 1) * 512 */ tdb_error cons_package(const tdb_cons *cons); tdb_error open_package(tdb *db, const char *root); void free_package(tdb *db); FILE *package_fopen(const char *fname, const char *root, const tdb *db); int package_fclose(FILE *f); int package_mmap(const char *fname, const char *root, struct tdb_file *dst, const tdb *db); #endif /* __TDB_PACKAGE_H__ */ traildb-0.6+dfsg1/src/tdb_huffman.c0000600000175000017500000002653713106440271016521 0ustar czchenczchen#define _DEFAULT_SOURCE #define _GNU_SOURCE #include #include #include #include #include #include #include "tdb_queue.h" #include "tdb_profile.h" #include "tdb_huffman.h" #include "tdb_error.h" #include "judy_128_map.h" #define MIN(a,b) ((a)>(b)?(b):(a)) struct hnode{ __uint128_t symbol; uint32_t code; uint32_t num_bits; uint64_t weight; struct hnode *left; struct hnode *right; }; struct sortpair{ __uint128_t key; Word_t value; }; static uint8_t bits_needed(uint64_t max) { uint64_t x = max; uint8_t bits = x ? 0: 1; while (x){ x >>= 1; ++bits; } return bits; } static int compare(const void *p1, const void *p2) { const struct sortpair *x = (const struct sortpair*)p1; const struct sortpair *y = (const struct sortpair*)p2; if (x->value > y->value) return -1; else if (x->value < y->value) return 1; return 0; } static void *sort_j128m_fun(__uint128_t key, Word_t *value, void *state) { struct sortpair *pair = (struct sortpair*)state; pair->key = key; pair->value = *value; return ++pair; } static struct sortpair *sort_j128m(const struct judy_128_map *j128m, uint64_t *num_items) { struct sortpair *pairs; *num_items = j128m_num_keys(j128m); if (!(pairs = calloc(*num_items, sizeof(struct sortpair)))) return NULL; if (*num_items == 0) return pairs; j128m_fold(j128m, sort_j128m_fun, pairs); qsort(pairs, *num_items, sizeof(struct sortpair), compare); return pairs; } static void allocate_codewords(struct hnode *node, uint32_t code, uint32_t depth) { if (node == NULL) return; if (depth < 16 && (node->right || node->left)){ allocate_codewords(node->left, code, depth + 1); allocate_codewords(node->right, code | (1U << depth), depth + 1); }else{ node->code = code; node->num_bits = depth; } } static struct hnode *pop_min_weight(struct hnode *symbols, uint32_t *num_symbols, struct tdb_queue *nodes) { const struct hnode *n = (const struct hnode*)tdb_queue_peek(nodes); if (!*num_symbols || (n && n->weight < symbols[*num_symbols - 1].weight)) return tdb_queue_pop(nodes); else if (*num_symbols) return &symbols[--*num_symbols]; return NULL; } static int huffman_code(struct hnode *symbols, uint32_t num) { struct tdb_queue *nodes = NULL; struct hnode *newnodes = NULL; uint32_t new_i = 0; if (!num) return 0; if (!(nodes = tdb_queue_new(num * 2))) return TDB_ERR_NOMEM; if (!(newnodes = malloc(num * sizeof(struct hnode)))){ tdb_queue_free(nodes); return TDB_ERR_NOMEM; } /* construct the huffman tree bottom up */ while (num || tdb_queue_length(nodes) > 1){ struct hnode *new = 
&newnodes[new_i++]; new->left = pop_min_weight(symbols, &num, nodes); new->right = pop_min_weight(symbols, &num, nodes); new->weight = (new->left ? new->left->weight: 0) + (new->right ? new->right->weight: 0); tdb_queue_push(nodes, new); } /* allocate codewords top down (depth-first) */ allocate_codewords(tdb_queue_pop(nodes), 0, 0); free(newnodes); tdb_queue_free(nodes); return 0; } static int sort_symbols(const struct judy_128_map *freqs, uint64_t *totalfreq, uint32_t *num_symbols, struct hnode *book) { uint64_t i; struct sortpair *pairs; uint64_t num; if (!(pairs = sort_j128m(freqs, &num))) return TDB_ERR_NOMEM; *totalfreq = 0; for (i = 0; i < num; i++) *totalfreq += pairs[i].value; *num_symbols = (uint32_t)(MIN(HUFF_CODEBOOK_SIZE, num)); for (i = 0; i < *num_symbols; i++){ book[i].symbol = pairs[i].key; book[i].weight = pairs[i].value; } free(pairs); return 0; } #ifdef TDB_DEBUG_HUFFMAN static void print_codeword(const struct hnode *node) { uint32_t j; for (j = 0; j < node->num_bits; j++) fprintf(stderr, "%u", (node->code & (1U << j) ? 1: 0)); } static void output_stats(const struct hnode *book, uint32_t num_symbols, uint64_t tot) { fprintf(stderr, "#codewords: %u\n", num_symbols); uint64_t cum = 0; uint32_t i; fprintf(stderr, "index) gramtype [field value] freq prob cum\n"); for (i = 0; i < num_symbols; i++){ long long unsigned int sym = book[i].symbol; long long unsigned int sym2 = sym >> 32; uint64_t f = book[i].weight; cum += f; fprintf(stderr, "%u) ", i); if (sym2 & 255){ fprintf(stderr, "bi [%llu %llu | %llu %llu] ", sym & 255, (sym >> 8) & ((1 << 24) - 1), sym2 & 255, sym2 >> 8); }else fprintf(stderr, "uni [%llu %llu] ", sym & 255, sym >> 8); fprintf(stderr, "%lu %2.3f %2.3f | ", f, 100. * (double)f / (double)tot, 100. * (double)cum / (double)tot); print_codeword(&book[i]); fprintf(stderr, "\n"); } } #endif static int make_codemap(struct hnode *nodes, uint32_t num_symbols, struct judy_128_map *codemap) { uint32_t i = num_symbols; while (i--){ if (nodes[i].num_bits){ /* TODO TDB_ERR_NOMEM handling */ Word_t *ptr = j128m_insert(codemap, nodes[i].symbol); *ptr = nodes[i].code | (nodes[i].num_bits << 16); } } return 0; } struct field_stats *huff_field_stats(const uint64_t *field_cardinalities, uint64_t num_fields, uint64_t max_timestamp_delta) { uint64_t i; struct field_stats *fstats; if (!(fstats = malloc(sizeof(struct field_stats) + num_fields * 4))) return NULL; fstats->field_id_bits = bits_needed(num_fields); fstats->field_bits[0] = bits_needed(max_timestamp_delta); for (i = 0; i < num_fields - 1; i++) fstats->field_bits[i + 1] = bits_needed(field_cardinalities[i]); return fstats; } int huff_create_codemap(const struct judy_128_map *gram_freqs, struct judy_128_map *codemap) { struct hnode *nodes; uint64_t total_freq; uint32_t num_symbols; int ret = 0; TDB_TIMER_DEF if (!(nodes = calloc(HUFF_CODEBOOK_SIZE, sizeof(struct hnode)))){ ret = TDB_ERR_NOMEM; goto done; } TDB_TIMER_START if ((ret = sort_symbols(gram_freqs, &total_freq, &num_symbols, nodes))) goto done; TDB_TIMER_END("huffman/sort_symbols") TDB_TIMER_START if ((ret = huffman_code(nodes, num_symbols))) goto done; TDB_TIMER_END("huffman/huffman_code") #ifdef TDB_DEBUG_HUFFMAN if (getenv("TDB_DEBUG_HUFFMAN")) output_stats(nodes, num_symbols, total_freq); #endif TDB_TIMER_START if ((ret = make_codemap(nodes, num_symbols, codemap))) goto done; TDB_TIMER_END("huffman/make_codemap") done: free(nodes); return ret; } static inline void encode_gram(const struct judy_128_map *codemap, __uint128_t gram, char *buf, uint64_t *offs, 
const struct field_stats *fstats) { const tdb_field field = tdb_item_field(HUFF_BIGRAM_TO_ITEM(gram)); const tdb_val value = tdb_item_val(HUFF_BIGRAM_TO_ITEM(gram)); const uint32_t literal_bits = 1 + fstats->field_id_bits + fstats->field_bits[field]; uint64_t huff_code, huff_bits; Word_t *ptr = j128m_get(codemap, gram); if (ptr){ /* codeword: prefix code by an up bit */ huff_code = 1U | (((uint32_t)HUFF_CODE(*ptr)) << 1U); huff_bits = HUFF_BITS(*ptr) + 1; } if (ptr && (HUFF_IS_BIGRAM(gram) || huff_bits < literal_bits)){ /* write huffman-coded codeword */ write_bits(buf, *offs, huff_code); *offs += huff_bits; }else if (HUFF_IS_BIGRAM(gram)){ /* non-huffman bigrams are encoded as two unigrams */ encode_gram(codemap, HUFF_BIGRAM_TO_ITEM(gram), buf, offs, fstats); encode_gram(codemap, HUFF_BIGRAM_OTHER_ITEM(gram), buf, offs, fstats); }else{ /* write literal: [0 (1 bit) | field (field_bits) | value (field_bits[field])] huff_encoded_size_max_bits() must match with the above definition in tdb_huffman.h */ write_bits(buf, *offs + 1, field); *offs += fstats->field_id_bits + 1; write_bits64(buf, *offs, value); *offs += fstats->field_bits[field]; } } void huff_encode_grams(const struct judy_128_map *codemap, const __uint128_t *grams, uint64_t num_grams, char *buf, uint64_t *offs, const struct field_stats *fstats) { uint64_t i = 0; for (i = 0; i < num_grams; i++) encode_gram(codemap, grams[i], buf, offs, fstats); } static void *create_codebook_fun(__uint128_t symbol, Word_t *value, void *state) { struct huff_codebook *book = (struct huff_codebook*)state; uint32_t code = HUFF_CODE(*value); uint32_t n = HUFF_BITS(*value); uint32_t j = 1U << (16 - n); while (j--){ uint32_t k = code | (j << n); book[k].symbol = symbol; book[k].bits = n; } return state; } struct huff_codebook *huff_create_codebook(const struct judy_128_map *codemap, uint32_t *size) { struct huff_codebook *book; *size = HUFF_CODEBOOK_SIZE * sizeof(struct huff_codebook); if (!(book = calloc(1, *size))) return NULL; j128m_fold(codemap, create_codebook_fun, book); return book; } /* this function converts old 64-bit symbols in v0 to new 128-bit symbols in v1 */ int huff_convert_v0_codebook(struct tdb_file *codebook) { const struct huff_codebook_v0{ uint64_t symbol; uint32_t bits; } __attribute__((packed)) *old = (const struct huff_codebook_v0*)codebook->data; uint64_t i; uint64_t size = HUFF_CODEBOOK_SIZE * sizeof(struct huff_codebook); struct huff_codebook *new; /* we want to allocate memory with mmap() and not malloc() so that tdb_file can be munmap()'ed as usual */ void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); if (p == MAP_FAILED) return TDB_ERR_NOMEM; new = (struct huff_codebook*)p; for (i = 0; i < HUFF_CODEBOOK_SIZE; i++){ /* extract the second part of the bigram */ __uint128_t gram = old[i].symbol >> 32; gram <<= 64; /* extract the first part of the bigram */ gram |= (old[i].symbol & UINT32_MAX); new[i].symbol = gram; new[i].bits = old[i].bits; } munmap(codebook->ptr, codebook->mmap_size); codebook->data = codebook->ptr = p; codebook->size = codebook->mmap_size = size; return 0; } traildb-0.6+dfsg1/src/tdb_encode_model.c0000600000175000017500000003413313106440271017501 0ustar czchenczchen #include #include #include #include #undef JUDYERROR #define JUDYERROR(CallerFile, CallerLine, JudyFunc, JudyErrno, JudyErrID) \ { \ if ((JudyErrno) == JU_ERRNO_NOMEM) \ goto out_of_memory; \ } #include #include "tdb_internal.h" #include "tdb_encode_model.h" #include "tdb_huffman.h" #include "tdb_error.h" #include 
"tdb_io.h" #define DSFMT_MEXP 521 #include "dsfmt/dSFMT.h" #define SAMPLE_SIZE (0.1 * RAND_MAX) #define RANDOM_SEED 238713 #define UNIGRAM_SUPPORT 0.00001 #define NUM_EVENTS_SAMPLING_THRESHOLD 1000000 #define INITIAL_GRAM_BUF_LEN (256 * 256) #define MIN(a,b) ((a)>(b)?(b):(a)) /* event op handles one *event* (not one trail) */ typedef int (*event_op)(const tdb_item *encoded, uint64_t n, const struct tdb_grouped_event *ev, void *state); struct ngram_state{ Pvoid_t candidates; struct judy_128_map ngram_freqs; Pvoid_t final_freqs; __uint128_t *grams; struct gram_bufs gbufs; }; static double get_sample_size(void) { /* TODO remove this env var */ double d = 0.1; if (getenv("TDB_SAMPLE_SIZE")){ char *endptr; d = strtod(getenv("TDB_SAMPLE_SIZE"), &endptr); if (*endptr || d < 0.01 || d > 1.0){ /* TODO fix this */ fprintf(stderr, "Invalid TDB_SAMPLE_SIZE"); d = 0.1; } } return d; } static tdb_error event_fold(event_op op, FILE *grouped, uint64_t num_events, const tdb_item *items, uint64_t num_fields, void *state) { dsfmt_t rand_state; tdb_item *prev_items = NULL; tdb_item *encoded = NULL; uint64_t encoded_size = 0; uint64_t i = 1; double sample_size = 1.0; struct tdb_grouped_event ev; int ret = 0; if (num_events == 0) return 0; dsfmt_init_gen_rand(&rand_state, RANDOM_SEED); /* enable sampling only if there is a large number of events */ if (num_events > NUM_EVENTS_SAMPLING_THRESHOLD) sample_size = get_sample_size(); if (!(prev_items = malloc(num_fields * sizeof(tdb_item)))){ ret = TDB_ERR_NOMEM; goto done; } rewind(grouped); TDB_READ(grouped, &ev, sizeof(struct tdb_grouped_event)); /* this function scans through *all* unencoded data, takes a sample of trails, edge-encodes events for a trail, and calls the given function (op) for each event */ while (i <= num_events){ /* NB: We sample trails, not events, below. We can't encode *and* sample events efficiently at the same time. If data is very unevenly distributed over trails, sampling trails will produce suboptimal results. We could compensate for this by always include all very long trails in the sample. 
*/ uint64_t n, trail_id = ev.trail_id; /* Always include the first trail so we don't end up empty */ if (i == 1 || dsfmt_genrand_close_open(&rand_state) < sample_size){ memset(prev_items, 0, num_fields * sizeof(tdb_item)); while (ev.trail_id == trail_id){ if ((ret = edge_encode_items(items, &encoded, &n, &encoded_size, prev_items, &ev))) goto done; if ((ret = op(encoded, n, &ev, state))) goto done; if (i++ < num_events){ TDB_READ(grouped, &ev, sizeof(struct tdb_grouped_event)); }else break; } }else{ /* given that we are sampling trails, we need to skip all events related to a trail not included in the sample */ for (;i < num_events && ev.trail_id == trail_id; i++) TDB_READ(grouped, &ev, sizeof(struct tdb_grouped_event)); } } done: free(encoded); free(prev_items); return ret; } static tdb_error alloc_gram_bufs(struct gram_bufs *b) { if (!(b->chosen = malloc(b->buf_len * 16))) return TDB_ERR_NOMEM; if (!(b->scores = malloc(b->buf_len * 8))) return TDB_ERR_NOMEM; return 0; } tdb_error init_gram_bufs(struct gram_bufs *b, uint64_t num_fields) { memset(b, 0, sizeof(struct gram_bufs)); if (num_fields){ if (!(b->covered = malloc(num_fields))) return TDB_ERR_NOMEM; b->buf_len = MIN(num_fields * num_fields, INITIAL_GRAM_BUF_LEN); b->num_fields = num_fields; return alloc_gram_bufs(b); } return 0; } void free_gram_bufs(struct gram_bufs *b) { free(b->chosen); free(b->scores); free(b->covered); } /* given a set of edge-encoded values (encoded), choose a set of unigrams and bigrams that cover the original set. In essence, this tries to solve Weigted Exact Cover Problem for the universe of 'encoded'. */ tdb_error choose_grams_one_event(const tdb_item *encoded, uint64_t num_encoded, const struct judy_128_map *gram_freqs, struct gram_bufs *g, __uint128_t *grams, uint64_t *num_grams, const struct tdb_grouped_event *ev) { uint64_t i, j, k, n = 0; Word_t *ptr; uint64_t unigram1 = ev->timestamp; int ret = 0; /* in the worst case we need O(num_fields^2) of memory but typically either num_fields is small or events are sparse, i.e. num_encoded << num_fields, so in practice these shouldn't take a huge amount of space */ if (g->buf_len < num_encoded * num_encoded){ free(g->scores); free(g->chosen); g->buf_len = num_encoded * num_encoded; if ((ret = alloc_gram_bufs(g))) return ret; } memset(g->covered, 0, g->num_fields); /* First, produce all candidate bigrams for this set. */ for (k = 0, i = 0; i < num_encoded; i++){ if (i > 0){ unigram1 = encoded[i]; j = i + 1; }else j = 0; for (;j < num_encoded; j++){ __uint128_t bigram = unigram1; bigram |= ((__uint128_t)encoded[j]) << 64; ptr = j128m_get(gram_freqs, bigram); if (ptr){ g->chosen[k] = bigram; g->scores[k++] = *ptr; } } } /* timestamp *must* be the first item in the list, add unigram as a placeholder - this may get replaced by a bigram below */ grams[n++] = ev->timestamp; /* Pick non-overlapping histograms, in the order of descending score. As we go, mark fields covered (consumed) in the set. 
*/ while (1){ uint64_t max_idx = 0; uint64_t max_score = 0; for (i = 0; i < k; i++) /* consider only bigrams whose both unigrams are non-covered */ if (!(g->covered[tdb_item_field(HUFF_BIGRAM_TO_ITEM(g->chosen[i]))] || g->covered[tdb_item_field(HUFF_BIGRAM_OTHER_ITEM(g->chosen[i]))]) && g->scores[i] > max_score){ max_score = g->scores[i]; max_idx = i; } if (max_score){ /* mark both unigrams of this bigram covered */ __uint128_t chosen = g->chosen[max_idx]; g->covered[tdb_item_field(HUFF_BIGRAM_TO_ITEM(chosen))] = 1; g->covered[tdb_item_field(HUFF_BIGRAM_OTHER_ITEM(chosen))] = 1; if (tdb_item_field(HUFF_BIGRAM_TO_ITEM(chosen))) grams[n++] = chosen; else /* make sure timestamp stays as the first item. This is safe since grams[0] was reserved above for the timestamp. */ grams[0] = chosen; }else /* all bigrams used */ break; } /* Finally, add all remaining unigrams to the result set which have not been covered by any bigrams */ for (i = 0; i < num_encoded; i++) if (!g->covered[tdb_item_field(encoded[i])]) grams[n++] = encoded[i]; *num_grams = n; return ret; } static tdb_error choose_grams(const tdb_item *encoded, uint64_t num_encoded, const struct tdb_grouped_event *ev, void *state){ struct ngram_state *g = (struct ngram_state*)state; uint64_t n; int ret = 0; if ((ret = choose_grams_one_event(encoded, num_encoded, &g->ngram_freqs, &g->gbufs, g->grams, &n, ev))) return ret; while (n--){ /* TODO fix this once j128m returns proper error codes */ Word_t *ptr = j128m_insert(g->final_freqs, g->grams[n]); if (ptr) ++*ptr; else return TDB_ERR_NOMEM; } return 0; } static tdb_error find_candidates(const Pvoid_t unigram_freqs, Pvoid_t *candidates0) { Pvoid_t candidates = NULL; Word_t idx = 0; Word_t *ptr; uint64_t num_values = 0; uint64_t support; /* find all unigrams whose probability of occurrence is greater than UNIGRAM_SUPPORT */ JLF(ptr, unigram_freqs, idx); while (ptr){ num_values += *ptr; JLN(ptr, unigram_freqs, idx); } support = num_values / (uint64_t)(1.0 / UNIGRAM_SUPPORT); idx = 0; JLF(ptr, unigram_freqs, idx); while (ptr){ int tmp; if (*ptr > support) J1S(tmp, candidates, idx); JLN(ptr, unigram_freqs, idx); } *candidates0 = candidates; return 0; out_of_memory: return TDB_ERR_NOMEM; } static tdb_error all_bigrams(const tdb_item *encoded, uint64_t n, const struct tdb_grouped_event *ev, void *state){ struct ngram_state *g = (struct ngram_state *)state; Word_t *ptr; int set; uint64_t i, j; uint64_t unigram1 = ev->timestamp; for (i = 0; i < n; i++){ if (i > 0){ unigram1 = encoded[i]; j = i + 1; }else j = 0; J1T(set, g->candidates, unigram1); if (set){ for (; j < n; j++){ uint64_t unigram2 = encoded[j]; J1T(set, g->candidates, unigram2); if (set){ __uint128_t bigram = unigram1; bigram |= ((__uint128_t)unigram2) << 64; ptr = j128m_insert(&g->ngram_freqs, bigram); if (ptr) ++*ptr; else return TDB_ERR_NOMEM; } } } } return 0; } tdb_error make_grams(FILE *grouped, uint64_t num_events, const tdb_item *items, uint64_t num_fields, const Pvoid_t unigram_freqs, struct judy_128_map *final_freqs, uint64_t no_bigrams) { struct ngram_state g = {.final_freqs = final_freqs}; Word_t tmp; int ret = 0; TDB_TIMER_DEF j128m_init(&g.ngram_freqs); if ((ret = init_gram_bufs(&g.gbufs, num_fields))) goto done; if (!(g.grams = malloc(num_fields * 16))){ ret = TDB_ERR_NOMEM; goto done; } /* below is a very simple version of the Apriori algorithm for finding frequent sets (bigrams) */ /* find unigrams that are sufficiently frequent */ TDB_TIMER_START if ((ret = find_candidates(unigram_freqs, &g.candidates))) goto done; 
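    /* g.candidates now holds the unigrams frequent enough to pass
       UNIGRAM_SUPPORT; when bigrams are enabled, all_bigrams() below only
       counts bigrams whose two halves are both in this candidate set
       (Apriori-style pruning of the candidate space) */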
TDB_TIMER_END("encode_model/find_candidates") /* collect frequencies of *all* occurring bigrams of candidate unigrams */ if (!no_bigrams) { TDB_TIMER_START ret = event_fold(all_bigrams, grouped, num_events, items, num_fields, &g); if (ret) goto done; TDB_TIMER_END("encode_model/all_bigrams") } /* TODO: choose_grams below could also be optimized when !no_bigrams is true. */ /* collect frequencies of non-overlapping bigrams and unigrams (exact covering set for each event), store in final_freqs */ TDB_TIMER_START ret = event_fold(choose_grams, grouped, num_events, items, num_fields, &g); if (ret) goto done; TDB_TIMER_END("encode_model/choose_grams") done: #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wsign-compare" J1FA(tmp, g.candidates); #pragma GCC diagnostic pop j128m_free(&g.ngram_freqs); free_gram_bufs(&g.gbufs); free(g.grams); return ret; out_of_memory: return TDB_ERR_NOMEM; } struct unigram_state{ Pvoid_t freqs; }; static tdb_error all_freqs(const tdb_item *encoded, uint64_t n, const struct tdb_grouped_event *ev, void *state){ struct unigram_state *s = (struct unigram_state*)state; Word_t *ptr; while (n--){ JLI(ptr, s->freqs, encoded[n]); ++*ptr; } /* include frequencies for timestamp deltas */ JLI(ptr, s->freqs, ev->timestamp); ++*ptr; return 0; out_of_memory: return TDB_ERR_NOMEM; } Pvoid_t collect_unigrams(FILE *grouped, uint64_t num_events, const tdb_item *items, uint64_t num_fields) { /* calculate frequencies of all items */ struct unigram_state state = {.freqs = NULL}; if (event_fold(all_freqs, grouped, num_events, items, num_fields, &state)) return NULL; else return state.freqs; } traildb-0.6+dfsg1/util/0000700000175000017500000000000013106440271014247 5ustar czchenczchentraildb-0.6+dfsg1/util/traildb_bench.c0000600000175000017500000003005013106440271017173 0ustar czchenczchen#define _DEFAULT_SOURCE #include #include #include #include #include #include "traildb.h" #include "tdb_profile.h" #define REPORT_ERROR(fmt, ...) \ do { \ fprintf(stderr, (fmt), ##__VA_ARGS__); \ } while(0) #define REPORT_WARNING(fmt, ...) \ do { \ fprintf(stderr, ("WARNING: " fmt), ##__VA_ARGS__); \ } while(0) #define TIMED(msg, err, stmt) \ do { \ TDB_TIMER_DEF; \ TDB_TIMER_START; \ (err) = (stmt); \ TDB_TIMER_END("traildb_bench/" msg); \ } while(0) /** * super dirty hack to make the compiler happy * * also: https://www.youtube.com/watch?v=1dV_6EtfvkM */ static inline const char** const_quirk(char** argv) { void* p; memcpy(&p, &argv, sizeof(argv)); return p; } static void dump_hex(const char* raw, uint64_t length) { for(unsigned int i = 0; i+4 <= length; i += 4) { printf("%02x%02x%02x%02x ", raw[i+0], raw[i+1], raw[i+2], raw[i+3]); } for(uint64_t i = length - (length % 4); i < length; ++i) { printf("%02x", raw[i]); } } /** * calls tdb_get_trail over the full tdb */ static int do_get_all_and_decode(const tdb* db, const char* path, tdb_field* ids, unsigned int ids_length) { tdb_error err; tdb_cursor* const c = tdb_cursor_new(db); assert(c); uint64_t items_decoded = 0; const uint64_t num_trails = tdb_num_trails(db); for(uint64_t trail_id = 0; trail_id < num_trails; ++trail_id) { err = tdb_get_trail(c, trail_id); if(err) { REPORT_ERROR("%s: failed to extract trail %" PRIu64 ". 
error=%i\n", path, trail_id, err); goto err; } const tdb_event* e; while((e = tdb_cursor_next(c))) { for(unsigned int j = 0; j < ids_length; ++j) { uint64_t dummy; const tdb_item item = e->items[ids[j]]; (void)tdb_get_item_value(db, item, &dummy); ++items_decoded; } } } printf("# items decoded: %" PRIu64 "\n", items_decoded); err: tdb_cursor_free(c); return err; } static int cmd_get_all_and_decode(char** dbs, int argc) { for(int i = 0; i < argc; ++i) { const char* path = dbs[i]; tdb* db = tdb_init(); int err = tdb_open(db, path); if(err) { printf("Error code %i while opening TDB at %s\n", err, path); return 1; } const unsigned int nfields = (unsigned)tdb_num_fields(db) - 1u; tdb_field ids[nfields]; for (unsigned int field_id = 0; field_id < nfields; ++field_id) ids[field_id] = field_id; TIMED("get_all", err, do_get_all_and_decode(db, path, ids, nfields)); tdb_close(db); } return 0; } static int resolve_fieldids(tdb_field** field_ids, const tdb* db, const char** field_names, int names_length) { tdb_field* out = malloc((unsigned)names_length * sizeof(tdb_field)); assert(field_ids != NULL); for(int i = 0; i < names_length; ++i) { const int err = tdb_get_field(db, field_names[i], &out[i]); if(err) { REPORT_ERROR("Could not find field name %s\n", field_names[i]); goto err; } } *field_ids = out; return 0; err: free(out); return -1; } static int cmd_decode(const char* path, const char** field_names, int names_length) { tdb* db = tdb_init(); assert(db); tdb_error err = tdb_open(db, path); if(err) { REPORT_ERROR("Failed to open TDB. error=%i\n", err); return 1; } tdb_field* ids; err = resolve_fieldids(&ids, db, field_names, names_length); if(err) goto out; /** actual field names (except for timestamp) */ for(int i = 0; i < names_length; ++i) { assert(ids[i] > 0 && "reading from timestamp column 0 not supported"); --ids[i]; } TIMED("cmd_decode", err, do_get_all_and_decode(db, path, ids, (unsigned)names_length)); out: tdb_close(db); return err; } static int do_recode(tdb_cons* const cons, tdb* const db, tdb_field* const field_ids, const int num_fieldids) { tdb_error err; const uint64_t num_fields = tdb_num_fields(db); const uint64_t num_trails = tdb_num_trails(db); const char** const values = calloc(num_fields, sizeof(char*)); uint64_t* const value_lengths = calloc(num_fields, sizeof(uint64_t)); tdb_cursor* const c = tdb_cursor_new(db); assert(values); assert(value_lengths); assert(c); /* * for each trail * for each "record" in the timeline * extract/decode the relevant fields that are in field_ids * insert the record */ for(uint64_t trail_id = 0; trail_id < num_trails; ++trail_id) { err = tdb_get_trail(c, trail_id); if(err) { REPORT_ERROR("Failed to get trail (trail_id=%" PRIu64 "). error=%i\n", trail_id, err); goto free_mem; } const tdb_event* e; for(uint64_t ev_id = 0; (e = tdb_cursor_next(c)); ++ev_id) { /* extract step */ for(int field = 0; field < num_fieldids; ++field) { assert(0 < field_ids[field]); assert(field_ids[field] <= e->num_items); /** field ids start by '1' as column 0 historically is the timestamp column. With tdb_event, the actual fields are 0-indexed */ values[field] = tdb_get_item_value( db, e->items[field_ids[field] - 1], &value_lengths[field]); } err = tdb_cons_add(cons, tdb_get_uuid(db, trail_id), e->timestamp, values, value_lengths); if(err) { REPORT_ERROR("Failed to append record (trail_id=%" PRIu64 ", ev_id=%" PRIu64 "). 
error=%i\n", trail_id, ev_id, err); goto free_mem; } } } free_mem: tdb_cursor_free(c); free(value_lengths); free(values); return err; } /** * copies a subset of data from one DB into another one. The subset is * given by field names */ static int cmd_recode(const char* output_path, const char* input, const char** field_names, int names_length) { assert(names_length > 0); tdb* const db = tdb_init(); assert(db); int err = tdb_open(db, input); if(err) { REPORT_ERROR("Failed to open TDB. error=%i\n", err); return 1; } tdb_field* field_ids; err = resolve_fieldids(&field_ids, db, field_names, names_length); if(err < 0) { goto free_tdb; } tdb_cons* const cons = tdb_cons_init(); assert(cons); err = tdb_cons_open(cons, output_path, field_names, (unsigned)names_length); if(err) { REPORT_ERROR("Failed to create TDB cons. error=%i\n", err); goto free_ids; } TIMED("recode", err, do_recode(cons, db, field_ids, names_length)); if(err) goto close_cons; err = tdb_cons_finalize(cons); if(err) { REPORT_ERROR("Failed to finalize output DB. error=%i\n", err); } close_cons: tdb_cons_close(cons); free_ids: free(field_ids); free_tdb: tdb_close(db); return err; } static const char** duplicate_fieldids(const tdb* db) { const uint64_t num_fields = tdb_num_fields(db); assert(num_fields <= TDB_MAX_NUM_FIELDS); const char** fieldids = malloc(num_fields * sizeof(char*)); for(tdb_field i = 1; i < num_fields; ++i) { char* field_id = strdup(tdb_get_field_name(db, i)); assert(field_id); fieldids[i-1] = field_id; } return fieldids; } static int cmd_append_all(const char* output_path, const char* input) { tdb* db = tdb_init(); assert(db); int err = tdb_open(db, input); if(err) { REPORT_ERROR("Failed to open TDB. error=%i\n", err); return 1; } const uint64_t num_fields = tdb_num_fields(db) - 1; const char** field_ids = duplicate_fieldids(db); tdb_cons* cons = tdb_cons_init(); assert(cons); err = tdb_cons_open(cons, output_path, field_ids, num_fields); if(err) { REPORT_ERROR("Failed to create TDB cons. error=%i\n", err); goto free_fieldids; } TIMED("tdb_cons_append()", err, tdb_cons_append(cons, db)); if(err) { REPORT_ERROR("Failed to append DB. error=%i\n", err); goto close_cons; } err = tdb_cons_finalize(cons); if(err) { REPORT_ERROR("Failed to finalize output DB. error=%i\n", err); goto close_cons; } printf("Successfully converted / rewritten DB.\n"); close_cons: tdb_cons_close(cons); free_fieldids: /* to make the compiler not complain about casting const'ness away, let's pull out this small, dirty trick */ for(uint64_t i = 0; i < num_fields; ++i) { void* make_compiler_happy; memcpy(&make_compiler_happy, field_ids + i, sizeof(void*)); free(make_compiler_happy); } free(field_ids); tdb_close(db); return err ? 
1 : 0; } static void dump_trail(const tdb* db, const uint8_t* uuid, tdb_cursor* c) { #define HEX4 "%02x%02x%02x%02x" printf("cookie " HEX4 HEX4 HEX4 HEX4 "\n", uuid[0], uuid[1], uuid[2], uuid[3], uuid[4], uuid[5], uuid[6], uuid[7], uuid[8], uuid[9], uuid[10], uuid[11], uuid[12], uuid[13], uuid[14], uuid[15]); #undef HEX4 const tdb_event* e; while((e = tdb_cursor_next(c))) { printf("ts=%" PRIu64 ":\n", e->timestamp); for(uint64_t i = 0; i < e->num_items; ++i) { const char* name = tdb_get_field_name( db, tdb_item_field(e->items[i])); uint64_t v_len = 0; const char* v = tdb_get_item_value( db, e->items[i], &v_len); printf(" %s=", name); dump_hex(v, v_len); putchar('\n'); } putchar('\n'); } } static int cmd_dump(const char* db_path) { tdb* db = tdb_init(); assert(db); int err = tdb_open(db, db_path); if(err) { REPORT_ERROR("Failed to open TDB at directory %s. error=%i\n", db_path, err); return 1; } tdb_cursor* const c = tdb_cursor_new(db); const uint64_t num_trails = tdb_num_trails(db); for(uint64_t trail_id = 0; trail_id < num_trails; ++trail_id) { err = tdb_get_trail(c, trail_id); if(err) { REPORT_ERROR("Failed to decode trail %" PRIu64 ". error=%i\n", trail_id, err); goto out; } dump_trail(db, tdb_get_uuid(db, trail_id), c); putchar('\n'); } out: tdb_cursor_free(c); tdb_close(db); return err; } static int cmd_info(const char* db_path) { tdb* db = tdb_init(); assert(db); int err = tdb_open(db, db_path); if(err) { REPORT_ERROR("Failed to open TDB at directory %s. error=%i\n", db_path, err); return 1; } printf("DB at %s:\n" " version: %" PRIu64 "\n" " #trails: %" PRIu64 "\n" " #events: %" PRIu64 "\n" " #fields: %" PRIu64 "\n" "\n" " min timestamp: %" PRIu64 "\n" " max timestamp: %" PRIu64 "\n", db_path, tdb_version(db), tdb_num_trails(db), tdb_num_events(db), tdb_num_fields(db), tdb_min_timestamp(db), tdb_max_timestamp(db)); printf("\nColumns: \n"); printf(" field[00] = %s (implicit)\n", tdb_get_field_name(db, 0 )); for(unsigned fid = 1; fid < tdb_num_fields(db); ++fid) { printf(" field[%02u] = %s\n", fid, tdb_get_field_name(db, fid)); } tdb_close(db); return err; } static void print_help(void) { printf( "Usage: traildb_bench [*]\n" "\n" "Available commands:\n" " decode-all *\n" " :: iterates over the complete DB, decoding\n" " every value encountered\n" " decode +\n" " :: iterates over the complete DB, decoding\n" " the values of each given field\n" " append-all \n" " :: copies data from one DB into a new or\n" " existing database. 
Has multiple use-cases,\n" " such as: converting DB formats, merging DBs, etc..\n" " Assumes DBs have identical fieldsets.\n" " recode +\n" " :: copies the given (sub)set of columns from the DB at\n" " /input path/ into /output path/\n" " info \n" " :: displays information on a TDB\n" " dump \n" " :: dumps contents of a traildb in a most primitive way.\n" ); } #define IS_CMD(cmd, nargs) (strcmp(command, (cmd)) == 0 && argc >= (nargs + 2)) int main(int argc, char** argv) { if(argc < 2) { print_help(); return 1; } const char* const command = argv[1]; const char** const cargv = const_quirk(argv); if(IS_CMD("decode-all", 1)) { return cmd_get_all_and_decode(argv + 2, argc - 2); } else if(IS_CMD("decode", 2)) { return cmd_decode(argv[2], cargv + 3, argc - 3); } else if(IS_CMD("append-all", 2)) { return cmd_append_all(argv[2], argv[3]); } else if(IS_CMD("recode", 3)) { return cmd_recode(argv[2], argv[3], cargv + 4, argc - 4); } else if(IS_CMD("info", 1)) { return cmd_info(argv[2]); } else if(IS_CMD("dump", 1)) { return cmd_dump(argv[2]); } else { print_help(); return 1; } return 0; } traildb-0.6+dfsg1/test.tdb0000600000175000017500000502600013106440271014751 0ustar czchenczchenversion0000644000000000000000000000000100000000000007304 0ustar 1info0000644000000000000000000000003500000000000006561 0ustar 2 2 1463696903 1463696952 49 tar.toc0000644000000000000000000000105500000000000007203 0ustar TAR TOC FOR TDB VER 1 version 512 1 info 1536 29 tar.toc 2560 557 lexicon.first_field 4096 23 lexicon.second_field 5120 27 fields 6144 26 trails.codebook 7168 1310720 trails.toc 1318400 12 trails.data 1319424 12 uuids 1320448 32 lexicon.first_field0000644000000000000000000000002700000000000011561 0ustar helloitlexicon.second_field0000644000000000000000000000003300000000000011702 0ustar worldworks!fields0000644000000000000000000000003200000000000007071 0ustar first_field second_field trails.codebook0000644000000000000000000500000000000000000010705 0ustar 
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111trails.toc0000644000000000000000000000001400000000000007705 0ustar trails.data0000644000000000000000000000001400000000000010031 0ustar ïÏuuids0000644000000000000000000000004000000000000006753 0ustar  traildb-0.6+dfsg1/doc/0000700000175000017500000000000013106440271014037 5ustar czchenczchentraildb-0.6+dfsg1/doc/docs/0000700000175000017500000000000013106440271014767 5ustar czchenczchentraildb-0.6+dfsg1/doc/docs/tutorial.md0000600000175000017500000002322313106440271017160 0ustar czchenczchen This tutorial expects that you have TrailDB installed and working. If you haven't installed TrailDB yet, see [Getting Started](getting_started) for instructions. # Part I: Create a simple TrailDB In this example, we will create a tiny TrailDB that includes events from three users. You can find the full Python source code in the [traildb-python](https://github.com/traildb/traildb-python/tree/master/examples/tutorial_simple_traildb.py) repo and the C source in the [main traildb repo](https://github.com/traildb/traildb/tree/master/examples/tutorial_simple_traildb.c). Note that opening a new TrailDB constructor fails if there is an existing TrailDB with the same name. If you run this example multiple times, you should delete the `tiny` directory, which may contain partial results, and `tiny.tdb` before running the example. 
First, let's create a new constructor that we will use to populate the TrailDB. The TrailDB will have two fields, `username` and `action`, which we will specify when creating the constructor.
```python
from traildb import TrailDBConstructor, TrailDB
from uuid import uuid4
from datetime import datetime

cons = TrailDBConstructor('tiny', ['username', 'action'])
```
```C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <traildb.h>

int main(int argc, char **argv) {
    const char *fields[] = {"username", "action"};
    tdb_error err;
    tdb_cons* cons = tdb_cons_init();

    if ((err = tdb_cons_open(cons, "tiny", fields, 2))){
        printf("Opening TrailDB constructor failed: %s\n", tdb_error_str(err));
        exit(1);
    }
```
Now we can populate the TrailDB with events. We are going to create three dummy users, each of which will have three events. Note that the primary key identifying the user is a [UUID](http://en.wikipedia.org/wiki/UUID). We can use the `uuid` module in Python to generate UUIDs, or you can create your own identifiers like the C code does.
```python
for i in range(3):
    uuid = uuid4().hex
    username = 'user%d' % i
    for day, action in enumerate(['open', 'save', 'close']):
        cons.add(uuid, datetime(2016, i + 1, day + 1), (username, action))
```
```C
static char username[6];
static uint8_t uuid[16];
const char *EVENTS[] = {"open", "save", "close"};
uint32_t i, j;

/* create three users */
for (i = 0; i < 3; i++){
    memcpy(uuid, &i, 4);
    sprintf(username, "user%d", i);

    /* every user has three events */
    for (j = 0; j < 3; j++){
        const char *values[] = {username, EVENTS[j]};
        uint64_t lengths[] = {strlen(username), strlen(EVENTS[j])};
        /* generate a dummy timestamp */
        uint64_t timestamp = i * 10 + j;

        if ((err = tdb_cons_add(cons, uuid, timestamp, values, lengths))){
            printf("Adding an event failed: %s\n", tdb_error_str(err));
            exit(1);
        }
    }
}
```
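The C example above uses dummy integer timestamps. Timestamps in TrailDB are plain 64-bit integers, usually Unix time, so as a variation on the example (an illustration, not part of the original tutorial code) you could take them from the wall clock instead:

```C
#include <time.h>

/* illustrative variant: use the current Unix time as the event timestamp
   instead of the dummy value i * 10 + j used above */
uint64_t timestamp = (uint64_t)time(NULL);
```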
Once you are done adding events to the TrailDB, you have to finalize it. Finalization takes care of compacting the events and creating a valid TrailDB file.
```python
cons.finalize()
```
```C
if ((err = tdb_cons_finalize(cons))){
    printf("Closing TrailDB constructor failed: %s\n", tdb_error_str(err));
    exit(1);
}

tdb_cons_close(cons);
```
You can check the contents of the new TrailDB with the `tdb` tool by running `tdb dump -i tiny`. You can also print its contents easily using the API:
```python
for uuid, trail in TrailDB('tiny').trails():
    print uuid, list(trail)
```
```C
tdb* db = tdb_init();
if ((err = tdb_open(db, "tiny"))){
    printf("Opening TrailDB failed: %s\n", tdb_error_str(err));
    exit(1);
}

tdb_cursor *cursor = tdb_cursor_new(db);

/* loop over all trails */
for (i = 0; i < tdb_num_trails(db); i++){
    const tdb_event *event;
    uint8_t hexuuid[32];

    tdb_uuid_hex(tdb_get_uuid(db, i), hexuuid);
    printf("%.32s ", hexuuid);

    tdb_get_trail(cursor, i);

    /* loop over all events of this trail */
    while ((event = tdb_cursor_next(cursor))){
        printf("[ timestamp=%llu", event->timestamp);
        for (j = 0; j < event->num_items; j++){
            uint64_t len;
            const char *val = tdb_get_item_value(db, event->items[j], &len);
            printf(" %s=%.*s", fields[j], len, val);
        }
        printf(" ] ");
    }
    printf("\n");
}
```
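Besides looping over every trail, you can also fetch a single trail directly by its UUID. The C sketch below assumes `tdb_get_trail_id()`, the UUID-to-trail-ID lookup described in the technical overview, and reuses the `db` and `cursor` handles from the code above; the all-zero UUID is the identifier the C construction code generates for its first user.

```C
/* sketch: look up one trail by UUID instead of scanning all trails */
uint8_t first_uuid[16] = {0}; /* UUID of 'user0' created by the C code above */
uint64_t trail_id;

if (tdb_get_trail_id(db, first_uuid, &trail_id) == 0){
    uint64_t num_events = 0;
    tdb_get_trail(cursor, trail_id);
    while (tdb_cursor_next(cursor))
        ++num_events;
    printf("user0 has %llu events\n", (unsigned long long)num_events);
}
```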
That's it! You can easily extend this example for creating TrailDBs based on event sources of your own. # Part II: Analyze a large TrailDB of Wikipedia edits Wikipedia provides a [database dump](https://dumps.wikimedia.org/enwiki/) of the full edit history of Wikipedia pages. This is a treasure trove of data that can be used to analyze, for instance, behavior of individual contributors or edit history of individual pages. We [converted the 50GB compressed dump to a TrailDB](https://github.com/traildb/traildb-python/tree/master/examples/parse_wikipedia_history.py). For this tutorial, you should download the pre-made TrailDB. Two versions are provided: - [wikipedia-history.tdb](http://traildb.io/data/wikipedia-history.tdb) contains the full edit history of Wikipedia between January 2001 and May 2016. This TrailDB contains trails for 44M contributors, covering 663M edit actions. The size of the file is 5.8GB. - [wikipedia-history-small.tdb](http://traildb.io/data/wikipedia-history-small.tdb) contains a random sample of 1% contributors (103MB). If you are curious, [this script](https://github.com/traildb/traildb-python/tree/master/examples/extract_sample.py) was used to produce a random extract of the full TrailDB. First, you should download the smaller snapshot above, `wikipedia-history-small.tdb`, which allows you to verify quickly that the code works. Python is convenient for small and medium-scale analysis but it tends to be slow with larger amounts of data. For analyzing the full `wikipedia-history.tdb`, we recommend that you use C, D, Go or Haskell bindings of TrailDB. #### Number of sessions by contributor Trails in the Wikipedia TrailDBs include all edit actions of each Wikipedia contributor. Contributors include both anonymous contributors who are identified by the IP address (field `ip`) and registered contributors who have a username (field `user`). Each event includes also a `title` of the page that was edited and the timestamp of the edit action. To measure contributor activity, it is useful to count the number of *edit sessions*, in addition to the raw number of edits. We define a *session* as a sequence of actions where actions are at most 30 minutes apart, similar to how sessions are defined in web analytics. Counting the number of sessions by contributor is easy with TrailDB. You can find the full Python source code in the [traildb-python](https://github.com/traildb/traildb-python/tree/master/examples/tutorial_wikipedia_sessions.py) repo and the C source in the [main traildb repo](https://github.com/traildb/traildb/tree/master/examples/tutorial_wikipedia_sessions.c).
```python
def sessions(tdb):
    for i, (uuid, trail) in enumerate(tdb.trails(only_timestamp=True)):
        prev_time = trail.next()
        num_events = 1
        num_sessions = 1

        for timestamp in trail:
            if timestamp - prev_time > SESSION_LIMIT:
                num_sessions += 1
            prev_time = timestamp
            num_events += 1

        print 'Trail[%d] Number of Sessions: %d Number of Events: %d' %\
            (i, num_sessions, num_events)
```
```C
tdb_cursor *cursor = tdb_cursor_new(db);
uint64_t i;

for (i = 0; i < tdb_num_trails(db); i++){
    const tdb_event *event;

    tdb_get_trail(cursor, i);
    event = tdb_cursor_next(cursor);

    uint64_t prev_time = event->timestamp;
    uint64_t num_sessions = 1;
    uint64_t num_events = 1;

    while ((event = tdb_cursor_next(cursor))){
        if (event->timestamp - prev_time > SESSION_LIMIT)
            ++num_sessions;
        prev_time = event->timestamp;
        ++num_events;
    }

    printf("Trail[%llu] Number of Sessions: %llu Number of Events: %llu\n",
           i, num_sessions, num_events);
}
```
The code loops over all trails and measures the time between actions. If the time exceeds 30 minutes, we increment the session counter. Note that the Python code sets `only_timestamp=True` which makes the cursor return only timestamps instead of the full events. This is a performance optimization that removes unnecessary allocations in the inner loop which are particularly expensive in Python. The code outputs the number of sessions and the number of events for each contributor. We can plot a histogram of the results: Unsurprisingly, the vast majority of contributors have only one session. However, there is a very long tail of contributors who have over 200 sessions. Not all contributors are human beings. There are a number of benevolent bots making routine edits in Wikipedia, such as maintaining basic statistics. In fact, in `wikipedia-history.tdb` you can find over 4500 users whose name ends with `bot`. As a fun follow up exercise, you can write a script that tries to detect bots based on their behavior that is often very characteristic and easy to distinguish from human contributors. traildb-0.6+dfsg1/doc/docs/technical_overview.md0000600000175000017500000003105513106440271021177 0ustar czchenczchen# Data Model traildb : - TrailDB is a collection of **trails**. trail : - A **trail** is identified by a user-defined [128-bit UUID](http://en.wikipedia.org/wiki/UUID) as well as an automatically assigned trail ID. - A **trail** consists of a sequence of **events**, ordered by time. event : - An **event** corresponds to an event in time, related to an UUID. - An **event** has a 64-bit integer timestamp. - An **event** consists of a pre-defined set of **fields**. field : - Each TrailDB follows a schema that is a list of fields. - A **field** consists of a set of values. In a relational database, **UUID** would be the primary key, **event** would be a row, and **fields** would be the columns. The combination of a field ID and a value of the field is represented as an **item** in the [C API](api). Items are encoded as 64-bit integers, so they can be manipulated efficiently. See [Working with items, fields, and values](api/#working-with-items-fields-and-values) for detailed API documentation. # Performance Best Practices ###### Constructing a TrailDB is more expensive than reading it [Constructing a new TrailDB](api/#traildb-construction) consumes CPU, memory, and temporary disk space roughly O(N) amount where N is the amount of raw data. It is typical to separate the relatively expensive construction phase from processing so enough resources can be dedicated to it. ###### Use tdb_cons_append for merging TrailDBs In some cases it makes sense to construct smaller TrailDB shards and later merge them to larger TrailDBs using the [`tdb_cons_append()`](api/#tdb_cons_append) function. Using this function is more efficient than looping over an existing TrailDB and creating a new one using [`tdb_cons_add()`](api/#tdb_cons_add). ###### Mapping strings to items is a relatively slow O(L) operation Mapping a string to an item using [`tdb_get_item()`](api/#tdb_get_item) is an O(L) operation where L is the number of distinct values in the field. The inverse operation, mapping an item to a string using [`tdb_get_item_value()`](api/#tdb_get_item_value) is a fast O(1) operation. ###### Mapping an UUID to a trail ID is a relatively fast O(log N) operation Mapping an UUID to a trail ID using [`tdb_get_trail_id()`](api/#tdb_get_trail_id) is a relatively fast O(log N) operation where N is the number of trails. 
The inverse operation, mapping a trail ID to UUID using [`tdb_get_uuid()`](api/#tdb_get_uuid) is a fast O(1) operation. ###### Use multiple `tdb` handles for thread-safety A `tdb` handle returned by [`tdb_init()`](api/#tdb_init) is not thread-safe. You need to call [`tdb_init()`](api/#tdb_init) and [`tdb_open()`](api/#tdb_open) in each thread separately. The good news is that these operations are very cheap and data is shared, so the memory overhead is negligible. ###### Cursors are cheap Cursors are cheap to create with [`tdb_cursor_new()`](api/#tdb_cursor_new). The only overhead is a small internal lookahead buffer whose size can be controlled with the `TDB_OPT_CURSOR_EVENT_BUFFER_SIZE` option, if you need to maintain a very large number of parallel cursors. ###### TrailDBs larger than the available RAM A typical situation is that you have a large amount of SSD (disk) space compared to the amount of available RAM. If the size of the TrailDBs you need to process exceeds the amount of RAM, performance may suffer or you may need to complicate the application logic by opening only a subset of TrailDBs at once. An alternative solution is to open all TrailDBs using [`tdb_open()`](api/#tdb_open) as usual, which doesn't consume much memory per se, process some of the data, and then tag inactive TrailDBs with [`tdb_dontneed()`](api/#tdb_dontneed) which signals to the operating system that the memory can be paged. When TrailDBs are needed again, you can call [`tdb_willneed()`](api/#tdb_willneed). A benefit of this approach compared to opening and closing TrailDBs explicitly is that all cursors, pointers etc. stay valid, so they can be kept in memory without complex resource management. ###### Conserve memory with 32-bit items A core concept in the TrailDB API is an *item*, represented by a 64-bit integer. You can use these items in your own application, outside TrailDB, to make data structures simpler and processing faster, compared to using strings. Only when really needed, you can convert items back to strings using e.g. [`tdb_get_item_value()`](api/#tdb_get_item_value). If your application stores and processes a very large number of items, you can cut the memory consumption potentially by 50% by using 32-bit items instead of the standard 64-bit items. TrailDB makes this operation seamless and free: You can call the [`tdb_item_is32()`](api/#tdb_item_is32) macro to make sure the item is safe to cast to `uint32_t` -- if the result is yes, you can cast the item to `uint32_t` and use them transparently in place of the 64-bit items. ###### Finding distinct values efficiently in a trail Sometimes you are only interested in distinct values of a trail, e.g. the set of pages visited by a user. Since trails may contain many duplicate items, which are not interesting in this case, you can speed up processing by setting `TDB_OPT_ONLY_DIFF_ITEMS` with [`tdb_set_opt()`](api/#tdb_set_opt) which makes the cursor remove most of the duplicates. Since this operation is based on the internal compression scheme, it is not fully accurate. You still want to construct a proper set structure in your application but this option can make populating the set much faster. ###### Return a subset of events with event filters [Event filters](api/#event_filters) are a powerful feature for querying a subset of events in trails. Event filters support boolean queries over fields, expressed in [conjunctive normal form](http://en.wikipedia.org/wiki/Conjunctive_normal_form). 
For instance, you could query certain web browsing events with ``` action=page_view AND (page=pricing OR page=about) ``` First, you need to construct a query using [`tdb_event_filter_add_term`](api/#tdb_event_filter_add_term), which adds terms to OR clauses, and [`tdb_event_filter_new_clause`](api/#tdb_event_filter_new_clause) which adds a new clause that is connected by AND to the previous clauses. Once the filter has been constructed, you can apply it to a cursor with [`tdb_cursor_set_event_filter()`](api/#tdb_cursor_set_event_filter). After this, the cursor returns only events that match the query. Internally, the cursor still needs to evaluate every event but filters may speed up processing by discarding unwanted events on the fly. Note that the `tdb_event_filter` handle needs to be available as long as you keep using the cursor. You can use the same filter in multiple cursors. If you want to use the same filter in all cursors, call [`tdb_set_opt`](api/#tdb_set_opt) with the key `TDB_OPT_EVENT_FILTER`. This makes sure that all cursors created with this TrailDB handle will get the filter applied. You can still override the filter at the cursor level with [`tdb_cursor_set_event_filter()`](api/#tdb_cursor_set_event_filter). In effect, this defines [a view](https://en.wikipedia.org/wiki/View_(SQL)) to TrailDB. See also the entry below about materialized views. ###### Whitelist or blacklist trails (a view over a subset of trails) A special case of event filters, introduced above, are filters that match all events or no events. These filters are most commonly used to make TrailDB return only a subset of trails. To whitelist a subset of trails, do the following: First disable all trails using [`tdb_event_filter_new_match_none()`](api/#tdb_event_filter_new_match_none). Then enable selected trails with [`tdb_event_filter_new_match_all()`](api/#tdb_event_filter_new_match_all). Here is an example: ```c struct tdb_event_filter *empty = tdb_event_filter_new_match_none(); struct tdb_event_filter *all = tdb_event_filter_new_match_all(); tdb_opt_value value = {.ptr = empty}; /* first blacklist all */ tdb_set_opt(db, TDB_OPT_EVENT_FILTER, value); /* then whitelist selected trails */ value.ptr = all; for (i = 0; i < num_selected; i++) tdb_set_trail_opt(db, trail_ids[i], TDB_OPT_EVENT_FILTER, value); ``` If you want to blacklist a subset of trails instead, you can just call `tdb_set_trail_opt` with an empty filter. A benefit of this approach that `empty` and `all` are optimized internally so that no events will be evaluated for trails that have these filters enabled. You can use these filters to extract a new TrailDB that contains only a subset of trails from the source TrailDB, as described below. ###### Create TrailDB extracts (materialized views) It is possible to extract a subset of events from an existing TrailDB to a new TrailDB. A benefit of doing this is that the new TrailDB is smaller than the original, which can make queries faster. To create an extract (a materialized view), open an existing TrailDB as usual. Create [an event filter](api/#event_filters) that defines the subset of events, and set it to the TrailDB handle with [`tdb_set_opt`](api/#tdb_set_opt) using the key `TDB_OPT_EVENT_FILTER`. Now you can create the extract by calling [tdb_cons_open()](api/#tdb_cons_open) as usual, followed by [tdb_cons_append()](api/#tdb_cons_append) using the TrailDB handle initialized above. 
The [tdb_cons_append()](api/#tdb_cons_append) function will add only events matching with the filter to the new TrailDB. You can also create materialized views on the command line with the `tdb merge` command. ###### Join trails over multiple TrailDBs It is a common pattern to shard TrailDBs by time, e.g. by day. Now if you want to return the full trail of a user over K days, you need to handle each of the K TrailDBs separately. [Multi-cursors](api/#join-trails-with-multi-cursors) provide a convenient way to stich together trails in multiple (or a single) TrailDB(s). You need to initialize the cursors for each of the K TrailDBs as usual, pass them to a multi-cursor instance, and use the multi-cursor to iterate over a joined trail seamlessly. Multi-cursors work even if the underlying cursors are overlapping in time since they perform an efficient `O(Kn)` merge sort on the fly. Note that you can also apply event filters to the underlying cursors to produce a joined trail over a subset of events. # Limits - Maximum number of trails: 259 - 1 - Maximum number of events in a trail: 250 - 1 - Maximum number of distinct values in a field: 240 - 2 - Maximum number of fields: 16,382 - Maximum size of a value: 258 bytes # Internals TrailDB uses a number of different compression methods to minimize the amount of space used. In contrast to the usual approach of compressing a file for storage, and decompressing it for processing, TrailDB compresses the data when TrailDB is constructed but it never decompresses all the data again. Only the parts that are actually requested, using a lazy cursor, are decompressed on the fly. This makes TrailDB cache-friendly and hence fast to query. The data model of TrailDB enables efficient compression: Since events are grouped by UUID, which typically corresponds a user, a server, or other such logical entity, events within a trail tend to be somewhat predictable - each logical entity tends to behave in its own characteristic way. We can leverage this by only encoding every change in behavior, similar to [run-length encoding](http://en.wikipedia.org/wiki/Run-length_encoding). Another observation is that since in the TrailDB data model events are always sorted by time, we can utilize [delta-encoding](http://en.wikipedia.org/wiki/Delta_encoding) to compress 64-bit timestamps that otherwise would end up consuming a non-trivial amount of space. After these baseline transformations, we can observe that the distribution of values is typically very skewed: Some items are way more frequent than others. We also analyze the distribution of pairs of values (bigrams) for similar skewedness. The skewed, low-entropy distributions of values can be efficiently encoded using [Huffman coding](http://en.wikipedia.org/wiki/Huffman_coding). Fields that are not good candidates for entropy encoding are encoded using simpler variable-length integers. The end result is often comparable to compressing the data using Zip. The big benefit of TrailDB compared to Zip is that each trail can be decoded individually efficiently and the output of decoding is kept in an efficient integer-based format. By design, the TrailDB API encourages the use of integer-based items instead of original strings for further processing, making it easy to build high-performance applications on top of TrailDB. traildb-0.6+dfsg1/doc/docs/api.md0000600000175000017500000006456513106440271016104 0ustar czchenczchen # Functions [TOC] # Construct a new TrailDB ### tdb_cons_init Create a new TrailDB constructor handle. 
```c tdb_cons *tdb_cons_init(void) ``` Return NULL if memory allocation fails. ### tdb_cons_open Open a new TrailDB. ```c tdb_error tdb_cons_open(tdb_cons *cons, const char *root, const char **ofield_names, uint64_t num_ofields) ``` * `cons` constructor handle as returned from `tdb_cons_init`. * `root` path to new TrailDB. * `ofield_names` names of fields, each name terminated by a zero byte. * `num_ofields` number of fields. Return 0 on success, an error code otherwise. ### tdb_cons_close Free a TrailDB constructor handle. Call this after [tdb_cons_finalize()](#tdb_cons_finalize). ```c void tdb_cons_close(tdb_cons *cons) ``` * `cons` TrailDB constructor handle. ### tdb_cons_add Add an event to TrailDB. ```c tdb_error tdb_cons_add(tdb_cons *cons, const uint8_t uuid[16], const uint64_t timestamp, const char **values, const uint64_t *value_lengths) ``` * `cons` TrailDB constructor handle. * `uuid` 16-byte UUID. * `timestamp` integer timestamp. Usually Unix time. * `values` values of each field, as an array of pointers to byte strings. The order of values is the same as `ofield_names` in [tdb_cons_open()](#tdb_cons_open). * `value_lengths` lengths of byte strings in `values`. Return 0 on success, an error code otherwise. ### tdb_cons_append Merge an existing TrailDB into this constructor. The fields must be equal between the existing and the new TrailDB. ```c tdb_error tdb_cons_append(tdb_cons *cons, const tdb *db) ``` * `cons` TrailDB constructor handle. * `db` An existing TrailDB to be merged. Return 0 on success, an error code otherwise. ### tdb_cons_set_opt Set a constructor option. ```c tdb_error tdb_cons_set_opt(tdb_cons *cons, tdb_opt_key key, tdb_opt_value value); ``` Currently the supported options are: * key `TDB_OPT_CONS_OUTPUT_FORMAT` - value `TDB_OPT_CONS_OUTPUT_PACKAGE` create a one-file TrailDB (default). - value `TDB_OPT_CONS_OUTPUT_DIR` do not package TrailDB, keep a directory. * key `TDB_OPT_CONS_NO_BIGRAMS` - value `0` to enable bigram-based size optimization at TrailDB finalization (default). This decreases the size of the resulting TrailDB at the cost of increased compression time. - value `1` to disable bigram-based size optimization at TrailDB finalization. Return 0 on success, an error code otherwise. ### tdb_cons_get_opt Get a constructor option. ```c tdb_error tdb_cons_get_opt(tdb_cons *cons, tdb_opt_key key, tdb_opt_value *value); ``` See [tdb_cons_set_opt()](#tdb_cons_set_opt) for valid keys. Sets the `value` to the current value of the key. Return 0 on success, an error code otherwise. ### tdb_cons_finalize Finalize TrailDB construction. Finalization takes care of compacting the events and creating a valid TrailDB file. ```c tdb_error tdb_cons_finalize(tdb_cons *cons) ``` * `cons` TrailDB constructor handle. Return 0 on success, an error code otherwise. # Open a TrailDB and access metadata ### tdb_init Create a new TrailDB handle. ```c tdb *tdb_init(void) ``` Return NULL if memory allocation fails. ### tdb_open Open a TrailDB for reading. ```c tdb_error tdb_open(tdb *tdb, const char *root) ``` * `tdb` TrailDB handle returned by [tdb_init()](#tdb_init). * `root` path to TrailDB. Return 0 on success, an error code otherwise. ### tdb_close Close a TrailDB. ```c void tdb_close(tdb *db) ``` * `db` TrailDB handle. ### tdb_dontneed Inform the operating system that this TrailDB does not need to be kept in memory. ```c void tdb_dontneed(tdb *db) ``` * `db` TrailDB handle. ### tdb_willneed Inform the operating system that this TrailDB will be accessed soon.
Call this after [tdb_dontneed()](#tdb_dontneed) once the TrailDB is needed again. ```c void tdb_willneed(tdb *db) ``` * `db` TrailDB handle. ### tdb_num_trails Get the number of trails. ``` uint64_t tdb_num_trails(const tdb *db) ``` * `db` TrailDB handle. ### tdb_num_events Get the number of events. ```c uint64_t tdb_num_events(const tdb *db) ``` * `db` TrailDB handle. ### tdb_num_fields Get the number of fields. ``` uint64_t tdb_num_fields(const tdb *db) ``` * `db` TrailDB handle. ### tdb_min_timestamp Get the oldest timestamp. ``` uint64_t tdb_min_timestamp(const tdb *db) ``` * `db` TrailDB handle. ### tdb_max_timestamp Get the newest timestamp. ```c uint64_t tdb_max_timestamp(const tdb *db) ``` * `db` TrailDB handle. ### tdb_version Get the version. ```c uint64_t tdb_version(const tdb *db) ``` * `db` TrailDB handle. ### tdb_error_str Translate an error code to a string. ``` const char *tdb_error_str(tdb_error errcode) ``` Return a string description corresponding to the error code. The string is owned by TrailDB so the caller does not need to free it. # Setting Options TrailDB supports cascading options. You can set top-level options with [tdb_set_opt()](#tdb_set_opt) which are inherited by all operations performed with the handle. You can override top-level options for individual trails using [tdb_set_trail_opt()](#tdb_set_trail_opt). Finally you can override top-level and trail-level event filters with [tdb_cursor_set_event_filter()](#tdb_cursor_set_event_filter). ### tdb_set_opt Set a top-level option. ```c tdb_error tdb_set_opt(tdb *db, tdb_opt_key key, tdb_opt_value value); ``` Currently the supported options are: * key `TDB_OPT_ONLY_DIFF_ITEMS` - value: `0` - Cursors should return all items (default). - value: `1` - Cursors should return mostly distinct items. * key `TDB_OPT_CURSOR_EVENT_BUFFER_SIZE` - value: `number of events` - Set the size of the cursor readahead buffer. * key `TDB_OPT_EVENT_FILTER` - value: pointer to `const struct tdb_event_filter*` as returned by [tdb_event_filter_new()](#tdb_event_filter_new). The filter is applied to all cursors that use this `db` handle at the next call to [tdb_get_trail()](#tdb_get_trail). The event filter must stay alive for the lifetime of the `db` handle or until the filter is disabled by calling this function with `value.ptr = NULL`. Return 0 on success, an error code otherwise. ### tdb_get_opt Get a top-level option. ```c tdb_error tdb_get_opt(tdb *db, tdb_opt_key key, tdb_opt_value *value); ``` See [tdb_set_opt()](#tdb_set_opt) for valid keys. Sets the `value` to the current value of the key. Return 0 on success, an error code otherwise. ### tdb_set_trail_opt Set a trail-level option. These options override top-level options set with [tdb_set_opt()](#tdb_set_opt) for an individual trail at `trail_id`. ```c tdb_error tdb_set_trail_opt(tdb *db, uint64_t trail_id, tdb_opt_key key, tdb_opt_value value); ``` Currently the supported options are: * key `TDB_OPT_EVENT_FILTER` - value: pointer to `const struct tdb_event_filter*` as returned by [tdb_event_filter_new()](#tdb_event_filter_new). The filter is applied to all cursors that use this `db` handle at the next call to [tdb_get_trail()](#tdb_get_trail). The event filter must stay alive for the lifetime of the `db` handle or until the filter is disabled by calling this function with `value.ptr = NULL`. Return 0 on success, an error code otherwise. 
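For illustration, the following fragment applies an event filter to a single trail that is looked up by its UUID. It is a sketch only: the handle `db`, the filter `f` and the hex UUID are assumed to exist already, and error handling is kept minimal.

```c
/* Apply an existing event filter 'f' to one trail only, leaving the
   rest of the TrailDB unaffected. */
const uint8_t *hexuuid =
    (const uint8_t*)"0123456789abcdef0123456789abcdef";  /* placeholder UUID */
uint8_t uuid[16];
uint64_t trail_id;
tdb_opt_value value = {.ptr = f};

if (tdb_uuid_raw(hexuuid, uuid) == 0 &&
    tdb_get_trail_id(db, uuid, &trail_id) == 0)
    tdb_set_trail_opt(db, trail_id, TDB_OPT_EVENT_FILTER, value);
```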
### tdb_get_trail_opt Get a trail-level option ```c tdb_error tdb_get_trail_opt(tdb *db, uint64_t trail_id, tdb_opt_key key, tdb_opt_value *value); ``` See [tdb_set_trail_opt()](#tdb_set_trail_opt) for valid keys. Sets the `value` to the current value of the key. Return 0 on success, an error code otherwise. # Working with items, fields and values See [TrailDB Data Model](technical_overview/#data-model) for a description of items, fields, and values. For maximum performance, it is a good idea to use `tdb_item`s as extensively as possible in your application when working with TrailDBs. Convert items to strings only when really needed. ### tdb_item_field Extract the field ID from an item. ```c tdb_field tdb_item_field(tdb_item item) ``` * `item` an item. Return a field ID. ### tdb_item_val Extract the value ID from an item. ```c tdb_val tdb_item_val(tdb_item item) ``` * `item` an item. Return a value ID. ### tdb_make_item Make an item given a field ID and a value ID. ```c tdb_item tdb_make_item(tdb_field field, tdb_val val) ``` * `field` field ID. * `val` value ID. Return a new item. ### tdb_item_is32 Determine if an item can be safely cast to a 32-bit integer. You can use this function to help to conserve memory by casting items to 32-bit integers instead of default 64-bit items. ```c int tdb_item_is32(tdb_item item) ``` * `item` an item Return non-zero if you can cast this item to 32-bit integer without loss of data. ### tdb_lexicon_size Get the number of distinct values in the given field. ```c uint64_t tdb_lexicon_size(const tdb *db, tdb_field field); ``` * `db` TrailDB handle. * `field` field ID. Returns the number of distinct values. ### tdb_get_field Get the field ID given a field name. ```c tdb_error tdb_get_field(tdb *db, const char *field_name, tdb_field *field) ``` * `db` TrailDB handle. * `field_name` field name (zero-terminated string). * `field` pointer to variable to store field ID in. Return 0 on success, an error code otherwise (field not found). ### tdb_get_field_name Get the field name given a field ID. ```c const char *tdb_get_field_name(tdb *db, tdb_field field) ``` * `db` TrailDB handle. * `field` field ID. Return the field name or NULL if field ID is invalid. The string is owned by TrailDB so the caller does not need to free it. ### tdb_get_item Get the item corresponding to a value. Note that this is a relatively slow operation that may need to scan through all values in the field. ```c tdb_item tdb_get_item(tdb *db, tdb_field field, const char *value, uint64_t value_length) ``` * `db` TrailDB handle. * `field` field ID. * `value` value byte string. * `value_length` length of the value. Return 0 if item was not found, a valid item otherwise. ### tdb_get_value Get the value corresponding to a field ID and value ID pair. ```c const char *tdb_get_value(tdb *db, tdb_field field, tdb_val val, uint64_t *value_length) ``` * `db` TrailDB handle. * `field` field ID. * `val` value ID. * `value_length` length of the returned byte string. Return a byte string corresponding to the field-value pair or NULL if value was not found. The string is owned by TrailDB so the caller does not need to free it. ### tdb_get_item_value Get the value corresponding to an item. This is a shorthand version of [tdb_get_value()](#tdb_get_value). ```c const char *tdb_get_item_value(tdb *db, tdb_item item, uint64_t *value_length) ``` * `db` TrailDB handle. * `item` an item. * `value_length` length of the returned byte string. 
Return a byte string corresponding to the field-value pair or NULL if value was not found. The string is owned by TrailDB so the caller does not need to free it. # Working with UUIDs Each trail has a user-defined [16-byte UUID](http://en.wikipedia.org/wiki/UUID) and a sequential 64-bit trail ID associated to it. ### tdb_get_uuid Get the UUID given a trail ID. This is a fast O(1) operation. ```c const uint8_t *tdb_get_uuid(const tdb *db, uint64_t trail_id) ``` * `db` TrailDB handle. * `trail_id` trail ID (an integer between 0 and [tdb_num_trails()](#tdb_num_trails)). Return a raw 16-byte UUID or NULL if trail ID is invalid. ### tdb_get_trail_id Get the trail ID given a UUID. This is an O(log N) operation. ```c tdb_error tdb_get_trail_id(const tdb *db, const uint8_t uuid[16], uint64_t *trail_id) ``` * `db` TrailDB handle. * `uuid` a raw 16-byte UUID. * `trail_id` output pointer to the trail ID. Return 0 if UUID was found, an error code otherwise. ### tdb_uuid_raw Translate a 32-byte hex-encoded UUID to a 16-byte UUID. ```c tdb_error tdb_uuid_raw(const uint8_t hexuuid[32], uint8_t uuid[16]) ``` * `hexuuid` source 32-byte hex-encoded UUID. * `uuid` destination 16-byte UUID. Return 0 on success, an error code if `hexuuid` is not a valid hex-encoded string. ### tdb_uuid_hex Translate a 16-byte UUID to a 32-byte hex-encoded UUID. ``` void tdb_uuid_hex(const uint8_t uuid[16], uint8_t hexuuid[32]) ``` * `uuid` source 16-byte UUID. * `hexuuid` destination 32-byte hex-encoded UUID. # Query events with cursors ### tdb_cursor_new Create a new cursor handle. ```c tdb_cursor *tdb_cursor_new(const tdb *db) ``` * `db` TrailDB handle. Return NULL if memory allocation fails. ### tdb_cursor_free Free a cursor handle. ```c void tdb_cursor_free(tdb_cursor *cursor) ``` ### tdb_get_trail Reset the cursor to the given trail ID. ```c tdb_error tdb_get_trail(tdb_cursor *cursor, uint64_t trail_id) ``` * `cursor` cursor handle. * `trail_id` trail ID (an integer between 0 and [tdb_num_trails()](#tdb_num_trails)). Return 0 or an error code if trail ID is invalid. ### tdb_get_trail_length Get the number of events remaining in this cursor. ```c uint64_t tdb_get_trail_length(tdb_cursor *cursor); ``` * `cursor` cursor handle. Return the number of events in this cursor. Note that this function consumes the cursor. You need to reset it with [tdb_get_trail()](#tdb_get_trail) to get more events. ### tdb_cursor_set_event_filter Set an event filter for the cursor. See [filter events](#filter-events) for more information about event filters. ```c tdb_error tdb_cursor_set_event_filter(tdb_cursor *cursor, const struct tdb_event_filter *filter); ``` * `cursor` cursor handle. * `filter` filter handle. Return 0 on success or an error if this cursor does not support event filtering (`TDB_OPT_ONLY_DIFF_ITEMS` is enabled). Note that this function borrows `filter` so it needs to stay alive as long as the cursor is being used. You can use the same `filter` in multiple cursors. ### tdb_cursor_unset_event_filter Remove an event filter from the cursor. ```c void tdb_cursor_unset_event_filter(tdb_cursor *cursor); ``` * `cursor` cursor handle. ### tdb_cursor_next Consume the next event from the cursor. ```c const tdb_event *tdb_cursor_next(tdb_cursor *cursor) ``` * `cursor` cursor handle. Return an event struct or NULL if the cursor has no more events. The event structure is defined as follows: ```c typedef struct{ uint64_t timestamp; uint64_t num_items; const tdb_item items[0]; } tdb_event; ``` `tdb_event` represents one event in the trail. 
Each event has a timestamp, and a number of field-value pairs, encoded as items. ### tdb_cursor_peek Return the next event from the cursor without consuming it. ```c const tdb_event *tdb_cursor_peek(tdb_cursor *cursor) ``` * `cursor` cursor handle. See [tdb_cursor_next](#tdb_cursor_next) for more details about `tdb_event`. # Join trails with multi-cursors A multi-cursor merges multiple trails represented by `tdb_cursor` together to produce a single merged trail that has its events sorted in the ascending timestamp order. The trails can originate from a single TrailDB or multiple separate TrailDBs. In effect, a multi-cursor performs an efficient merge sort of the underlying trails on the fly. You need to initialize all underlying `tdb_cursor`s to point at the desired trails with [tdb_get_trail](#tdb_get_trail) as usual. Then, call [tdb_multi_cursor_reset](#tdb_multi_cursor_reset) to reset the multi-cursor state. After this, you can iterate over the joined trail with [tdb_multi_cursor_next](#tdb_multi_cursor_next), event by event, or get multiple joined events with a single call using [tdb_multi_cursor_next_batch](#tdb_multi_cursor_next_batch). You can repeat these steps for arbitrarily many trails using the same handles. ### tdb_multi_cursor_new Create a new multi-cursor handle. ```c tdb_multi_cursor *tdb_multi_cursor_new(tdb_cursor **cursors, uint64_t num_cursors) ``` * `cursors` a list of cursors to be merged. * `num_cursors` number of cursors in `cursors` Return NULL if memory allocation fails. ### tdb_multi_cursor_free Free a multi-cursor handle. ```c void tdb_multi_cursor_free(tdb_multi_cursor *mcursor) ``` * `mcursor` a multi-cursor handle ### tdb_multi_cursor_reset Reset a multi-cursor handle to reflect the state of the underlying cursors. Call this function every time after [tdb_get_trail](#tdb_get_trail). ```c void tdb_multi_cursor_reset(tdb_multi_cursor *mcursor); ``` * `mcursor` a multi-cursor handle ### tdb_multi_cursor_next Consume the next event, in the ascending timestamp order, from the underlying cursors. ```c const tdb_multi_event *tdb_multi_cursor_next(tdb_multi_cursor *mcursor) ``` * `mcursor` a multi-cursor handle Return a `tdb_multi_event` struct or NULL if the cursor has no more events. The multi-event structure is defined as follows: ```c typedef struct{ const tdb *db; const tdb_event *event; uint64_t cursor_idx; } tdb_multi_event; ``` `db` is a TrailDB handle to the TrailDB that contains this `event`. See [tdb_cursor_next](#tdb_cursor_next) for more details about `tdb_event`. Use the `db` handle to translate `event->items` to values. The `cursor_idx` index points to the array of cursors given in [tdb_multi_cursor_new](#tdb_multi_cursor_new). ### tdb_multi_cursor_next_batch An optimized version of [tdb_multi_cursor_next](#tdb_multi_cursor_next). Instead of returning a single event, this function returns an array of events with a single function call. ```c uint64_t tdb_multi_cursor_next_batch(tdb_multi_cursor *mcursor, tdb_multi_event *events, uint64_t max_events); ``` * `mcursor` a multi-cursor handle * `events` a pre-allocated array of `tdb_multi_event` structs * `max_events` size of the `events` array Returns the number of events added to `events`, at most `max_events`. If the value returned is 0, all events have been exhausted. See [tdb_multi_cursor_next](#tdb_multi_cursor_next) for the definition of `tdb_multi_event`. Note that the pointers in the `events` array are valid only until the next call to one of the multi-cursor functions. If you want to persist the underlying events, you should copy them to another data structure.
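Putting the pieces together, here is a minimal sketch that joins the trail of one UUID across two TrailDBs. It is illustrative only: it assumes handles `db1` and `db2` are already open and that `uuid` is a 16-byte UUID present in both, and error handling is omitted.

```c
tdb_cursor *cursors[2] = {tdb_cursor_new(db1), tdb_cursor_new(db2)};
tdb_multi_cursor *mc = tdb_multi_cursor_new(cursors, 2);
uint64_t trail_id;
const tdb_multi_event *me;

/* Point each underlying cursor at the trail of the same UUID */
if (tdb_get_trail_id(db1, uuid, &trail_id) == 0)
    tdb_get_trail(cursors[0], trail_id);
if (tdb_get_trail_id(db2, uuid, &trail_id) == 0)
    tdb_get_trail(cursors[1], trail_id);

/* Reset after (re)positioning the cursors, then iterate the joined trail */
tdb_multi_cursor_reset(mc);
while ((me = tdb_multi_cursor_next(mc))){
    /* me->event->timestamp is non-decreasing across both sources;
       translate me->event->items with the originating handle me->db */
}

tdb_multi_cursor_free(mc);
tdb_cursor_free(cursors[0]);
tdb_cursor_free(cursors[1]);
```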
### tdb_multi_cursor_peek Return the next event, in the ascending timestamp order, from the underlying cursors without consuming it. ```c const tdb_multi_event *tdb_multi_cursor_peek(tdb_multi_cursor *mcursor); ``` * `mcursor` a multi-cursor handle See [tdb_multi_cursor_next](#tdb_multi_cursor_next) for the definition of `tdb_multi_event`. # Filter events An event filter is a boolean query over fields, expressed in [conjunctive normal form](http://en.wikipedia.org/wiki/Conjunctive_normal_form). Once [assigned to a cursor](#tdb_cursor_set_event_filter), only the subset of events that match the query are returned. See [technical overview](technical_overview/#return-a-subset-of-events-with-event-filters) for more information. ### tdb_event_filter_new Create a new event filter handle. ```c struct tdb_event_filter *tdb_event_filter_new(void) ``` Return NULL if memory allocation fails. ### tdb_event_filter_new_match_none Create a new event filter handle that is optimized to match no events. Commonly used to [create a view over a subset of trails](technical_overview/#whitelist-or-blacklist-trails-a-view-over-a-subset-of-trails). ```c struct tdb_event_filter *tdb_event_filter_new_match_none(void) ``` Return NULL if memory allocation fails. ### tdb_event_filter_new_match_all Create a new event filter handle that is optimized to match all events. Commonly used to [create a view over a subset of trails](technical_overview/#whitelist-or-blacklist-trails-a-view-over-a-subset-of-trails). ```c struct tdb_event_filter *tdb_event_filter_new_match_all(void) ``` Return NULL if memory allocation fails. ### tdb_event_filter_free Free an event filter handle. ```c void tdb_event_filter_free(struct tdb_event_filter *filter) ``` ### tdb_event_filter_add_term Add a term (item) to the query. This item is attached to the current clause with OR. You can make the item negative by setting `is_negative` to non-zero. ```c tdb_error tdb_event_filter_add_term(struct tdb_event_filter *filter, tdb_item term, int is_negative) ``` * `filter` filter handle. * `term` an item to be included in the clause. * `is_negative` is this item negative? Return 0 on success, an error code otherwise (out of memory). ### tdb_event_filter_add_time_range Add a time-range term to the query. This item is attached to the current clause with OR. Finds events with timestamp `t` such that `start_time <= t < end_time`. ```c tdb_error tdb_event_filter_add_time_range(struct tdb_event_filter *filter, uint64_t start_time, uint64_t end_time) ``` * `filter` filter handle. * `start_time` (inclusive) start of time range * `end_time` (exclusive) end of time range Return 0 on success, an error code otherwise (out of memory or invalid time range). ### tdb_event_filter_new_clause Add a new clause to the query. The new clause is attached to the query with AND. ```c tdb_error tdb_event_filter_new_clause(struct tdb_event_filter *filter) ``` * `filter` filter handle. Return 0 on success, an error code otherwise (out of memory). ### tdb_event_filter_num_clauses Get the number of clauses in this filter. ```c uint64_t tdb_event_filter_num_clauses(const struct tdb_event_filter *filter); ``` * `filter` filter handle. Return the number of clauses. Note that a new filter has one clause by default, so the return value is always at least one. ### tdb_event_filter_num_terms Get the number of terms in a clause of this filter.
```c tdb_error tdb_event_filter_num_terms(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t *num_terms); ``` * `filter` filter handle. * `clause_index` clause index: `0 <= clause_index < tdb_event_filter_num_clauses()`. * `num_terms` returns the number of terms in the clause. Returns 0 (`TDB_ERR_OK`) if the given clause exists, otherwise `TDB_ERR_NO_SUCH_ITEM`. ### tdb_event_filter_get_term_type Get the type of a term in a clause in this filter. ```c tdb_error tdb_event_filter_get_term_type(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t term_index, tdb_event_filter_term_type *term_type); ``` * `filter` filter handle. * `clause_index` clause index: `0 <= clause_index < tdb_event_filter_num_clauses()`. * `term_index` term index: `0 <= term_index < tdb_event_filter_num_terms()`. * `term_type` returns the term type. If the term was found, then the function returns 0 and `term_type` is either `TDB_EVENT_FILTER_MATCH_TERM` or `TDB_EVENT_FILTER_TIME_RANGE_TERM`. Otherwise, if the clause or term do not exist, the function returns `TDB_ERR_NO_SUCH_ITEM` and `term_type` is `TDB_EVENT_FILTER_UNKNOWN_TERM`. ### tdb_event_filter_get_item Get an item added to this filter. ```c tdb_error tdb_event_filter_get_item(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t item_index, tdb_item *item, int *is_negative) ``` * `filter` filter handle. * `clause_index` clause index: `0 <= clause_index < tdb_event_filter_num_clauses()`. * `item_index` item index in the clause: `0 <= item_index < tdb_event_filter_num_terms()`. * `item` returned item. * `is_negative` returns 1 if the item is negative, 0 otherwise, as set in [tdb_event_filter_add_term](#tdb_event_filter_add_term). Returns 0 if an item was found at this location and is a match term. If the clause or term do not exist, `TDB_ERR_NO_SUCH_ITEM` is returned. Note that empty clauses always return `TDB_ERR_NO_SUCH_ITEM` although the clauses themselves are valid. Lastly, if you try to call `tdb_event_filter_get_item` on a time-range term, then `TDB_ERR_INCORRECT_TERM_TYPE` is returned. ### tdb_event_filter_get_time_range Get a time-range term from a clause in this filter. ```c tdb_error tdb_event_filter_get_time_range(const struct tdb_event_filter *filter, uint64_t clause_index, uint64_t term_index, uint64_t *start_time, uint64_t *end_time) ``` * `filter` filter handle. * `clause_index` clause index: `0 <= clause_index < tdb_event_filter_num_clauses()`. * `term_index` term index in the clause: `0 <= term_index < tdb_event_filter_num_terms()`. * `start_time` start time (inclusive) of the time range. * `end_time` end time (exclusive) of the time range. Returns 0 if a time-range term was found at this location. If the clause or term do not exist, `TDB_ERR_NO_SUCH_ITEM` is returned. Note that empty clauses always return `TDB_ERR_NO_SUCH_ITEM` although the clauses themselves are valid. Lastly, if you try to call `tdb_event_filter_get_time_range` on a match term, then `TDB_ERR_INCORRECT_TERM_TYPE` is returned.
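As a usage sketch, the fragment below builds the query from the technical overview, `action=page_view AND (page=pricing OR page=about)`, and applies it to a cursor. It assumes an already-opened handle `db` with fields named `action` and `page`; the field and value names are examples only, and error handling and `#include <stdio.h>` are omitted for brevity.

```c
tdb_field action, page;
tdb_get_field(db, "action", &action);
tdb_get_field(db, "page", &page);

/* First clause: action=page_view */
struct tdb_event_filter *f = tdb_event_filter_new();
tdb_event_filter_add_term(f, tdb_get_item(db, action, "page_view", 9), 0);

/* Second clause, ANDed with the first: page=pricing OR page=about */
tdb_event_filter_new_clause(f);
tdb_event_filter_add_term(f, tdb_get_item(db, page, "pricing", 7), 0);
tdb_event_filter_add_term(f, tdb_get_item(db, page, "about", 5), 0);

tdb_cursor *cursor = tdb_cursor_new(db);
uint64_t trail_id;
for (trail_id = 0; trail_id < tdb_num_trails(db); trail_id++){
    const tdb_event *event;
    tdb_get_trail(cursor, trail_id);
    /* set the filter after tdb_get_trail so it is in effect for this
       trail regardless of any handle-level or trail-level options */
    tdb_cursor_set_event_filter(cursor, f);
    while ((event = tdb_cursor_next(cursor)))
        printf("trail %llu: matching event at %llu\n",
               (unsigned long long)trail_id,
               (unsigned long long)event->timestamp);
}

tdb_cursor_free(cursor);
tdb_event_filter_free(f);
```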
Here is an example of how to deconstruct a filter back to clauses and items: ```c uint64_t clause, term, num_terms; for (clause = 0; clause < tdb_event_filter_num_clauses(filter); clause++){ tdb_item item; uint64_t start_time, end_time; tdb_event_filter_term_type term_type; int is_negative; tdb_error ret; if (tdb_event_filter_num_terms(filter, clause, &num_terms) != TDB_ERR_OK) continue; for (term = 0; term < num_terms; term++){ tdb_event_filter_get_term_type(filter, clause, term, &term_type); if (term_type == TDB_EVENT_FILTER_MATCH_TERM){ ret = tdb_event_filter_get_item(filter, clause, term, &item, &is_negative); if (ret == TDB_ERR_OK){ /* do something with 'item' at 'term' in 'clause' */ } } else if (term_type == TDB_EVENT_FILTER_TIME_RANGE_TERM){ ret = tdb_event_filter_get_time_range(filter, clause, term, &start_time, &end_time); if (ret == TDB_ERR_OK){ /* do something with 'start_time' and 'end_time' at 'term' in 'clause' */ } } } } ``` traildb-0.6+dfsg1/doc/docs/extra_css/0000700000175000017500000000000013106440271016762 5ustar czchenczchentraildb-0.6+dfsg1/doc/docs/extra_css/traildb.css0000600000175000017500000000077613106440271021131 0ustar czchenczchenhtml { font-size: 100%; } .wy-side-nav-search, .wy-nav-top { background-color: #00a7e1; } .wy-side-nav-search .icon-home, .wy-side-nav-search .icon-home:hover { text-indent: -99999em; background-image: url('http://traildb.io/images/tdb_logo@2x.png'); background-size: auto 100%; background-repeat: no-repeat; background-position: center; height: 5em; width: 10em; } .wy-menu .toctree-l4{ display: none; } h6{ margin-bottom: 0.5em; } code{ font-size: 80%; } traildb-0.6+dfsg1/doc/docs/extra_css/multilang.css0000600000175000017500000000106613106440271021475 0ustar czchenczchen*[data-multilang] { margin-bottom: 1em; } *[data-multilang] pre { margin: 0; } .multilang-header { overflow: hidden; margin-bottom: 0.5em; } .multilang-title { font-weight: bold; float: left; line-height: 32px; } .multilang-btn-group { float: right; } .multilang-btn-group .btn { border-radius: 0; } .multilang-btn-group .btn:first-child { border-top-left-radius: 2px; border-bottom-left-radius: 2px; } .multilang-btn-group .btn:last-child { border-top-right-radius: 2px; border-bottom-right-radius: 2px; } traildb-0.6+dfsg1/doc/docs/extra_javascript/0000700000175000017500000000000013106440271020340 5ustar czchenczchentraildb-0.6+dfsg1/doc/docs/extra_javascript/multilang.js0000600000175000017500000000455113106440271022701 0ustar czchenczchen$(function() { $('[data-multilang]').each(function() { var $header = $('
').addClass('multilang-header'); // Replace title attribute with visible title var $title = $('
').text(this.title).addClass('multilang-title'); $(this).removeAttr('title'); $header.append($title); // Create button for each code block var $codeBlocks = $(this).find('pre > code'), $btnGroup = $('
').addClass('multilang-btn-group'); $codeBlocks.each(function(index) { // Make sure block is highlighted if (!$(this).hasClass('hljs')) { hljs.highlightBlock(this); } // Parse language key from code block var lang = parseLang(this); var $btn = $('