javascript - How to create perfect hash with ASCII symbols as input, where output hash is always the same for each ASCII sequenc

时间: 2025-01-06 admin 业界

I have this code so far, which performs some kind of hashing. The goal is to map each ASCII string to a single Hangul Syllable unicode point:

const HANGUL_START = 0xAC00; // Start of Hangul Syllables block
const HANGUL_END = 0xD7A3; // End of Hangul Syllables block
const HANGUL_COUNT = HANGUL_END - HANGUL_START + 1; // 11,172 syllables

// Stable hash function
const hashString = (input) => {
    return Array.from(input).reduce(
        (hash, char) => hash * 31n + BigInt(char.charCodeAt(0)),
        0n
    );
};

// Map input strings to unique Hangul syllables
const mapToHangul = (inputs) => {
    // Compute hashes for all inputs
    const hashes = inputs.map((input) => ({
        input,
        hash: hashString(input),
    }));

    // Sort inputs by their hash values
    hashes.sort((a, b) => (a.hash < b.hash ? -1 : a.hash > b.hash ? 1 : 0));

    // Map each input to a unique Hangul syllable based on its rank
    const result = new Map();
    hashes.forEach((item, index) => {
        const hangulCode = HANGUL_START + (index % HANGUL_COUNT);
        result.set(item.input, String.fromCharCode(hangulCode));
    });

    return result;
};

// Example usage
const demonstrateMapping = () => {
    const inputs = ["Hello", "World", "ASCII", "Test123", "ExtraInput"];
    const mapping = mapToHangul(inputs);

    console.log("Mapping:");
    mapping.forEach((hangul, input) => {
        console.log(`${input} -> ${hangul}`);
    });
};

demonstrateMapping();

I have this code so far, which performs some kind of hashing. The goal is to map each ASCII string to a single Hangul Syllable unicode point:

const HANGUL_START = 0xAC00; // Start of Hangul Syllables block
const HANGUL_END = 0xD7A3; // End of Hangul Syllables block
const HANGUL_COUNT = HANGUL_END - HANGUL_START + 1; // 11,172 syllables

// Stable hash function
const hashString = (input) => {
    return Array.from(input).reduce(
        (hash, char) => hash * 31n + BigInt(char.charCodeAt(0)),
        0n
    );
};

// Map input strings to unique Hangul syllables
const mapToHangul = (inputs) => {
    // Compute hashes for all inputs
    const hashes = inputs.map((input) => ({
        input,
        hash: hashString(input),
    }));

    // Sort inputs by their hash values
    hashes.sort((a, b) => (a.hash < b.hash ? -1 : a.hash > b.hash ? 1 : 0));

    // Map each input to a unique Hangul syllable based on its rank
    const result = new Map();
    hashes.forEach((item, index) => {
        const hangulCode = HANGUL_START + (index % HANGUL_COUNT);
        result.set(item.input, String.fromCharCode(hangulCode));
    });

    return result;
};

// Example usage
const demonstrateMapping = () => {
    const inputs = ["Hello", "World", "ASCII", "Test123", "ExtraInput"];
    const mapping = mapToHangul(inputs);

    console.log("Mapping:");
    mapping.forEach((hangul, input) => {
        console.log(`${input} -> ${hangul}`);
    });
};

demonstrateMapping();

But, as I pieced together from various ideas online, I don't feel like it will handle the case where, at first say I have 1000 ASCII strings (from 1 to 5 characters let's say, each), and I compute the hashes. I get one set of hash values. Then I add 20 more ASCII strings and position them at arbitrary points in the ASCII strings array, then compute the hashes again. There will either be collisions (I can't have any collisions), or if I use a used cache (Map), then the hashes will depend on the order of the input, which I also don't want. A variant which has a sort of cache is here:

const HANGUL_START = 0xAC00; // Start of Hangul Syllables block
const HANGUL_END = 0xD7A3; // End of Hangul Syllables block
const HANGUL_COUNT = HANGUL_END - HANGUL_START + 1n; // 11,172 syllables

// Hash function: Purely maps a string to a bigint
const hashString = (input: string): bigint => {
    return Array.from(input).reduce(
        (hash, char) => hash * 31n + BigInt(char.charCodeAt(0)),
        0n
    );
};

// Maps a hash to a Hangul syllable deterministically
const mapHashToHangul = (hash: bigint): string => {
    const normalizedHash = Number(hash % HANGUL_COUNT);
    return String.fromCharCode(HANGUL_START + normalizedHash);
};

// Main function: Convert multiple strings into unique Hangul syllables
const hashMultipleToHangul = (inputs: string[]): Map<string, string> => {
    const usedCodes = new Set<number>(); // Track used Hangul syllables
    const resultMap = new Map<string, string>();

    for (const input of inputs) {
        const hash = hashString(input);
        let hangulCode = HANGUL_START + Number(hash % HANGUL_COUNT);

        // Resolve collisions by finding the next available Hangul code
        while (usedCodes.has(hangulCode)) {
            hangulCode = (hangulCode + 1 - HANGUL_START) % HANGUL_COUNT + HANGUL_START;
        }

        usedCodes.add(hangulCode);
        resultMap.set(input, String.fromCharCode(hangulCode));
    }

    return resultMap;
};

// Example usage
const inputs = [
    "Hello",
    "World",
    "ASCII",
    "Test123",
    "HelloWorld",
    "AnotherTest",
    "FinalString",
    "ExtraInput"
];

const results = hashMultipleToHangul(inputs);
console.log("Hashed Hangul results:");
for (const [input, hangul] of results.entries()) {
    console.log(`${input} -> ${hangul}`);
}

For reference, my input ASCII strings are the i field from this dynamically generated dataset (and the x: getSingleGlyph() is what I want to replace with something more deterministic):

import st from '@lancejpollard/script-tree'

// https://en.wikipedia.org/wiki/Hangul_Syllables
// first hangul jamo syllable
const Xi = parseInt('AC00', 16)
// last hangul jamo syllable
const Xf = parseInt('D7A3', 16)
let X = Xi

const m = {
  u: {
    grave: '\u0300',
    acute: '\u0301',
    dacute: '\u030B',
    dgrave: '\u030F',
    up: '\u0302',
    down: '\u030C',
    dot: '\u0307',
    ddot: '\u0308',
    ring: '\u030A',
    tilde: '\u0303',
    macron: '\u0304',
    hook: '\u0309',
  },
  d: {
    grave: '\u0316',
    acute: '\u0317',
    ring: '\u0325',
    dot: '\u0323',
    ddot: '\u0324',
    down: '\u032C',
    tilde: '\u0330',
    macron: '\u0331',
    cedilla: '\u0327',
    up: '\u032D',
    hook: '\u0328',
  },
}

const D: Record<string, string> = {
  '--': m.u.dgrave,
  '-': m.u.grave,
  '++': m.u.dacute,
  '+': m.u.acute,
  '//': `${m.d.hook}${m.u.dacute}`, // rising 2 (vietnamese ngã)
  '/': `${m.d.hook}${m.u.acute}`, // rising (vietnamese sắc)
  '\\/': m.u.down, // falling rising (vietnamese hỏi)
  '/\\': m.u.up, // rising falling
  '\\\\': `${m.d.hook}${m.u.dgrave}`, // falling 2 (vietnamese nặng)
  '\\': `${m.d.hook}${m.u.grave}`, // falling (vietnamese huyền)
  '^': m.u.dot, // accent/stress mark
  $: m.d.ddot,
  '&': m.d.tilde,
  _: m.u.macron, // long vowel
  '@': m.d.grave, // non-syllabic
  '!': m.d.macron, // short vowel
  '': '',
}

const G: Record<string, string> = {
  I: `i${m.d.dot}`,
  E: `e${m.d.dot}`,
  A: `a${m.d.dot}`,
  O: `o${m.d.dot}`,
  U: `u${m.d.dot}`,
  i: `i`,
  e: `e`,
  a: `a`,
  o: `o`,
  u: `u`,
}

export type Take = {
  i: string
  x: string
  o: string
  name?: string
  o2?: string
}

export const VOWELS: Array<Take> = []

export const BASE_VOWEL_GLYPHS = [
  'I',
  'E',
  'A',
  'O',
  'U',
  'i',
  'e',
  'a',
  'o',
  'u',
]
export const TONE_MARKS = [
  '--',
  '-',
  '++',
  '+',
  '/',
  '//',
  '\\/',
  '/\\',
  '\\\\',
  '\\',
  '',
]
export const VARIANT_MARKS = ['$', '']
export const NASAL_MARKS = ['&', '']
export const DURATION_MARKS = ['_', '!', '']
export const SYLLABIC_MARKS = ['@', '']
export const ACCENT_MARKS = ['^', '']

BASE_VOWEL_GLYPHS.forEach(g => {
  ACCENT_MARKS.forEach(a => {
    DURATION_MARKS.forEach(l => {
      SYLLABIC_MARKS.forEach(s => {
        NASAL_MARKS.forEach(n => {
          VARIANT_MARKS.forEach(v => {
            TONE_MARKS.forEach(t => {
              const i = `${g}${v}${n}${s}${t}${l}${a}`
              const x = g.match(/i/i) && a === '^' ? 'ï' : G[g]
              const y = x === 'ï' ? '' : D[a]
              const x2 = v === '$' && x === 'u' ? 'r' : x
              const v2 = v === '$' && g === 'u' ? '' : `${D[v]}`
              const o =
                l === '!'
                  ? `${x2}${y}${D[l]}${D[n]}${D[s]}${D[t]}${v2}`
                  : `${x2}${D[l]}${D[n]}${D[s]}${D[t]}${v2}${y}`
              VOWELS.push({ i, x: getSingleGlyph(), o })
            })
          })
        })
      })
    })
  })
})

export const SYMBOLS = [
  { i: '=.', x: getSingleGlyph(), o: '.' },
  { i: '=?', x: getSingleGlyph(), o: '?' },
  { i: '=!', x: getSingleGlyph(), o: '!' },
  { i: '=+', x: getSingleGlyph(), o: '+' },
  { i: '=-', x: getSingleGlyph(), o: '-' },
  { i: '>', x: getSingleGlyph(), o: '>' },
  { i: '<', x: getSingleGlyph(), o: '<' },
  { i: '/', x: getSingleGlyph(), o: '/' },
  { i: '\\', x: getSingleGlyph(), o: '\\' },
  { i: '|', x: getSingleGlyph(), o: '|' },
  { i: '(', x: getSingleGlyph(), o: '(' },
  { i: ')', x: getSingleGlyph(), o: ')' },
  { i: '[', x: getSingleGlyph(), o: '[' },
  { i: ']', x: getSingleGlyph(), o: ']' },
  { i: ' ', x: getSingleGlyph(), o: ' ' },
]

export const NUMERALS = [
  { i: '0', x: getSingleGlyph(), o: '0' },
  { i: '1', x: getSingleGlyph(), o: '1' },
  { i: '2', x: getSingleGlyph(), o: '2' },
  { i: '3', x: getSingleGlyph(), o: '3' },
  { i: '4', x: getSingleGlyph(), o: '4' },
  { i: '5', x: getSingleGlyph(), o: '5' },
  { i: '6', x: getSingleGlyph(), o: '6' },
  { i: '7', x: getSingleGlyph(), o: '7' },
  { i: '8', x: getSingleGlyph(), o: '8' },
  { i: '9', x: getSingleGlyph(), o: '9' },
]

export const CONSONANTS = [
  { i: '@', x: getSingleGlyph(), o: `@` },
  { i: 'h~', x: getSingleGlyph(), o: `ɦ` },
  { i: 'm', x: getSingleGlyph(), o: `m` },
  { i: 'N', x: getSingleGlyph(), o: `n${m.d.dot}` },
  { i: 'n', x: getSingleGlyph(), o: `n` },
  { i: 'q', x: getSingleGlyph(), o: `n${m.u.dot}` },
  { i: 'G~', x: getSingleGlyph(), o: `g${m.u.tilde}` },
  { i: 'G', x: getSingleGlyph(), o: `g${m.u.dot}` },
  { i: 'g?', x: getSingleGlyph(), o: `g${m.u.grave}` },
  { i: 'g', x: getSingleGlyph(), o: `g` },
  { i: "'", x: getSingleGlyph(), o: `'` },
  { i: 'Q', x: getSingleGlyph(), o: `q${m.u.dot}` },
  { i: 'd?', x: getSingleGlyph(), o: `d${m.d.grave}` },
  { i: 'd!', x: getSingleGlyph(), o: `d${m.d.acute}` },
  { i: 'd*', x: getSingleGlyph(), o: `d${m.d.down}` },
  { i: 'd.', x: getSingleGlyph(), o: `d${m.d.macron}` },
  { i: 'D', x: getSingleGlyph(), o: `d${m.d.dot}` },
  { i: 'dQ~', x: getSingleGlyph(), o: `d${m.d.tilde}` },
  { i: 'd', x: getSingleGlyph(), o: `d` },
  { i: 'b?', x: getSingleGlyph(), o: `b${m.d.grave}` },
  { i: 'b!', x: getSingleGlyph(), o: `b${m.d.acute}` },
  { i: 'b', x: getSingleGlyph(), o: `b` },
  { i: 'p!', x: getSingleGlyph(), o: `p${m.u.acute}` },
  { i: 'p*', x: getSingleGlyph(), o: `p${m.u.up}` },
  { i: 'p.', x: getSingleGlyph(), o: `t${m.u.macron}` },
  { i: 'p@', x: getSingleGlyph(), o: `x${m.u.down}` },
  { i: 'p', x: getSingleGlyph(), o: `p` },
  { i: 'T!', x: getSingleGlyph(), o: `t${m.d.dot}${m.d.acute}` },
  { i: 'T', x: getSingleGlyph(), o: `t${m.d.dot}` },
  { i: 't!', x: getSingleGlyph(), o: `t${m.d.acute}` },
  { i: 't*', x: getSingleGlyph(), o: `t${m.d.down}` },
  { i: 'tQ~', x: getSingleGlyph(), o: `t${m.d.tilde}` },
  { i: 't@', x: getSingleGlyph(), o: `t${m.d.up}` },
  { i: 't.', x: getSingleGlyph(), o: `t${m.d.macron}` },
  { i: 't', x: getSingleGlyph(), o: `t` },
  { i: 'k!', x: getSingleGlyph(), o: `k${m.d.acute}` },
  { i: 'k.', x: getSingleGlyph(), o: `k${m.d.macron}` },
  { i: 'k*', x: getSingleGlyph(), o: `k${m.d.down}` },
  { i: 'K!', x: getSingleGlyph(), o: `k${m.d.dot}${m.d.acute}` },
  { i: 'K', x: getSingleGlyph(), o: `k${m.d.dot}` },
  { i: 'k', x: getSingleGlyph(), o: `k` },
  { i: 'H!', x: getSingleGlyph(), o: `h${m.d.dot}${m.d.acute}` },
  { i: 'H', x: getSingleGlyph(), o: `h${m.d.dot}` },
  { i: 'h!', x: getSingleGlyph(), o: `ħ` },
  { i: 'h', x: getSingleGlyph(), o: `h` },
  { i: 'J', x: getSingleGlyph(), o: `ȷ̈` },
  { i: 'j!', x: getSingleGlyph(), o: `j${m.u.acute}` },
  { i: 'j', x: getSingleGlyph(), o: `j` },
  { i: 'S!', x: getSingleGlyph(), o: `s${m.d.dot}${m.u.acute}` },
  { i: 's!', x: getSingleGlyph(), o: `s${m.u.acute}` },
  { i: 'S', x: getSingleGlyph(), o: `s${m.d.dot}` },
  { i: 'sQ~', x: getSingleGlyph(), o: `s${m.d.tilde}` },
  { i: 's@', x: getSingleGlyph(), o: `s${m.d.up}` },
  { i: 's', x: getSingleGlyph(), o: `s` },
  { i: 'F', x: getSingleGlyph(), o: `f${m.d.dot}` },
  { i: 'f!', x: getSingleGlyph(), o: `f${m.d.acute}` },
  { i: 'f', x: getSingleGlyph(), o: `f` },
  { i: 'V', x: getSingleGlyph(), o: `v${m.d.dot}` },
  { i: 'v', x: getSingleGlyph(), o: `v` },
  { i: 'z!', x: getSingleGlyph(), o: `z${m.u.acute}` },
  { i: 'zQ~', x: getSingleGlyph(), o: `z${m.d.tilde}` },
  { i: 'z', x: getSingleGlyph(), o: `z` },
  { i: 'Z!', x: getSingleGlyph(), o: `z${m.d.dot}${m.u.acute}` },
  { i: 'Z', x: getSingleGlyph(), o: `z${m.d.dot}` },
  { i: 'CQ~', x: getSingleGlyph(), o: `c${m.d.dot}${m.u.tilde}` },
  { i: 'C', x: getSingleGlyph(), o: `c${m.d.dot}` },
  { i: 'cQ~', x: getSingleGlyph(), o: `c${m.u.tilde}` },
  { i: 'c', x: getSingleGlyph(), o: `c` },
  { i: 'L', x: getSingleGlyph(), o: `l${m.d.dot}` },
  { i: 'l*', x: getSingleGlyph(), o: `l${m.d.down}` },
  { i: 'lQ~', x: getSingleGlyph(), o: `l${m.d.tilde}` },
  { i: 'l', x: getSingleGlyph(), o: `l` },
  { i: 'R', x: getSingleGlyph(), o: `r${m.d.dot}` },
  { i: 'rQ~', x: getSingleGlyph(), o: `r${m.u.tilde}` },
  { i: 'r', x: getSingleGlyph(), o: `r${m.u.dot}` },
  { i: 'x!', x: getSingleGlyph(), o: `x${m.u.acute}` },
  { i: 'X!', x: getSingleGlyph(), o: `x${m.d.dot}${m.u.acute}` },
  { i: 'X', x: getSingleGlyph(), o: `x${m.d.dot}` },
  { i: 'x@', x: getSingleGlyph(), o: `x${m.d.up}` },
  { i: 'x', x: getSingleGlyph(), o: `x` },
  { i: 'W', x: getSingleGlyph(), o: `w${m.u.dot}` },
  { i: 'w!', x: getSingleGlyph(), o: `w${m.u.acute}` },
  { i: 'w~', x: getSingleGlyph(), o: `w${m.d.dot}` },
  { i: 'w', x: getSingleGlyph(), o: `w` },
  { i: 'y~', x: getSingleGlyph(), o: `y${m.u.dot}` },
  { i: 'y', x: getSingleGlyph(), o: `y` },
]

export const GLYPHS = [
  ...VOWELS,
  ...CONSONANTS,
  ...SYMBOLS,
  ...NUMERALS,
]

// eslint-disable-next-line @typescript-eslint/no-unsafe-assignment, @typescript-eslint/no-unsafe-member-access
const tree = st.fork(GLYPHS)
// eslint-disable-next-line @typescript-eslint/no-unsafe-return, @typescript-eslint/no-unsafe-member-access
const make = (text: string): string => st.form(text, tree)

make.inputs = (text: string): Array<string> =>
  // eslint-disable-next-line @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-return
  st.list(text, tree).map((x: any) => x.i)

make.readableOutput = (text: string): Array<string> =>
  // eslint-disable-next-line @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-return
  st.list(text, tree).map((x: any) => x.o)

make.readable = (text: string): string =>
  make.readableOutput(text).join('')

make.machineOutputs = (text: string): Array<string> =>
  // eslint-disable-next-line @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-return
  st.list(text, tree).map((x: any) => x.x)

make.machine = (text: string): string =>
  make.machineOutputs(text).join('')

export default make

function getSingleGlyph() {
  if (X === Xf) {
    throw new Error(`Too many glyphs created`)
  }

  return String.fromCodePoint(X++)
}

How can I make a hashing function which takes my ASCII strings in my array (GLYPHS array above, in the i field), and maps it to a Hangul Jamo Syllable, in such a way that, if in the future I add more items to the GLYPHS array (in arbitrary spots), the old mappings will stay the same, and the new ones won't create any conflicts? Is it possible? If so, how to do? If it's not possible, why is it not possible? In that case I might have to resort to manually selecting Hangul glyphs by hand and setting them, but that will be a pain to do and pain to maintain. Would like a hashing solution so I don't need to do it by hand, but all the hashing solutions I've thought about so far seems to have the problem that if I add new items later, the hashing values will all change. Can that be avoided?

Note: There are about 11,000 Hangul Jamo Syllables in that unicode block. I will have much less than that number of input ASCII strings, so we won't run out of space.

Share Improve this question asked 17 hours ago Lance PollardLance Pollard 79k97 gold badges326 silver badges601 bronze badges 4
  • Maybe an army of function generators? – zer00ne Commented 15 hours ago
  • There are >800,000 3-character ASCII strings which guarantees that any hash function will have collisions on a target range of 11,000. – stark Commented 14 hours ago
  • @stark Right but I will only have slightly more than my last code snippet, which is probably well less than 2000. – Lance Pollard Commented 13 hours ago
  • 1 If you know in advance what strings you will encode, then use a dictionary so you don't have collisions. Otherwise, expect collisions. – stark Commented 5 hours ago
Add a comment  | 

1 Answer 1

Reset to default 0

Your requirements that:

  • hashes are stable against expansion of the key set;
  • the order in which keys are added doesn't affect the hash values; and
  • no collisions

... make it impossible to get what you want.

Because any key might be the first one you add to the key set, your function would have to be able to generate a unique hash for any key individually.

There are more possible keys than hashes, so that's not gonna happen.

最新文章