# Error when handling partial terminal · Issue #227 · structuredllm/syncode
https://github.com/structuredllm/syncode/issues/227
**Gompyn** opened this issue on 2025-08-20 · 1 comment

I'm trying to use syncode to generate Java programs, but it fails when given a valid token. I have reduced the problem to the reproduction code below.

The test grammar has four kinds of terminals: `WS`, `NUMBER`, `BLOCK_COMMENT`, and operators such as `+` or `*`. The grammar allows only simple arithmetic expressions, and both `WS` and `BLOCK_COMMENT` are ignored.

When the text starts with `/**/`, which by itself is a legal `BLOCK_COMMENT`, the tokenizer splits it into `/**` and `/`. If the model then selects the `/**` token, it is not yet a complete `BLOCK_COMMENT`, so it is lexed into `/`, `*` and `*`, which causes an error like this:

```text
[YYYY-MM-DD hh:mm:ss,sss -syncode.grammar_mask.grammar_constrainer] - Parsing failed!
Partial code: /**
Parsed lexical tokens: [Token('SLASH', '/')]
[YYYY-MM-DD hh:mm:ss,sss -syncode.grammar_mask.grammar_constrainer] - --------------------------------------------------
Traceback (most recent call last):
  File ".../lib/python3.11/site-packages/syncode/larkm/parsers/lalr_parser_state.py", line 77, in feed_token
    action, arg = states[state][token.type]
                  ~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'SLASH'
```

The root cause is that terminals are lexed purely from the visible partial code, with no account of the possibility that the partial code is a prefix of a longer terminal. When one terminal fails to match the partial code (`BLOCK_COMMENT` in this case), the lexer falls back to matching other terminals. This can be seen in the following `match` method of the `Scanner` class, which does the actual lexing of the partial code. This behavior is perfectly fine in the `lark` library, where the input is almost always complete, but it is not acceptable for a constrained-decoding method like `syncode`.

https://github.com/structuredllm/syncode/blob/8afb425570440afcaada961b2da483867d9f2ff8/syncode/larkm/lexer.py#L387-L391

### Reproduction code

```python
from syncode.grammar_mask.grammar_constrainer import GrammarConstrainer
from syncode.parsers.grammars import Grammar
from syncode.mask_store.byte_tokenizer import ByteTokenizer

from transformers import AutoTokenizer
import torch

class FakeGrammar(Grammar):
    def __init__(self, name: str, ebnf: str):
        self.ebnf = ebnf
        self.name = name

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
grammar = FakeGrammar('custom',
r'''start: sum
sum: sum ("+" | "-") term | term
term: term ("*" | "/") factor | factor
factor: "(" sum ")" | NUMBER
NUMBER: /[0-9]+/
BLOCK_COMMENT: /\/\*.*?\*\//s
%ignore BLOCK_COMMENT
%import common.WS
%ignore WS
''')
byte_tokenizer = ByteTokenizer(tokenizer)
grammar_constrainer = GrammarConstrainer(
    grammar=grammar,
    tokenizer=tokenizer,  # type: ignore
    byte_tokenizer=byte_tokenizer,
    use_cache=True,
    parse_output_only=True,
    batch_size=1,
    dev_mode=True,
    parser='lalr',
    mode='grammar_strict',
    indent=False
)

vocab_size = len(tokenizer.get_vocab())

def process_logit(input_ids: list[int], logit: torch.Tensor) -> torch.Tensor:
    return grammar_constrainer.mask_scores(torch.tensor([input_ids]), logit.unsqueeze(0)).squeeze(0)  # type: ignore

def process_tokens(tokens: list[int]):
    for i in range(len(tokens)):
        logit = torch.zeros((vocab_size,), dtype=torch.float)
        visible_tokens = tokens[:i]
        masked_logit = process_logit(visible_tokens, logit)
        assert masked_logit[tokens[i]] != float('-inf'), f'token {i} ({tokens[i]}, {tokenizer.decode(tokens[i])!r}) is masked'

text = '/**/1+1'
tokens = tokenizer.encode(text)
print(tokens)
process_tokens(tokens)
```
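The lexing behavior described above can be reproduced outside syncode with a plain longest-match lexer over the same terminals. This is a minimal sketch, not syncode's `Scanner`; the names for the anonymous operator terminals are made up for illustration, not the names lark generates:

```python
import re

# Terminals from the reproduction grammar (operator names are illustrative).
TERMINALS = [
    ("BLOCK_COMMENT", re.compile(r"/\*.*?\*/", re.S)),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("PLUS", re.compile(r"\+")),
    ("MINUS", re.compile(r"-")),
    ("STAR", re.compile(r"\*")),
    ("SLASH", re.compile(r"/")),
    ("WS", re.compile(r"\s+")),
]

def naive_lex(code: str) -> list[tuple[str, str]]:
    """Longest-match lexing with no notion of 'prefix of a terminal'."""
    pos, out = 0, []
    while pos < len(code):
        best = None
        for name, rx in TERMINALS:
            m = rx.match(code, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no terminal matches at position {pos}")
        out.append((best[0], code[pos:best[1]]))
        pos = best[1]
    return out

# The complete text lexes as expected:
assert naive_lex("/**/1+1") == [
    ("BLOCK_COMMENT", "/**/"), ("NUMBER", "1"), ("PLUS", "+"), ("NUMBER", "1"),
]
# But the partial text '/**' falls apart, exactly as in the issue:
assert naive_lex("/**") == [("SLASH", "/"), ("STAR", "*"), ("STAR", "*")]
```

Because `BLOCK_COMMENT` has no match yet (the closing `*/` has not been generated), the lexer silently commits to `SLASH` and `STAR` tokens that the LALR parser then rejects.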
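A fix would need the lexer to distinguish "this text is not a `BLOCK_COMMENT`" from "this text could still grow into one". Python's `re` module cannot report that a string is a proper prefix of a possible match, so below is a hand-rolled three-way classifier for the `BLOCK_COMMENT` shape; it is a sketch of the idea only, not syncode's actual `Scanner` API:

```python
import re
from enum import Enum

class Fit(Enum):
    COMPLETE = "text is exactly one BLOCK_COMMENT"
    PARTIAL = "text is a proper prefix of a BLOCK_COMMENT"
    REJECT = "text can never be extended into one BLOCK_COMMENT"

def classify_block_comment(text: str) -> Fit:
    r"""Three-way match against BLOCK_COMMENT (/\*.*?\*/ with DOTALL).

    Models the lexer's shortest-match semantics: the lazy body ends at
    the first '*/', so anything after that close is a separate token.
    """
    if text == "":
        return Fit.PARTIAL            # empty string starts everything
    if text[0] != "/":
        return Fit.REJECT
    if len(text) == 1:
        return Fit.PARTIAL            # '/' could still become '/*...'
    if text[1] != "*":
        return Fit.REJECT
    close = text.find("*/", 2)        # first possible comment close
    if close == -1:
        return Fit.PARTIAL            # still inside the comment body
    if close + 2 == len(text):
        return Fit.COMPLETE           # ends exactly at the first '*/'
    return Fit.REJECT                 # extra characters after the close

assert re.fullmatch(r"/\*.*?\*/", "/**", re.S) is None  # not a full comment...
assert classify_block_comment("/**") is Fit.PARTIAL     # ...but a prefix of one
assert classify_block_comment("/**/") is Fit.COMPLETE
assert classify_block_comment("1+1") is Fit.REJECT
```

With this distinction available, a prefix-aware scanner could hold the dangling `/**` as an in-progress terminal instead of re-lexing it as `/`, `*`, `*` and feeding a spurious `SLASH` to the parser.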