Title: Index error in docker-compose article-relevance-prediction. · Issue #103 · NeotomaDB/MetaExtractor · GitHub
Description: Running the docker compose in the root directory I am now running into a new error with indices: simon@partyLaptop:~/Documents/Neotoma/MetaExtractor$ docker-compose up article-relevance-prediction Starting metaextractor_article-relevance...
Opengraph URL: https://github.com/NeotomaDB/MetaExtractor/issues/103
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Index error in docker-compose article-relevance-prediction.","articleBody":"Running the docker compose in the root directory I am now running into a new error with indices:\r\n\r\n```\r\nsimon@partyLaptop:~/Documents/Neotoma/MetaExtractor$ docker-compose up article-relevance-prediction\r\nStarting metaextractor_article-relevance-prediction_1 ... done\r\nAttaching to metaextractor_article-relevance-prediction_1\r\narticle-relevance-prediction_1 | 2023-07-24 18:04:08,781 - gdd_api_query.py:113 - get_new_gdd_articles - INFO - Querying by n_recent = 1000\r\narticle-relevance-prediction_1 | 2023-07-24 18:04:09,379 - gdd_api_query.py:151 - get_new_gdd_articles - INFO - 1000 articles queried from GeoDeepDive (page 1).\r\narticle-relevance-prediction_1 | 2023-07-24 18:04:09,379 - gdd_api_query.py:174 - get_new_gdd_articles - INFO - GeoDeepDive query completed.\r\narticle-relevance-prediction_1 | 2023-07-24 18:04:09,854 - gdd_api_query.py:197 - get_new_gdd_articles - INFO - 1000 articles returned from GeoDeepDive.\r\narticle-relevance-prediction_1 | 2023-07-24 18:04:12,763 - relevance_prediction_parquet.py:57 - crossref_extract - INFO - Running crossref_extract function.\r\narticle-relevance-prediction_1 | 2023-07-24 18:04:12,766 - relevance_prediction_parquet.py:77 - crossref_extract - INFO - Querying CrossRef API for article metadata.\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,843 - relevance_prediction_parquet.py:98 - crossref_extract - INFO - CrossRef API query completed for 1000 articles.\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,877 - relevance_prediction_parquet.py:164 - data_preprocessing - INFO - Prediction data preprocessing begin.\r\narticle-relevance-prediction_1 | /app/src/article_relevance/relevance_prediction_parquet.py:178: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a 
new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`\r\narticle-relevance-prediction_1 | metadata_df.loc[valid_condition, 'has_abstract'] = metadata_df.loc[valid_condition, \"abstract\"].isnull()\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,896 - relevance_prediction_parquet.py:189 - data_preprocessing - INFO - Running article language imputation.\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:201 - data_preprocessing - INFO - 81 articles require language imputation\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:203 - data_preprocessing - INFO - 81 cannot be imputed due to too short text metadata(title, subtitle and abstract less than 5 character).\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,905 - relevance_prediction_parquet.py:213 - data_preprocessing - INFO - Missing language imputation completed\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,906 - relevance_prediction_parquet.py:214 - data_preprocessing - INFO - After imputation, there are 1000 non-English articles in total excluded from the prediction pipeline.\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:238 - data_preprocessing - INFO - 0 articles has missing feature and its relevance cannot be predicted.\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:239 - data_preprocessing - INFO - Data preprocessing completed.\r\narticle-relevance-prediction_1 | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:257 - add_embeddings - INFO - Sentence embedding start.\r\nDownloading (…)2c72f/.gitattributes: 100%|██████████| 1.48k/1.48k [00:00\u003c00:00, 3.53MB/s]\r\nDownloading (…)be7662c72f/README.md: 100%|██████████| 8.09k/8.09k [00:00\u003c00:00, 22.2MB/s]\r\nDownloading (…)7662c72f/config.json: 
100%|██████████| 754/754 [00:00\u003c00:00, 2.62MB/s]\r\nDownloading pytorch_model.bin: 100%|██████████| 440M/440M [00:10\u003c00:00, 40.7MB/s] \r\nDownloading (…)cial_tokens_map.json: 100%|██████████| 125/125 [00:00\u003c00:00, 810kB/s]\r\nDownloading (…)2c72f/tokenizer.json: 100%|██████████| 717k/717k [00:00\u003c00:00, 5.29MB/s]\r\nDownloading (…)okenizer_config.json: 100%|██████████| 453/453 [00:00\u003c00:00, 1.59MB/s]\r\nDownloading (…)be7662c72f/vocab.txt: 100%|██████████| 228k/228k [00:00\u003c00:00, 3.56MB/s]\r\narticle-relevance-prediction_1 | No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/allenai_specter2. Creating a new one with MEAN pooling.\r\narticle-relevance-prediction_1 | 2023-07-24 18:11:09,041 - relevance_prediction_parquet.py:275 - add_embeddings - INFO - Sentence embedding completed.\r\narticle-relevance-prediction_1 | 2023-07-24 18:11:09,050 - relevance_prediction_parquet.py:294 - relevance_prediction - INFO - Prediction start.\r\narticle-relevance-prediction_1 | 2023-07-24 18:11:09,064 - relevance_prediction_parquet.py:307 - relevance_prediction - INFO - Running prediction for 0 articles.\r\narticle-relevance-prediction_1 | Traceback (most recent call last):\r\narticle-relevance-prediction_1 | File \"/app/src/article_relevance/relevance_prediction_parquet.py\", line 456, in \u003cmodule\u003e\r\narticle-relevance-prediction_1 | main()\r\narticle-relevance-prediction_1 | File \"/app/src/article_relevance/relevance_prediction_parquet.py\", line 445, in main\r\narticle-relevance-prediction_1 | predicted = relevance_prediction(embedded, model_path, predict_thld = 0.5)\r\narticle-relevance-prediction_1 | File \"/app/src/article_relevance/relevance_prediction_parquet.py\", line 311, in relevance_prediction\r\narticle-relevance-prediction_1 | nan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)\r\narticle-relevance-prediction_1 | File 
\"/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py\", line 1067, in __getitem__\r\narticle-relevance-prediction_1 | return self._getitem_tuple(key)\r\narticle-relevance-prediction_1 | File \"/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py\", line 1256, in _getitem_tuple\r\narticle-relevance-prediction_1 | return self._getitem_tuple_same_dim(tup)\r\narticle-relevance-prediction_1 | File \"/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py\", line 924, in _getitem_tuple_same_dim\r\narticle-relevance-prediction_1 | retval = getattr(retval, self.name)._getitem_axis(key, axis=i)\r\narticle-relevance-prediction_1 | File \"/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py\", line 1301, in _getitem_axis\r\narticle-relevance-prediction_1 | return self._getitem_iterable(key, axis=axis)\r\narticle-relevance-prediction_1 | File \"/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py\", line 1239, in _getitem_iterable\r\narticle-relevance-prediction_1 | keyarr, indexer = self._get_listlike_indexer(key, axis)\r\narticle-relevance-prediction_1 | File \"/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py\", line 1432, in _get_listlike_indexer\r\narticle-relevance-prediction_1 | keyarr, indexer = ax._get_indexer_strict(key, axis_name)\r\narticle-relevance-prediction_1 | File \"/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py\", line 6070, in _get_indexer_strict\r\narticle-relevance-prediction_1 | self._raise_if_missing(keyarr, indexer, axis_name)\r\narticle-relevance-prediction_1 | File \"/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py\", line 6133, in _raise_if_missing\r\narticle-relevance-prediction_1 | raise KeyError(f\"{not_found} not in index\")\r\narticle-relevance-prediction_1 | KeyError: \"['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', 
'30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '220', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '292', '293', '294', '295', '296', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', 
'325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '341', '342', '343', '344', '345', '346', '347', '348', '349', '350', '351', '352', '353', '354', '355', '356', '357', '358', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '380', '381', '382', '383', '384', '385', '386', '387', '388', '389', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '423', '424', '425', '426', '427', '428', '429', '430', '431', '432', '433', '434', '435', '436', '437', '438', '439', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '490', '491', '492', '493', '494', '495', '496', '497', '498', '499', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '510', '511', '512', '513', '514', '515', '516', '517', '518', '519', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '540', '541', '542', '543', '544', '545', '546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '558', '559', '560', '561', '562', '563', '564', '565', '566', '567', '568', '569', '570', '571', '572', '573', '574', '575', '576', '577', '578', '579', '580', '581', '582', '583', '584', '585', '586', '587', '588', '589', '590', '591', '592', '593', '594', '595', '596', '597', '598', '599', '600', '601', '602', '603', '604', '605', '606', '607', '608', '609', 
'610', '611', '612', '613', '614', '615', '616', '617', '618', '619', '620', '621', '622', '623', '624', '625', '626', '627', '628', '629', '630', '631', '632', '633', '634', '635', '636', '637', '638', '639', '640', '641', '642', '643', '644', '645', '646', '647', '648', '649', '650', '651', '652', '653', '654', '655', '656', '657', '658', '659', '660', '661', '662', '663', '664', '665', '666', '667', '668', '669', '670', '671', '672', '673', '674', '675', '676', '677', '678', '679', '680', '681', '682', '683', '684', '685', '686', '687', '688', '689', '690', '691', '692', '693', '694', '695', '696', '697', '698', '699', '700', '701', '702', '703', '704', '705', '706', '707', '708', '709', '710', '711', '712', '713', '714', '715', '716', '717', '718', '719', '720', '721', '722', '723', '724', '725', '726', '727', '728', '729', '730', '731', '732', '733', '734', '735', '736', '737', '738', '739', '740', '741', '742', '743', '744', '745', '746', '747', '748', '749', '750', '751', '752', '753', '754', '755', '756', '757', '758', '759', '760', '761', '762', '763', '764', '765', '766', '767'] not in index\"\r\nmetaextractor_article-relevance-prediction_1 exited with code 1\r\n```\r\n\r\nWhich seems to be coming from this line:\r\n\r\n```python\r\nnan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)\r\n```\r\n\r\nin the `relevance_prediction()` function.\r\n\r\nI'll try a bit of debugging to see why it's popping up.","author":{"url":"https://github.com/SimonGoring","@type":"Person","name":"SimonGoring"},"datePublished":"2023-07-24T18:16:21.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":2},"url":"https://github.com/NeotomaDB/MetaExtractor/issues/103"}
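
One possible reading of the log above, offered as a guess rather than a diagnosis: preprocessing reports that all 1000 articles were excluded, prediction then runs for 0 articles, and the 768 embedding columns named `'0'` to `'767'` may never have been added to the frame, so `.loc[:, feature_col]` raises. The sketch below reproduces that `KeyError` on a toy frame and shows one hypothetical guard. The helper `select_features` and the `DOI` column are illustrative names and are not taken from MetaExtractor.

```python
import pandas as pd

# Assumed stand-in for the 768 embedding column names listed in the KeyError.
feature_col = [str(i) for i in range(768)]


def select_features(valid_df, feature_col):
    """Hypothetical guard: return the feature columns, or an empty frame
    with the expected columns when there is nothing to predict on."""
    missing = [c for c in feature_col if c not in valid_df.columns]
    if valid_df.empty or missing:
        # Zero rows, but the expected columns, so downstream code still works.
        return pd.DataFrame(columns=feature_col)
    return valid_df.loc[:, feature_col]


# Toy frame: zero valid articles and no embedding columns.
valid_df = pd.DataFrame({"DOI": pd.Series(dtype="object")})

# Selecting a list of labels that are absent from the columns raises KeyError.
try:
    valid_df.loc[:, feature_col]
    raised = False
except KeyError:
    raised = True

# With the guard, the null check from the issue runs on an empty frame.
features = select_features(valid_df, feature_col)
nan_exists = features.isnull().any(axis=1)
```

Whether the pipeline should return early, log a warning, or write an empty output file when no articles survive preprocessing is a design decision for the maintainers; the guard only shows that the indexing step itself can be made safe.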